Go Pitfalls: Strings

[toc]

Problems caused by iteration

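In Go, string is a built-in type: an immutable sequence of bytes that by convention holds UTF-8 encoded text. An ASCII character takes 1 byte, and other characters take 2 to 4 bytes as needed. A Chinese character, for example, usually takes 3 bytes.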
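As a quick illustration of these sizes, the sketch below compares the byte length of a few strings with their number of code points. The helper name byteAndRuneCount is made up for this example; len and utf8.RuneCountInString are standard library calls.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount is a hypothetical helper for this example. It returns
// the length of s in bytes and the number of Unicode code points it holds.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// One ASCII letter, one accented letter, one Chinese character,
	// and the sample string used in this article.
	for _, s := range []string{"a", "ê", "中", "hêllo"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d byte(s), %d rune(s)\n", s, b, r)
	}
}
```

Note that len always counts bytes, so it only matches the number of characters when the string is pure ASCII.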

Because of this, iterating over a string can produce surprising results:

s := "hêllo"
for i := range s {
    // s[i] is a single byte, not necessarily a whole character
    fmt.Printf("position %d: %c\n", i, s[i])
}
fmt.Printf("len=%d\n", len(s)) // len counts bytes

Output:

position 0: h
position 1: Ã
position 3: l
position 4: l
position 5: o
len=6

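In the output above, the second character is Ã rather than ê, and position 2 seems to have "disappeared". The reason is that ê takes 2 bytes in UTF-8, as the hexadecimal byte values below show: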

s	h	ê	l	l	o
[]byte(s)	68	c3 aa	6c	6c	6f

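So during iteration s[1] is the single byte c3, and %c interprets that byte value as the code point U+00C3, which is Ã. That is why the loop prints hÃllo instead of hêllo. The index jumps from 1 to 3 because range advances one rune at a time, and ê occupies bytes 1 and 2.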
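The byte-level view can be checked directly. The sketch below prints the raw bytes, then decodes the rune that starts at byte offset 1. The helper runeAt is a name invented for this example; it wraps the standard utf8.DecodeRuneInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt is a hypothetical helper for this example. It decodes the UTF-8
// sequence starting at byte offset i of s and returns the rune together
// with its width in bytes.
func runeAt(s string, i int) (rune, int) {
	return utf8.DecodeRuneInString(s[i:])
}

func main() {
	s := "hêllo"

	// Raw bytes in hex.
	fmt.Printf("% x\n", s) // 68 c3 aa 6c 6c 6f

	// A lone byte 0xc3 printed with %c is treated as code point U+00C3.
	fmt.Printf("%c\n", s[1]) // Ã

	// Decoding from offset 1 consumes both bytes and yields the real character.
	r, size := runeAt(s, 1)
	fmt.Printf("%c takes %d bytes\n", r, size) // ê takes 2 bytes
}
```

Decoding from offset 2 would start in the middle of the sequence, and utf8.DecodeRuneInString reports that case as utf8.RuneError with a width of 1.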

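From this analysis, when iterating over characters we should not read a single byte with s[i]. Instead, use the value that range returns, which is the decoded rune (the index i is still the byte offset):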

s := "hêllo"
for i, v := range s {
    fmt.Printf("position %d: %c\n", i, v)
}

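Alternatively, convert the string to a slice of runes. In Go, a rune represents a Unicode code point, so each element of the slice is one code point rather than one byte: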

s := "hêllo"
runes := []rune(s) // the conversion copies the string into a new slice
for i := range runes {
    fmt.Printf("position %d: %c\n", i, runes[i])
}

Output (the position is now the rune index rather than the byte offset):

position 0: h
position 1: ê
position 2: l
position 3: l
position 4: o
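To make the difference between the two approaches concrete, here is a small sketch that collects the indices produced by ranging over the string itself and compares them with the length of the rune slice. The helper byteOffsets is a name made up for this example.

```go
package main

import "fmt"

// byteOffsets is a hypothetical helper for this example. It returns the
// byte offset at which each rune of s starts, which is the index that
// `for i := range s` yields.
func byteOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "hêllo"

	// Ranging over the string: indices are byte offsets, so 2 is skipped.
	fmt.Println(byteOffsets(s)) // [0 1 3 4 5]

	// Ranging over []rune(s): indices are consecutive rune positions.
	fmt.Println(len([]rune(s))) // 5
}
```

Ranging over the string directly avoids the extra copy, so it is a reasonable default when the byte offset is acceptable. The rune slice is more convenient when you need random access by character position.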

Problems caused by slicing

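When you slice a slice with the : operator in Go, the result still points to the same underlying array. The same applies to strings: a substring shares the backing bytes of the original string. Consider the following: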

func (s store) handleLog(log string) error {
    if len(log) < 36 {
        return errors.New("log is not correctly formatted")
    }
    uuid := log[:36] // shares the backing bytes of log
    s.store(uuid)
    // Do something
    return nil
}

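This code takes a substring with the : operator. If log is large and the store method keeps uuid in memory for a long time, the 36-byte uuid keeps the entire backing array of log reachable, so the garbage collector cannot free it. The result is effectively a memory leak.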

To fix this, copy the substring before storing it, for example with strings.Clone (available since Go 1.18):

func (s store) handleLog(log string) error {
    if len(log) < 36 {
        return errors.New("log is not correctly formatted")
    }
    uuid := strings.Clone(log[:36]) // make an independent copy
    s.store(uuid)
    // Do something
    return nil
}