Garbage collection and correct usage of pointers in Go

I come from a Python/Ruby/JavaScript background. I understand how pointers work; however, I'm not completely sure how to leverage them in the following situation.

Let's pretend we have a fictitious web API that searches an image database and returns JSON describing what's displayed in each image that was found:

[
    {
        "url": "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
        "description": "Ocean islands",
        "tags": [
            {"name":"ocean", "rank":1},
            {"name":"water", "rank":2},
            {"name":"blue", "rank":3},
            {"name":"forest", "rank":4}
        ]
    },

    ...

    {
        "url": "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
        "description": "Bridge over river",
        "tags": [
            {"name":"bridge", "rank":1},
            {"name":"river", "rank":2},
            {"name":"water", "rank":3},
            {"name":"forest", "rank":4}
        ]
    }
]

My goal is to create a data structure in Go that maps each tag to a list of image URLs, like this:

{
    "ocean": [
        "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
    ],
    "water": [
        "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
        "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
    ],
    "blue": [
        "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
    ],
    "forest":[
        "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg", 
        "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
    ],
    "bridge": [
        "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
    ],
    "river":[
        "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
    ]
}

As you can see, each image URL can belong to multiple tags at the same time. If I have thousands of images and even more tags, this data structure can grow very large if the image URL strings are copied by value for each tag. This is where I want to leverage pointers.

I can represent the JSON API response with two structs in Go; func searchImages() mimics the fake API:

package main

import "fmt"


type Image struct {
    URL string
    Description string
    Tags []*Tag
}

type Tag struct {
    Name string
    Rank int
}

// this function mimics json.NewDecoder(resp.Body).Decode(&parsedJSON)
func searchImages() []*Image {
    parsedJSON := []*Image{
        &Image {
            URL: "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
            Description: "Ocean islands",
            Tags: []*Tag{
                &Tag{"ocean", 1},
                &Tag{"water", 2},
                &Tag{"blue", 3},
                &Tag{"forest", 4},
            }, 
        },
        &Image {
            URL: "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg",
            Description: "Bridge over river",
            Tags: []*Tag{
                &Tag{"bridge", 1},
                &Tag{"river", 2},
                &Tag{"water", 3},
                &Tag{"forest", 4},
            }, 
        },
    }
    return parsedJSON
}

Now, the less optimal mapping function, which results in a very large in-memory data structure, can look like this:

func main() {
    result := searchImages()

    tagToUrlMap := make(map[string][]string)

    for _, image := range result {
        for _, tag := range image.Tags {
            // fmt.Println(image.URL, tag.Name)
            tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], image.URL)
        }
    }

    fmt.Println(tagToUrlMap)
}

I can modify it to use pointers to the Image struct's URL field instead of copying it by value:

    // Version 1

    tagToUrlMap := make(map[string][]*string)

    for _, image := range result {
        for _, tag := range image.Tags {
            // fmt.Println(image.URL, tag.Name)
            tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &image.URL)
        }
    }

It works, and my first question is: what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow while the rest of the result is garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?

Another way to do this would be to copy the URL to an intermediate variable and use a pointer to it instead:

    // Version 2

    tagToUrlMap := make(map[string][]*string)

    for _, image := range result {
        imageUrl := image.URL // declare a fresh copy per image
        for _, tag := range image.Tags {
            // fmt.Println(image.URL, tag.Name)    
            tagToUrlMap[tag.Name] = append(tagToUrlMap[tag.Name], &imageUrl)
        }
    }

Is this better? Will the result data structure be garbage collected correctly?

Or perhaps I should use a pointer to string in the Image struct instead?

type Image struct {
    URL *string
    Description string
    Tags []*Tag
}

Is there a better way to do this? I would also appreciate any resources on Go that describe the various uses of pointers in depth. Thanks!

https://play.golang.org/p/VcKWUYLIpH7

UPDATE: What I'm most concerned about is optimal memory consumption and not generating unwanted garbage. My goal is to use the minimal amount of memory possible.

Foreword: I released the string pool presented here in my github.com/icza/gox library, see stringsx.Pool.

First, some background. string values in Go are represented by a small struct-like data structure, reflect.StringHeader:

type StringHeader struct {
        Data uintptr
        Len  int
}

So passing or copying a string value passes or copies this small struct value, which is only 2 words regardless of the length of the string. On 64-bit architectures that is only 16 bytes, even if the string has a thousand characters.
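The size claim can be checked with unsafe.Sizeof. The following is a minimal sketch; the helper names headerSize and wordSize are made up for this illustration and are not part of the original answer.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// headerSize returns the size in bytes of the string value itself
// (the header), not of the bytes it refers to.
// Hypothetical helper, named for this example only.
func headerSize(s string) uintptr {
	return unsafe.Sizeof(s)
}

// wordSize returns the size of one machine word (a pointer) in bytes.
func wordSize() uintptr {
	return unsafe.Sizeof(uintptr(0))
}

func main() {
	short := "abc"
	long := strings.Repeat("x", 1000)
	p := &long

	// Both headers have the same size, whatever the string length.
	fmt.Println(headerSize(short), headerSize(long))
	// A *string is a single word, so it saves one word per copy at most.
	fmt.Println(unsafe.Sizeof(p), wordSize())
}
```

So replacing a string with a *string saves at most one word per stored value, at the cost of an extra indirection on every access.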

So string values already act much like pointers. Introducing another pointer like *string just complicates usage, and you won't gain any noticeable memory. For the sake of memory optimization, forget about using *string.

It works, and my first question is: what happens to the result data structure after I build the mapping in this way? Will the Image URL string fields be left in memory somehow while the rest of the result is garbage collected? Or will the result data structure stay in memory until the end of the program because something points to its members?

If you have a pointer value pointing to a field of a struct value, then the whole struct will be kept in memory; it can't be garbage collected. Note that although it would be possible to release the memory reserved for the other fields of the struct, the current Go runtime and garbage collector do not do so. So to achieve optimal memory usage, you should forget about storing addresses of struct fields (unless you also need the complete struct values; even then, storing field addresses and slice/array element addresses always requires care).
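One way to observe this is with a finalizer. The sketch below is an illustration under assumptions, not part of the original answer: the trimmed-down Image type and the function collectedWhileFieldReferenced are hypothetical. It sets a finalizer on a struct, drops the direct reference, keeps only a pointer to one field, and checks whether the struct was finalized after a forced GC.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// Image is a reduced version of the struct from the question.
type Image struct {
	URL         string
	Description string
}

// collectedWhileFieldReferenced reports whether the Image was finalized
// while a pointer to its URL field was still held.
// Hypothetical helper, named for this example only.
func collectedWhileFieldReferenced() bool {
	var finalized int32

	img := &Image{URL: "https://example.com/a.jpg", Description: "demo"}
	runtime.SetFinalizer(img, func(*Image) { atomic.StoreInt32(&finalized, 1) })

	urlPtr := &img.URL // interior pointer into the struct
	img = nil          // drop the only direct reference

	runtime.GC()
	time.Sleep(10 * time.Millisecond) // give a finalizer goroutine a chance to run

	result := atomic.LoadInt32(&finalized) == 1
	runtime.KeepAlive(urlPtr) // the field pointer is live up to this point
	return result
}

func main() {
	// The struct stays reachable through the field pointer,
	// so its finalizer cannot have run.
	fmt.Println(collectedWhileFieldReferenced())
}
```

Because the whole Image is still reachable through urlPtr, the Description field stays in memory too, even though nothing refers to it directly.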

The reason is that memory for a struct value is allocated as one contiguous segment, so keeping only a single referenced field would strongly fragment the available / free memory and would make optimal memory management even harder and less efficient. Defragmenting such areas would also require copying the referenced field's memory area, which would require "live-changing" pointer values (changing memory addresses).

So while using pointers to string values may save you a tiny bit of memory, the added complexity and additional indirections make it not worth it.

So what to do then?

"Optimal" solution

So the cleanest way is to keep using string values.

And there is one more optimization we didn't talk about earlier.

You get your results by unmarshaling a JSON API response. This means that if the same URL or tag value is included multiple times in the JSON response, different string values will be created for them.

What does this mean? If you have the same URL twice in the JSON response, then after unmarshaling you will have 2 distinct string values containing 2 different pointers that point to 2 different allocated byte sequences (whose content is otherwise identical). The encoding/json package does not do string interning.

Here's a little app that proves this:

var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
    panic(err)
}

for i := range s {
    hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
    fmt.Println(hdr.Data)
}

Output of the above (try it on the Go Playground):

273760312
273760315
273760320

We see 3 different pointers. They could be the same, as string values are immutable.
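Since Go 1.20, reflect.StringHeader is deprecated and unsafe.StringData is the supported way to get at a string's data pointer. The following sketch repeats the check with it, assuming Go 1.20 or newer; the helpers decode and distinctData are names invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unsafe"
)

// decode unmarshals a JSON array of strings.
// Hypothetical helper, named for this example only.
func decode(data string) []string {
	var s []string
	if err := json.Unmarshal([]byte(data), &s); err != nil {
		panic(err)
	}
	return s
}

// distinctData counts how many different backing byte sequences
// the given (non-empty) strings point to.
// unsafe.StringData requires Go 1.20 or newer.
func distinctData(ss []string) int {
	seen := make(map[*byte]struct{})
	for _, s := range ss {
		seen[unsafe.StringData(s)] = struct{}{}
	}
	return len(seen)
}

func main() {
	s := decode(`["abc", "abc", "abc"]`)
	// Each decoded element has its own copy of the bytes.
	fmt.Println(distinctData(s))

	// Copies of one string value share the same bytes.
	x := s[0]
	fmt.Println(distinctData([]string{x, x, x}))
}
```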

The json package does not detect repeating string values because the detection adds memory and computational overhead, which is obviously unwanted in the general case. But in our case we are aiming for optimal memory usage, so an "initial", additional computation is worth the big memory gain.

So let's do our own string interning. How to do that?

After unmarshaling the JSON result, while building the tagToUrlMap map, let's keep track of the string values we have come across, and if a subsequent string value has been seen earlier, just use that earlier value (its string descriptor).

Here's a very simple string interner implementation:

var cache = map[string]string{}

func interned(s string) string {
    if s2, ok := cache[s]; ok {
        return s2
    }
    // New string, store it
    cache[s] = s
    return s
}

Let's test this "interner" in the example code above:

var s []string
err := json.Unmarshal([]byte(`["abc", "abc", "abc"]`), &s)
if err != nil {
    panic(err)
}

for i := range s {
    hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
    fmt.Println(hdr.Data, s[i])
}

for i := range s {
    s[i] = interned(s[i])
}

for i := range s {
    hdr := (*reflect.StringHeader)(unsafe.Pointer(&s[i]))
    fmt.Println(hdr.Data, s[i])
}

Output of the above (try it on the Go Playground):

273760312 abc
273760315 abc
273760320 abc
273760312 abc
273760312 abc
273760312 abc

Wonderful! As we can see, after using our interned() function, only a single instance of the "abc" string is used in our data structure (which is actually the first occurrence). This means all other instances (given no one else uses them) can be, and will be, properly garbage collected (by the garbage collector, some time in the future).

One thing not to forget here: the string interner uses a cache dictionary which stores all previously encountered string values. So to let those strings go, you should "clear" this cache map too, most simply by assigning nil to it. (After that, interned() must not be called again unless the map is re-created, because writing to a nil map panics.)
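If the interner is needed more than once, the cache and its clearing step can be bundled into a small type. This is only a sketch of one possible design; the Interner type and its Intern, Len and Reset methods are hypothetical names, and this is not the stringsx.Pool API mentioned in the foreword.

```go
package main

import "fmt"

// Interner wraps the cache map from the simple implementation above.
// Hypothetical type for illustration; its zero value is ready to use.
type Interner struct {
	cache map[string]string
}

// Intern returns the first string seen with the same content as s.
func (in *Interner) Intern(s string) string {
	if s2, ok := in.cache[s]; ok { // reading from a nil map is fine
		return s2
	}
	if in.cache == nil {
		in.cache = make(map[string]string) // allocate lazily
	}
	in.cache[s] = s
	return s
}

// Len returns the number of distinct strings currently cached.
func (in *Interner) Len() int {
	return len(in.cache)
}

// Reset drops the cache so the interned strings can be collected
// once nothing else refers to them. Intern may be called again afterwards.
func (in *Interner) Reset() {
	in.cache = nil
}

func main() {
	var in Interner
	for _, s := range []string{"water", "forest", "water"} {
		in.Intern(s)
	}
	fmt.Println(in.Len()) // 2 distinct strings cached

	in.Reset()
	fmt.Println(in.Len()) // 0 after the reset
}
```

Allocating the map lazily in Intern means a Reset followed by further Intern calls does not hit the nil-map write panic.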

Without further ado, let's see our solution:

result := searchImages()

tagToUrlMap := make(map[string][]string)

for _, image := range result {
    imageURL := interned(image.URL)

    for _, tag := range image.Tags {
        tagName := interned(tag.Name)
        tagToUrlMap[tagName] = append(tagToUrlMap[tagName], imageURL)
    }
}

// Clear the interner cache:
cache = nil

To verify the results:

enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", "  ")
if err := enc.Encode(tagToUrlMap); err != nil {
    panic(err)
}

Output is (try it on the Go Playground):

{
  "blue": [
    "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
  ],
  "bridge": [
    "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
  ],
  "forest": [
    "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
    "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
  ],
  "ocean": [
    "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg"
  ],
  "river": [
    "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
  ],
  "water": [
    "https://c8.staticflickr.com/4/3707/11603200203_87810ddb43_o.jpg",
    "https://c3.staticflickr.com/1/48/164626048_edeca27ed7_o.jpg"
  ]
}

Further memory optimizations:

We used the builtin append() function to add new image URLs to tags. append() may (and usually does) allocate bigger slices than needed (anticipating future growth). After our "build" process, we may go through our tagToUrlMap map and "trim" those slices to the minimum needed.

This is how it could be done:

for tagName, urls := range tagToUrlMap {
    if cap(urls) > len(urls) {
        urls2 := make([]string, len(urls))
        copy(urls2, urls)
        tagToUrlMap[tagName] = urls2
    }
}
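To see the effect of the trimming on a single slice, here is a small sketch. The function name trimmed and the example.com URLs are made up for this illustration, and the capacity that append() chooses depends on the Go version, so no specific numbers are claimed.

```go
package main

import "fmt"

// trimmed returns urls unchanged if it has no spare capacity,
// otherwise a copy whose capacity equals its length.
// Hypothetical helper, named for this example only.
func trimmed(urls []string) []string {
	if cap(urls) == len(urls) {
		return urls
	}
	out := make([]string, len(urls))
	copy(out, urls)
	return out
}

func main() {
	var urls []string
	for i := 0; i < 5; i++ {
		urls = append(urls, fmt.Sprintf("https://example.com/%d.jpg", i))
	}
	// The capacity here is whatever append() grew the slice to.
	fmt.Println(len(urls), cap(urls))

	urls = trimmed(urls)
	// After trimming, length and capacity are equal.
	fmt.Println(len(urls), cap(urls))
}
```

The copy only duplicates the string headers; the URL bytes themselves are still shared with the original slice's elements.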

Will the [...] be garbage collected correctly?

Yes.

You never need to worry that something still in use will be collected, and you can rely on everything being collected once it is no longer used.

So the question about GC is never "Will it be collected correctly?" but "Do I generate unnecessary garbage?". That question depends less on the data structure than on the number of new objects created (on the heap). So it is a question of how the data structures are used, much more than of the structure itself. Use benchmarks and run go test with -benchmem.
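As a rough sketch of such a measurement, the program below compares the by-value and by-pointer map builds using testing.Benchmark, which works outside of go test. The image type, the generated sample data and the function names are assumptions made for this example; in a real project the loop bodies would normally live in BenchmarkXxx functions in a _test.go file and be run with go test -bench=. -benchmem.

```go
package main

import (
	"fmt"
	"testing"
)

// image is a reduced stand-in for the question's Image struct.
type image struct {
	URL  string
	Tags []string
}

// sampleImages generates n made-up images with three tags each.
func sampleImages(n int) []image {
	imgs := make([]image, n)
	for i := range imgs {
		imgs[i] = image{
			URL:  fmt.Sprintf("https://example.com/%d.jpg", i),
			Tags: []string{"water", "forest", fmt.Sprintf("tag%d", i%10)},
		}
	}
	return imgs
}

// buildByValue stores the URL string values in the map.
func buildByValue(imgs []image) map[string][]string {
	m := make(map[string][]string)
	for _, img := range imgs {
		for _, tag := range img.Tags {
			m[tag] = append(m[tag], img.URL)
		}
	}
	return m
}

// buildByPointer stores pointers to the URL fields instead.
func buildByPointer(imgs []image) map[string][]*string {
	m := make(map[string][]*string)
	for i := range imgs {
		for _, tag := range imgs[i].Tags {
			m[tag] = append(m[tag], &imgs[i].URL)
		}
	}
	return m
}

func main() {
	imgs := sampleImages(1000)

	byValue := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buildByValue(imgs)
		}
	})
	byPointer := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			buildByPointer(imgs)
		}
	})

	// Numbers depend on the machine and Go version; compare them, don't quote them.
	fmt.Println("by value:  ", byValue.AllocsPerOp(), "allocs/op,", byValue.AllocedBytesPerOp(), "B/op")
	fmt.Println("by pointer:", byPointer.AllocsPerOp(), "allocs/op,", byPointer.AllocedBytesPerOp(), "B/op")
}
```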

(High-end performance work might also consider how much work the GC has to do: scanning pointers takes time. Forget that for now.)

The other relevant question is about memory consumption. Copying a string copies just two words (a data pointer and a length), while copying a *string copies one word. So there is not much to save here by using *string.

So unfortunately there are no clear answers to the relevant questions (amount of garbage generated and total memory consumption). Don't overthink the problem: use what fits your purpose, measure, and refactor.
