Goroutines sharing slices: trying to understand a data race

Question

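I am trying to write a Go program that finds some genes in very large files of DNA sequences. I already wrote a Perl program that does this, but I would like to take advantage of goroutines to run the search in parallel ;)

Because the files are huge, my idea was to read 100 sequences at a time, send them to a goroutine for analysis, then read the next 100 sequences, and so on.

I would like to thank the members of this site for their really helpful explanations of slices and goroutines.

I made the suggested change so that each goroutine works on a copy of the slice. But running with -race still detects one data race, at the level of the copy() function:

Thank you very much for your comments!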

    ==================
WARNING: DATA RACE
Read by goroutine 6:
  runtime.slicecopy()
      /usr/lib/go-1.6/src/runtime/slice.go:113 +0x0
  main.main.func1()
      test_chan006.go:71 +0xd8

Previous write by main goroutine:
  main.main()
      test_chan006.go:63 +0x3b7

Goroutine 6 (running) created at:
  main.main()
      test_chan006.go:73 +0x4c9
==================
[>5HSAA098909 BA098909 ...]
Found 1 data race(s)
exit status 66

    line 71 is : copy(bufCopy, buf_Seq)
    line 63 is : buf_Seq = append(buf_Seq, line)
    line 73 is :}(genes, buf_Seq)




package main

import (
    "bufio"
    "fmt"
    "os"
    "github.com/mathpl/golang-pkg-pcre/src/pkg/pcre"
    "sync"
)

// read_genes reads a list of genes and returns a slice of gene names
func read_genes(filename string) []string {
    var genes []string // slice of gene names
    // Open the file.
    f, _ := os.Open(filename)
    // Create a new Scanner for the file.
    scanner := bufio.NewScanner(f)
    // Loop over all lines in the file and collect them.
    for scanner.Scan() {
        line := scanner.Text()
        genes = append(genes, line)
    }
    return genes
}

// search_gene2 finds the sequences that match a gene from the genes slice
func search_gene2(genes []string, seqs []string) []string {
    var res []string

    for r := 0; r <= len(seqs)-1; r++ {
        for i := 0; i <= len(genes)-1; i++ {

            match := pcre.MustCompile(genes[i], 0).MatcherString(seqs[r], 0)

            if match.Matches() == true {
                res = append(res, seqs[r]) // if a gene matches, the sequence is appended to res
                break
            }
        }
    }

    return res
}
//###########################################

func main() {
    var slice []string
    var buf_Seq []string
    read_buff := 100 // the number of sequences analysed by one goroutine

    var wg sync.WaitGroup
    queue := make(chan []string, 100)

    filename := "fasta/sequences.tsv"
    f, _ := os.Open(filename)
    scanner := bufio.NewScanner(f)
    n := 0
    genes := read_genes("lists/genes.csv")

    for scanner.Scan() {
        line := scanner.Text()
        n += 1
        buf_Seq = append(buf_Seq, line) // store the sequences into buf_Seq
        if n == read_buff { // when the read buffer contains 100 sequences one goroutine analyses them

            wg.Add(1)

            go func(genes, buf_Seq []string) {
                defer wg.Done()
                bufCopy := make([]string, len(buf_Seq))
                copy(bufCopy, buf_Seq)
                queue <- search_gene2(genes, bufCopy)
            }(genes, buf_Seq)
            buf_Seq = buf_Seq[:0] // reset buf_Seq
            n = 0 // reset the sequences counter

        }
    }
    go func() {
        wg.Wait()
        close(queue)
    }()

    for t := range queue {
        slice = append(slice, t...)
    }

    fmt.Println(slice)
}

Answer 1

The goroutine only works on a copy of the slice header; the underlying array is the same. To copy a slice you need to use copy (or append to a different slice).

buf_Seq = append(buf_Seq, line)
bufCopy := make([]string, len(buf_Seq))
copy(bufCopy, buf_Seq)

Then you can safely pass bufCopy to the goroutine, or simply use it directly in the closure.
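The important detail is where the copy happens. In the question the copy is made inside the goroutine, so the goroutine still reads buf_Seq while the main goroutine appends to it, which is exactly what the race report shows. The following is a minimal sketch of the other ordering, where the copy is made before the goroutine starts. The function name processBatches, the results slice, and the sample data are made up for illustration, and joining the strings stands in for the real gene search.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// processBatches groups lines into batches of batchSize and hands each batch
// to its own goroutine. The copy is made in the calling goroutine, before the
// worker starts, so the reused buffer buf is never shared with a worker.
// A trailing partial batch is ignored to keep the sketch short.
func processBatches(lines []string, batchSize int) []string {
	var wg sync.WaitGroup
	results := make([]string, len(lines)/batchSize) // one slot per full batch
	buf := make([]string, 0, batchSize)
	idx := 0

	for _, line := range lines {
		buf = append(buf, line)
		if len(buf) == batchSize {
			batch := make([]string, len(buf))
			copy(batch, buf) // copied here, not inside the goroutine
			buf = buf[:0]    // safe: no worker holds this backing array

			wg.Add(1)
			go func(i int, b []string) {
				defer wg.Done()
				results[i] = strings.Join(b, ",") // each worker writes only its own slot
			}(idx, batch)
			idx++
		}
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(processBatches([]string{"s1", "s2", "s3", "s4"}, 2))
}
```

Running this sketch with go run -race should report no race, because every worker owns its batch and writes to a distinct element of results.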

Answer 2

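The slice is indeed a copy, but slices themselves are reference types. A slice is basically a three-word structure. It contains a pointer to the start of the underlying array, an integer for the current number of elements in the slice, and another integer for the capacity of the underlying array. When you pass a slice to a function, a copy is made of this slice "header" structure, but the header still refers to the same underlying array as the header that was passed in.

This means that any change you make to the slice header itself (sub-slicing it, or appending enough to trigger a resize and therefore a reallocation to a new location with a new start pointer, and so on) is only reflected in the slice header inside that function. Any change to the underlying data itself, however, is reflected in the slice outside the function (unless you triggered a reallocation by growing the slice past its capacity).

Example: https://play.golang.org/p/a2y5eGulXW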
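As a small illustration of the two cases described above, here is a sketch (not the code behind the playground link; the function names setFirst and grow are invented for this example). Writing through a copied header changes the caller's data, while an append that exceeds the capacity reallocates and stays local to the callee.

```go
package main

import "fmt"

// setFirst writes through the copied header into the shared backing array,
// so the caller sees the change.
func setFirst(s []string) {
	s[0] = "changed"
}

// grow appends past the capacity, which allocates a new backing array.
// Only the local header points at it, so the caller sees nothing.
func grow(s []string) {
	s = append(s, "x", "y", "z")
	s[0] = "local only"
}

func main() {
	s := []string{"a", "b"} // len 2, cap 2

	setFirst(s)
	fmt.Println(s, len(s), cap(s)) // [changed b] 2 2

	grow(s)
	fmt.Println(s, len(s), cap(s)) // [changed b] 2 2
}
```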

Answer 3

I think this is idiomatic Go (for this job).
A piece of code is worth a thousand comments:

genes = readGenes("lists/genes.csv") // read the gene list
n := runtime.NumCPU()                // the number of goroutines
wg.Add(n + 1)
go scan() // read the "fasta/sequences.tsv"
for i := 0; i < n; i++ {
    go search()
}
go WaitClose()
slice := []string{}
for t := range queue {
    slice = append(slice, t)
}
fmt.Println(slice)

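scan() reads "fasta/sequences.tsv" into this channel: var ch = make(chan string, 100), and runs concurrently with the workers. search() is a CPU-intensive goroutine, so for performance reasons the number of search goroutines is limited to NumCPU.

Try this working sample code (simulated and tested):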

package main

import (
    "bufio"
    "fmt"
    //"os"
    "runtime"
    "strings"
    "sync"
    //"github.com/mathpl/golang-pkg-pcre/src/pkg/pcre"
)

func main() {
    genes = readGenes("lists/genes.csv") // read the gene list
    n := runtime.NumCPU()                // the number of goroutines
    wg.Add(n + 1)
    go scan() // read the "fasta/sequences.tsv"
    for i := 0; i < n; i++ {
        go search()
    }
    go WaitClose()
    slice := []string{}
    for t := range queue {
        slice = append(slice, t)
    }
    fmt.Println(slice)
}

var wg sync.WaitGroup
var genes []string
var ch = make(chan string, 100)
var queue = make(chan string, 100)

func scan() {
    defer wg.Done()
    defer close(ch)
    scanner := bufio.NewScanner(strings.NewReader(strings.Join([]string{"A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2", "I2"}, "\n")))
    /*f, err := os.Open("fasta/sequences.tsv")
    if err != nil {
        panic(err)
    }
    defer f.Close()
     scanner := bufio.NewScanner(f)*/
    for scanner.Scan() {
        ch <- scanner.Text()
    }
}

func match(pattern, seq string) bool {
    //return pcre.MustCompile(pattern, 0).MatcherString(seq, 0).Matches()
    return pattern[0] == seq[0]
}

func search() {
    defer wg.Done()
    for seq := range ch {
        for _, gene := range genes {
            if match(gene, seq) {
                queue <- seq
                break
            }
        }
    }
}

func WaitClose() {
    wg.Wait()
    close(queue)
}

// function read a list of genes and return a slice of gene names.
func readGenes(filename string) []string {
    return []string{"A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1"}
    /*var genes []string // slice of genes names
    f, err := os.Open(filename)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        genes = append(genes, line)
    }
    return genes*/
}

Output:

[A2 B2 C2 D2 E2 F2 G2 H2 I2]

I hope this helps with your real case (the comments are switched in this code, and it is not tested):

package main

import (
    "bufio"
    "fmt"
    "os"
    "runtime"
    //"strings"
    "sync"

    "github.com/mathpl/golang-pkg-pcre/src/pkg/pcre"
    //pcre "regexp"
)

func main() {
    genes = readGenes("lists/genes.csv") // read the gene list
    n := runtime.NumCPU()                // the number of goroutines
    wg.Add(n + 1)
    go scan() // read the "fasta/sequences.tsv"
    for i := 0; i < n; i++ {
        go search()
    }
    go WaitClose()
    slice := []string{}
    for t := range queue {
        slice = append(slice, t)
    }
    fmt.Println(slice)
}

var wg sync.WaitGroup
var genes []string
var ch = make(chan string, 100)
var queue = make(chan string, 100)

func scan() {
    defer wg.Done()
    defer close(ch)
    //scanner := bufio.NewScanner(strings.NewReader(strings.Join([]string{"A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2", "I2"}, "\n")))
    f, err := os.Open("fasta/sequences.tsv")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        ch <- scanner.Text()
    }
}

func match(pattern, seq string) bool {
    return pcre.MustCompile(pattern, 0).MatcherString(seq, 0).Matches()
    //return pattern[0] == seq[0]
    //return pcre.MustCompile(pattern).Match([]byte(seq))
}

func search() {
    defer wg.Done()
    for seq := range ch {
        for _, gene := range genes {
            if match(gene, seq) {
                queue <- seq
                break
            }
        }
    }
}

func WaitClose() {
    wg.Wait()
    close(queue)
}

// function read a list of genes and return a slice of gene names.
func readGenes(filename string) []string {
    //return []string{"A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1"}
    var genes []string // slice of genes names
    f, err := os.Open(filename)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        genes = append(genes, line)
    }
    return genes
}

Problems in your code:
1- In read_genes(filename string) []string you should check the error:

f, err := os.Open(filename)
if err != nil {
    panic(err)
}

2- In read_genes(filename string) []string, close the opened file:

defer f.Close()

3- After filename := "fasta/sequences.tsv" you should check the error:

f, err := os.Open(filename)
if err != nil {
    panic(err)
}

4- After filename := "fasta/sequences.tsv", close the opened file:

defer f.Close()

5- Inside for scanner.Scan() {, if the file fasta/sequences.tsv does not contain a multiple of 100 lines, the condition if n == read_buff { never becomes true for the last slice, and you will miss it.

6- How many CPU cores do you have? You should limit the number of goroutines.
7- Your main problem:
I made a minimal, complete, and verifiable example (problem 5 is still present):

package main

import (
    "bufio"
    "fmt"
    "strings"
    "sync"
)

func match(pattern, str string) bool {
    return pattern[0] == str[0]
}
func search_gene2(genes, seqs []string) (res []string) {
    for _, r := range seqs {
        for _, i := range genes {
            if match(i, r) {
                res = append(res, r) // is the gene matches the gene name is append to res
                break
            }
        }
    }
    return
}

func main() {
    read_buff := 2 // the number of sequences analysed by one goroutine
    var wg sync.WaitGroup
    queue := make(chan []string, read_buff)
    genes := []string{"A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1"}
    sequences := strings.Join([]string{"A2", "B2", "C2", "D2", "E2", "F2", "G2", "H2", "I2"}, "\n")
    scanner := bufio.NewScanner(strings.NewReader(sequences))
    buf_Seq := make([]string, 0, read_buff)
    for n := 1; scanner.Scan(); n++ {
        line := scanner.Text()
        buf_Seq = append(buf_Seq, line) // store the sequences into buf_Seq
        if n == read_buff {             // when the read buffer contains 100 sequences one goroutine analyses them
            wg.Add(1)
            temp := make([]string, n)
            copy(temp, buf_Seq)
            buf_Seq = buf_Seq[:0] // reset buf_Seq
            n = 0                 // reset the sequences counter
            go func(genes, Seq []string) {
                defer wg.Done()
                fmt.Println(Seq)
                queue <- search_gene2(genes, Seq)
            }(genes, temp)
        }
    }
    go func() {
        wg.Wait()
        close(queue)
    }()
    slice := []string{}
    for t := range queue {
        slice = append(slice, t...)
    }
    fmt.Println(slice)
}

Output (problem 5: where is I2?):

[A2 B2]
[C2 D2]
[E2 F2]
[G2 H2]
[A2 B2 C2 D2 E2 F2 G2 H2]
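The missing I2 is the leftover partial batch from problem 5. One possible way to handle it, sketched here without goroutines so that only the batching logic is visible, is to flush whatever remains in the buffer after the scan loop ends. The function name batches is hypothetical; in the real program each emitted batch would be handed to a goroutine instead of being collected.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// batches reads lines from input and groups them into slices of at most
// size lines. The final, possibly shorter batch is emitted after the loop.
func batches(input string, size int) [][]string {
	var out [][]string
	buf := make([]string, 0, size)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		buf = append(buf, scanner.Text())
		if len(buf) == size {
			out = append(out, append([]string(nil), buf...)) // independent copy
			buf = buf[:0]
		}
	}
	if len(buf) > 0 { // flush the remainder that never reached size
		out = append(out, append([]string(nil), buf...))
	}
	return out
}

func main() {
	fmt.Println(batches("A2\nB2\nC2\nD2\nE2\nF2\nG2\nH2\nI2", 2))
	// [[A2 B2] [C2 D2] [E2 F2] [G2 H2] [I2]]
}
```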

This is the solution to your main problem (make a new slice and copy all the data into it):

temp := make([]string, n)
copy(temp, buf_Seq)
buf_Seq = buf_Seq[:0] // reset buf_Seq
n = 0                 // reset the sequences counter
go func(genes, Seq []string) {
    defer wg.Done()
    fmt.Println(Seq)
    queue <- search_gene2(genes, Seq)
}(genes, temp)

Reason:
Found 1 data race(s)
exit status 66

    line 71 is : copy(bufCopy, buf_Seq)
    line 63 is : buf_Seq = append(buf_Seq, line)
    line 73 is :}(genes, buf_Seq)

As the other answers said: you shared the same underlying array of the slice with all the goroutines.

I hope this helps.

Answer 4

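The data race exists because slices are reference types in Go. They are generally passed by value, but because they are reference types, any change made through one value is reflected in the other. Consider: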

func f(xs []string) {
    xs[0] = "changed_in_f"
}

func main() {
    xs := []string{"set_in_main", "asd"}
    fmt.Println("Before call:", xs)
    f(xs)
    fmt.Println("After call:", xs)

    var ys []string
    ys = xs
    ys[0] = "changed_through_ys"
    fmt.Println("After ys:", xs)

}

Prints:

Before call: [set_in_main asd]
After call: [changed_in_f asd]
After ys: [changed_through_ys asd]

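This happens because all three slices share the same underlying array in memory. More details here.

This is what may be happening when you pass buf_Seq to search_gene2. A new slice value is passed to each call, but each slice value may refer to the same underlying array, which causes a potential race condition (the call to append may change the underlying array of the slice).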
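To make the mechanism concrete, the sketch below reproduces the overwrite without any goroutines. Resetting a buffer with buf[:0] keeps the same backing array, so the next append writes over elements that an earlier slice value can still see. The function name reuseDemo and the sample strings are made up for illustration.

```go
package main

import "fmt"

// reuseDemo returns the earlier slice value after the buffer it came from
// has been reset and appended to again.
func reuseDemo() []string {
	buf := make([]string, 0, 2)
	buf = append(buf, "seq1", "seq2") // fits in the capacity, no reallocation

	batch := buf // copies only the header; same backing array as buf

	buf = buf[:0]             // length reset, backing array kept
	buf = append(buf, "seq3") // overwrites element 0 of the shared array

	return batch
}

func main() {
	fmt.Println(reuseDemo()) // [seq3 seq2]
}
```

With goroutines, the same overwrite happens concurrently with the read inside the worker, which is what the race detector reports.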

To fix the problem, try the following in your main:

bufCopy := make([]string, len(buf_Seq))
// make a copy of buf_Seq in an entirely separate slice
copy(bufCopy, buf_Seq)
go func(genes, seqs []string) {
    defer wg.Done()
    queue <- search_gene2(genes, seqs)
}(genes, bufCopy)

Answer 1
4 2016-08-12 17:36:56

Answer 2
1 2016-08-12 17:53:01

Answer 3
1

Answer 4
0 accepted 2016-08-12 17:37:49
