Goroutine didn't run as expected
I'm still learning Go and am practicing with the web crawler exercise, as shown in the link. The main part I implemented is below. (The other parts remain the same and can be found in the link.)
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	cache.Set(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		if cache.Get(u) == false {
			fmt.Println("Next:", u)
			Crawl(u, depth-1, fetcher) // I want to parallelize this
		}
	}
	return
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}
type SafeCache struct {
	v   map[string]bool
	mux sync.Mutex
}

func (c *SafeCache) Set(key string) {
	c.mux.Lock()
	c.v[key] = true
	c.mux.Unlock()
}

func (c *SafeCache) Get(key string) bool {
	return c.v[key]
}

var cache SafeCache = SafeCache{v: make(map[string]bool)}
When I run the code above, the result is as expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
found: https://golang.org/pkg/ "Packages"
Next: https://golang.org/cmd/
not found: https://golang.org/cmd/
Next: https://golang.org/pkg/fmt/
found: https://golang.org/pkg/fmt/ "Package fmt"
Next: https://golang.org/pkg/os/
found: https://golang.org/pkg/os/ "Package os"
However, when I tried to parallelize the crawler by changing Crawl(u, depth-1, fetcher) to go Crawl(u, depth-1, fetcher) (the commented line in the program above), the result was not what I expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
Next: https://golang.org/cmd/
I thought simply adding the go keyword would be straightforward, but I'm not sure what went wrong and am confused about how best to fix it. Any advice would be appreciated. Thanks in advance!
Your program most likely exits before the crawlers finish their work. One approach is to give Crawl a WaitGroup on which it waits for all of its sub-crawlers to finish. For example:
import "sync"

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
	defer func() {
		// If the crawler was given a wait group, signal that it's finished
		if wg != nil {
			wg.Done()
		}
	}()
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	cache.Set(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	var crawlers sync.WaitGroup
	for _, u := range urls {
		if cache.Get(u) == false {
			fmt.Println("Next:", u)
			crawlers.Add(1)
			go Crawl(u, depth-1, fetcher, &crawlers)
		}
	}
	crawlers.Wait() // Waits for its sub-crawlers to finish
	return
}

func main() {
	// The root crawler does not need a WaitGroup
	Crawl("https://golang.org/", 4, fetcher, nil)
}