
How to use selectors properly

I'm writing a crawler to retrieve data from some pages. The logic of how to build it is clear to me, but I'm confused about how to use selectors properly.

I would like to get the titles of some news items using colly. I went to the page https://g1.globo.com/economia, right-clicked the title I want to extract, chose Inspect, and then Copy selector.

The selector is:

body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1

How can I put it correctly in this line of code?

detailCollector.OnHTML("body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1", func(element *colly.HTMLElement) {
    fmt.Println(element.Text)
})

What is the correct way to write this selector so that colly can understand it? I couldn't find anything related to this in the colly documentation.

The selectors aren't specific to colly. It uses goquery's Find function:

doc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection)

The example you provided is a CSS selector, so you can find the definitive reference for those in the standard here: https://www.w3.org/TR/selectors-3/#selectors

However, that particular web page does not seem to contain anything matching the selector above.

The selector you provided is extremely specific, which is probably why it does not match anything. Broken down, it reads as:

body >  div.glb-grid > main > div.row.content-head.non-featured > div.title > h1

Find an "h1" element that is a child of a div whose class list contains "title", which is itself a child of a div whose class list contains ALL of "row", "content-head", and "non-featured", which is a child of a main element, which is a child of a div whose class list contains "glb-grid", which is a child of the body element.

Contrast this long selector with the much simpler but more generic selector "h1", which yields only the web page title, as it seems to be the only "h1" element in the document. This may explain your confusion.

<h1 class="header-title"> 
<div class="header-title-content">
<a class="header-editoria--link" href="https://g1.globo.com/economia/">Economia</a>
</div>
</h1>

Added to that, the page adjusts the DOM using JavaScript, so what actually lies on the page is somewhat of a moving target.

However, it's not all bad news, as I suspect that the items you are looking for might simply require:

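package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // Map each headline text to the URL it links to.
    headlines := make(map[string]string)
    c := colly.NewCollector()
    // Runs once for every element with the class "feed-post-link".
    c.OnHTML(".feed-post-link", func(e *colly.HTMLElement) {
        headlines[e.Text] = e.Attr("href")
    })

    // Stop with an error message if the page cannot be fetched.
    if err := c.Visit("https://g1.globo.com/economia"); err != nil {
        log.Fatal(err)
    }
    for hl, url := range headlines {
        fmt.Printf("'%v' - (%v)\n", hl, url)
    }
}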

This uses a simple selector that chooses all HTML elements that have a class of "feed-post-link", which seems to include all of the headlines for that page. I've extracted the URLs as well as the corresponding titles in this example, but that is simply illustrative and you can ignore them if that is not what you require.
