Ruby Nokogiri: scraping an HTML table and a CSS selector problem

Question

I have an issue scraping an HTML table. Here is the link: https://www.basketball-reference.com/players/c/curryst01/gamelog/2016 (yes, it's a famous introductory tutorial for scraping with Ruby). Here is the relevant code:

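require "nokogiri"
require "open-uri"

# URI.open is needed on current Ruby versions; a bare open(link) no longer fetches URLs
doc = Nokogiri::HTML.parse(URI.open(link))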

# Get the biggest table 
big_table = doc.css("table").sort { |x,y| y.css("tr").count <=> x.css("tr").count }.first

# Number of rows is 87, but there are 5 header rows that I want to remove
big_table.css("tr").count

# This doesn't remove the header rows
big_table = big_table.select { |row| row.css("th").empty? }

In fact, in HTML (I know nothing about HTML and I have been using Ruby for only 4 hours), th is the tag for a header cell, td is for a standard cell, and tr is just a row. The goal was to delete the header rows. Since .empty? returns true when a node set is empty (is a node set like the content of a tag?), this last line of code should have returned only the non-header tr elements. But it doesn't work; in fact, the result is [].
Instead, I noticed that big_table.select{|row| row.css("td").empty?}.count was equal to 5. So I decided to do:

big_table = big_table.select{|row| row.css("td").any?}

and it worked well.

My question is: why does this line work, and why did the first attempt fail? Maybe it's something in the HTML structure that I'm missing.

Thanks!

Answer 1

Let's take a look at big_table:

> big_table.class
 => Nokogiri::XML::NodeSet

> big_table.size
 => 1

So first of all, calling Enumerable#select on big_table is probably not doing what you expect. If you capture the rows instead:

> rows = big_table.css("tr")
> rows.count
 => 87

Now you can do your select on the rows. Let's take an arbitrary row and see what it contains:

> rows[2].css("td").count
 => 29

> rows[2].css("th").count
 => 1

So a typical row has 29 td elements and one th. In fact, every row has at least one th, which is why the css("th").empty? filter returned nothing. Conversely, the all-header rows do not contain any td elements, which is why what you tried worked.

Answer 1: accepted, 2017-07-29 13:38:38
