Ruby Nokogiri HTML scraping table with CSS issue
I have an issue scraping an HTML table. Here is the link: https://www.basketball-reference.com/players/c/curryst01/gamelog/2016 (yes, it's a well-known introductory tutorial for scraping with Ruby).
Here is the relevant code:
doc = Nokogiri::HTML.parse(open(link))
# Get the biggest table
big_table = doc.css("table").sort { |x,y| y.css("tr").count <=> x.css("tr").count }.first
# Number of rows is 87, but there are 5 heads that I wanna remove
big_table.css("tr").count
# This doesn't remove heads
big_table = big_table.select { |row| row.css("th").empty? }
In HTML (I know nothing about HTML and have been using Ruby for only four hours), th is the tag for a header cell, td is the tag for a standard cell, and tr is a row. The goal was to drop the header rows. Since .empty? returns true when a node set is empty (is a node set something like the content of a tag?), I expected this last line of code to return only the non-header tr elements. But it doesn't work: the result is [].
Instead, I noticed that:
big_table.select{|row| row.css("td").empty?}.count
was equal to 5. So I decided to do:
big_table = big_table.select{|row| row.css("td").any?}
and it worked well.
My question is: why does this line work, and why did the first attempt fail? Maybe it's something in the HTML structure that I'm missing.
Thanks!
Let's take a look at big_table:
> big_table.class
=> Nokogiri::XML::NodeSet
> big_table.size
=> 1
So first of all, calling Enumerable#select on big_table is probably not doing what you expect: it holds a single table node, not the rows. If instead you capture the rows:
> rows = big_table.css("tr")
> rows.count
=> 87
Now you can do your select on the rows. Let's take an arbitrary row and see what it contains:
> rows[2].css("td").count
=> 29
> rows[2].css("th").count
=> 1
So a typical row has 29 td elements and one th. In fact every row has at least one th, which is why filtering on css("th").empty? returned nothing. Conversely, the all-header rows do not contain any td elements, which is why what you tried worked.