How to get the proper values after an HTML table parse with Ruby/Nokogiri

I have searched for three days straight trying to get a data scraper to work, and it seems I have successfully parsed the HTML table, which looks like this:

<tr class='ds'>
<td class='ds'>Length:</td>
<td class='ds'>1/8"</td>
</tr>
<tr class='ds'>
<td class='ds'>Width:</td>
<td class='ds'>3/4"</td>
</tr>
<tr class='ds'>
<td class='ds'>Color:</td>
<td class='ds'>Red</td>
</tr>

However, I cannot seem to get it to print to CSV correctly.

The Ruby code is as follows:

Specifications = {
:length => ['Length:','length','Length'],       
:width => ['width:','width','Width','Width:'],  
:Color => ['Color:','color'], 
.......
}.freeze

def specifications
  @specifications ||= xml.css('tr.ds').map{|row| row.css('td.ds').map{|cell| cell.children.to_s } }.map{|record| 
  specification = Specifications.detect{|key, value| value.include? record.first } 
  [specification.to_s.titleize, record.last]  }
end 

And the CSV output ends up in one column (what seem to be the full arrays):

[["", nil], ["[:finishtype, [\"finish\", \"finish type:\", \"finish type\", \"finish type\", \"finish type:\"]]", "Metal"], ["", "1/4\""], ["[:length, [\"length:\", \"length\", \"length\"]]", "18\""], ["[:width, [\"width:\", \"width\", \"width\", \"width:\"]]", "1/2\""], ["[:styletype, [\"style:\", \"style\", \"style:\", \"style\"]]"........

I believe the issue is that I have not specified which values to return, but I wasn't successful any time I tried to specify the output. Any help would be greatly appreciated!

Try changing

[specification.to_s.titleize, record.last]

to

[specification.last.first.titleize, record.last]

The detect call yields e.g. [:length, ["Length:", "length", "Length"]], which to_s turns into "[:length, [\"Length:\", \"length\", \"Length\"]]". With last.first you can extract just the "Length:" part of it.

In case you encounter attributes that do not match anything in your Specifications, detect returns nil and calling last on it raises a NoMethodError. You could just drop those values by changing to:
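The difference can be checked in plain Ruby without Nokogiri. This is a minimal sketch, assuming a two-entry table named specs (a hypothetical stand-in for the Specifications constant) and the labels from the HTML above. titleize is left out because it comes from ActiveSupport rather than core Ruby.

```ruby
# Hypothetical stand-in for the Specifications constant in the question.
specs = {
  length: ['Length:', 'length', 'Length'],
  width:  ['width:', 'width', 'Width', 'Width:']
}.freeze

# On a Hash, detect yields [key, value] pairs and returns the first matching pair.
match = specs.detect { |_key, labels| labels.include?('Length:') }

pair_as_string = match.to_s        # the whole pair rendered as one string
first_label    = match.last.first  # first label in the matched list
key_name       = match.first.to_s  # the hash key as a string

# When no list contains the label, detect returns nil.
missing = specs.detect { |_key, labels| labels.include?('Weight:') }
```

If the key is wanted as the column label instead of the first alias, match.first.to_s is the alternative to match.last.first.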

    xml.css('tr.ds').map{|row| row.css('td.ds').map{|cell| cell.children.to_s } }.map{|record|  
      specification = Specifications.detect{|key, value| value.include? record.first }
      [specification.last.first.titleize, record.last] if specification 
    }.compact
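
The question does not show how the CSV file is written, so the following is only a sketch under assumptions: that each [label, value] pair should become one row with two fields, and that Ruby's standard csv library is used. The records array and the specs table are hypothetical stand-ins for the parsed table cells and the Specifications constant, and titleize is omitted because it requires ActiveSupport.

```ruby
require 'csv'

# Hypothetical stand-in for the Specifications constant.
specs = {
  length: ['Length:', 'length', 'Length'],
  width:  ['width:', 'width', 'Width', 'Width:']
}.freeze

# Hypothetical records, shaped like the result of the row.css('td.ds') step.
records = [
  ['Length:', '1/8"'],
  ['Width:', '3/4"'],
  ['Color:', 'Red'] # no entry in specs, so this record is dropped
]

# Map each record to [label, value]; unmatched records become nil and are removed.
rows = records.map { |label, value|
  spec = specs.detect { |_key, labels| labels.include?(label) }
  [spec.last.first, value] if spec
}.compact

# Append one pair per CSV row so the label and the value land in separate columns.
csv_text = CSV.generate do |csv|
  rows.each { |row| csv << row }
end
```

Appending each pair separately, rather than passing the whole nested array to a single row, is what keeps the output from collapsing into one column.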
