
Having trouble parsing these data in watir-webdriver

See hierarchy below:

[Image: screenshot of the HTML hierarchy]

All I need here is "Company Title", "Company Owner", "Company Owner Title", "Street Number Street Name", and "City, State Zipcode".

I tried b.div.span.bs, but that didn't work (bs, plural, because there are multiple blocks I'm gathering data from). I also thought about trying something like b.tds.split('<br>'), then stripping out the tags and deleting the empty array cells. But each block is different, so the data don't align: Company Title might be in cell 1 of the first array, but if Company Title isn't present in the second block, cell 1 would be Company Owner instead. Anyway, I'm just trying to find a clever way to get these data. Thank you.

Here is the actual HTML; however you must first click "View All".

You can pull out everything inside the <div> and then split that by <br>. The first part is Company Title (if it exists), and Company Owner is the last part (the second one when a title is present).

The rest is... trickier. Some parts are pretty straightforward: Fax and Member Since have labels, so those are easy. The <a> is easy too.

You could probably test for the phone number with a regex and then back up from there. If the entry before the phone number isn't the <a>, then it's City, State Zipcode, and the one before that is the address. If anything exists before that, it's the Company Owner Title.

Everything after the phone number in your examples has a label, so those are easy.
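
The backing-up heuristic can be sketched in plain Ruby. This is only a rough sketch under several assumptions: parse_block is a hypothetical helper, the default phone pattern is a guess at US-style numbers, the sample values are made up, and the input is assumed to be already split on <br> with tags, blank entries, and the <a> link removed.

```ruby
# parse_block is a hypothetical helper, not part of Watir.
#   div_lines:  the text pieces found inside the <div> (title and/or owner)
#   rest_lines: the text pieces that follow the <div>
#   phone:      assumed pattern for numbers like "(555) 123-4567" or "555-123-4567"
def parse_block(div_lines, rest_lines, phone: /\A\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\z/)
  phone_at = rest_lines.index { |line| line.match?(phone) }
  # We need at least an address line and a city line before the phone number.
  return nil if phone_at.nil? || phone_at < 2

  before = rest_lines[0...phone_at]
  city_state_zip = before.pop # entry just before the phone number
  street = before.pop         # entry before that

  {
    company_title: div_lines.size > 1 ? div_lines.first : nil,
    company_owner: div_lines.last,
    owner_title: before.last, # nil when nothing precedes the address
    street: street,
    city_state_zip: city_state_zip
  }
end

# Made-up sample data for two differently shaped blocks.
full = parse_block(
  ["Acme Corp", "Jane Doe"],
  ["President", "123 Main St", "Springfield, IL 62704", "(555) 123-4567", "Fax: (555) 123-9999"]
)
sparse = parse_block(["John Roe"], ["456 Oak Ave", "Portland, OR 97201", "555-987-6543"])

p full
p sparse
```

Because every field is looked up relative to the phone number rather than by fixed index, a missing Company Title or Company Owner Title comes back as nil instead of shifting the other fields.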

I'm not sure about all of your use cases, but generally, for pages where the DOM isn't very helpful, I just grab the text and parse it with Ruby:

browser.td.text.split("\n").reject(&:empty?)
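
To see what that one-liner produces without a browser, here is a small sketch on a hard-coded string standing in for browser.td.text. The sample text is made up, and map(&:strip) is an extra step added here to trim stray whitespace; it is not part of the line above.

```ruby
# Hypothetical stand-in for browser.td.text; a real page will differ.
text = "Acme Corp\nJane Doe\n\n President \n123 Main St\n\nSpringfield, IL 62704\n"

# Split into lines, trim surrounding whitespace, and drop blank entries.
lines = text.split("\n").map(&:strip).reject(&:empty?)

p lines
```

Stripping before rejecting matters: a line holding only spaces would otherwise survive reject(&:empty?).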

This doesn't directly answer the question, but it shows how I'd go about doing this using Nokogiri, which is the standard HTML/XML parser for Ruby:

require 'nokogiri'

doc = Nokogiri::HTML('<td><div></div><br>a<br>b<br>c</td>')

doc is Nokogiri's internal representation of the document.

We use landmarks in the markup to navigate and find things we want. In this case <div> is a good starting point:

doc.at('div').next_sibling.next_sibling.text # => "a"

next_sibling is how we tell Nokogiri to look at the next node. In this case it steps past the first <br> and lands on the text node containing "a".

Chaining next_sibling calls like that quickly becomes unworkable, though, so there's a better way to go:

doc.search('td br').to_html # => "<br><br><br>"

That shows we can find all the <br> tags inside the <td> , so we just have to iterate over them and use them as our landmarks:

doc.search('td br').map{ |br| br.next_sibling.text } # => ["a", "b", "c"]


 