
How do I parse a table into its meaningful chunks?

I need to extract a table of data from each page in a collection of pages. I can already traverse the pages just fine.

How do I extract the table's data? I'm using Ruby and Nokogiri, but I would assume that this is a pretty general issue.

I underlined the desired data points in each row in the following image.

A sample of the html is: http://pastebin.com/YYFPbFLC

How would I use Nokogiri to parse this table into a hash of its meaningful chunks?

The table's xpath is:

/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table

The table has a variable number of data rows and formatting rows. I only want to collect the rows with meaningful data, but I don't see an obvious way to distinguish them with XPath, except that the second column will reliably contain "keyword". The XPaths of these rows are:

1st meaningful row is: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]
...
Last meaningful row: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[N]

The first meaningful column, whose text content needs to match "keyword", is:

/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]

The last column of this first row of data would be:

/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[6]

Each row is a record with a timestamp, and this column (td) holds the time portion of it. The year, month and day are each in their own variables and can be combined with the time to form a full timestamp:

/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[5]

The first rule of XPath is: never use the autogenerated XPath from Firebug or any other browser tool. These tools produce brittle XPath that treats every element on the page as equally important and required, including the parts you don't care about. For example, if a notice went up at the top of the page and it happened to be in a table, it could throw off your parsing.

Instead, think about how a human would identify it. In this case, you want "the first table under the heading with the word 'today' in it". Here's the XPath for that:

//table[preceding-sibling::h2[contains(text(), "today")]][1]

This selects the tables that have a preceding sibling h2 (in other words, tables that follow the h2 at the same level) whose text contains the word "today", and then takes the first such table.

Then you need to identify the rows you are interested in. Note that some rows are just dividers containing a single td, so make sure you only parse the rows that have multiple td tags. In XPath, a row with at least a second td child is:

//tr[td[2]]

Then you just grab the content of all the columns. In the first one you can remove everything before the words "of magnitude" to get just the value. Putting it all together:

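require 'nokogiri'

# html is the page source as a String, obtained however you already fetch the pages
doc = Nokogiri::HTML.parse(html)

events = []

# First table after the "today" heading, keeping only rows with at least two cells
doc.xpath('//table[preceding-sibling::h2[contains(text(), "today")]][1]//tr[td[2]]').each do |row|
  # Direct text of each cell, in column order
  cols = row.search('td/text()').map(&:to_s)
  events << {
    # Drop everything up to and including "of magnitude " to keep just the value
    :magnitude   => cols[0].gsub(/^.*of magnitude /,''),
    :temp_area   => cols[1],
    :time_start  => cols[2],
    :time_middle => cols[3],
    :time_end    => cols[4]
  }
end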

The output is:

[
 {:magnitude=>"F1.7",
  :temp_area=>"0",
  :time_start=>"01:11:00",
  :time_middle=>"01:24:00",
  :time_end=>"01:32:00"},
 {:magnitude=>"F3.1",
  :temp_area=>"0",
  :time_start=>"04:01:00",
  :time_middle=>"04:10:00",
  :time_end=>"04:26:00"},
 {:magnitude=>"F3.5",
  :temp_area=>"134F55",
  :time_start=>"06:24:00",
  :time_middle=>"06:42:00",
  :time_end=>"06:53:00"},
 {:magnitude=>"F1.4",
  :temp_area=>"0",
  :time_start=>"11:58:00",
  :time_middle=>"12:06:00",
  :time_end=>"12:16:00"},
 {:magnitude=>"F1.0",
  :temp_area=>"0",
  :time_start=>"13:02:00",
  :time_middle=>"13:05:00",
  :time_end=>"13:09:00"},
 {:magnitude=>"D53.7",
  :temp_area=>"134F55",
  :time_start=>"17:37:00",
  :time_middle=>"18:37:00",
  :time_end=>"18:56:00"}
]
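The question also mentions that the year, month and day are already held in separate variables. One possible way to combine them with a parsed time string is sketched below. The date values, the full_timestamp helper and the choice of UTC are all assumptions, since the page's time zone is not stated.

```ruby
require 'time'

# Hypothetical date parts; in the real script these come from your own variables.
year, month, day = 2012, 3, 5

# One record shaped like the hashes in the output above.
event = { :time_start => "01:11:00", :time_middle => "01:24:00", :time_end => "01:32:00" }

# Hypothetical helper: builds a Time from date parts and an "HH:MM:SS" string.
# UTC is an assumption; use Time.local or an explicit offset if the page uses another zone.
def full_timestamp(year, month, day, clock)
  hour, min, sec = clock.split(':').map { |part| Integer(part, 10) }
  Time.utc(year, month, day, hour, min, sec)
end

start_time = full_timestamp(year, month, day, event[:time_start])
end_time   = full_timestamp(year, month, day, event[:time_end])

puts start_time.iso8601
puts end_time.iso8601
```

Passing base 10 to Integer keeps values with a leading zero such as "08" from being rejected as invalid octal.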


 