Parsing an HTML table using Nokogiri and Mechanize

Question

Using the following code, I am trying to scrape a call log from our phone provider's web application so that I can enter the information into my Ruby on Rails application.

desc "Import incoming calls"
task :fetch_incomingcalls => :environment do

    # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
    require 'rubygems'
    require 'mechanize'
    require 'logger'

    # Create a new mechanize object
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }

    # Load the Phone Provider website
    page = agent.get("https://manage.phoneprovider.co.uk/login")

    # Select the first form
    form = agent.page.forms.first
    form.username = 'username'
    form.password = 'password'

    # Submit the form
    page = form.submit form.buttons.first

    # Click on link called Call Logs
    page = agent.page.link_with(:text => "Call Logs").click

    # Click on link called Incoming Calls
    page = agent.page.link_with(:text => "Incoming Calls").click

    # Prints out table rows
    # puts doc.css('table > tr')

    # Print out the body as a test
    # puts page.body

end

As you can see from the last five lines, I have tested that 'puts page.body' works, and the code above works too. It successfully logs in and then navigates to Call Logs followed by Incoming Calls. The incoming call table looks like this:

| Timestamp    |    Source    |    Destination    |    Duration    |
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |

It is generated from the following markup:

<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>

I'm trying to work out how to select just the cells I want (Timestamp, Source, Destination and Duration) and output those. I can then worry about writing them to the database rather than to the terminal.

I have tried using Selector Gadget, but it just shows either 'td' or, if I select multiple cells, 'tr:nth-child(6) td , tr:nth-child(2) td'.

Any help or pointers would be appreciated!

Answer 1

There is a pattern in the table that is easy to leverage using XPath. The <tr> tags of the rows that hold the required information lack the class attribute. Fortunately, XPath provides some simple logical operations, including not(), which is exactly the functionality we need: tr[not(@class)] matches only the rows without a class.

Once we've reduced the number of rows we're dealing with, we can iterate over them and extract the text of the necessary columns using XPath's element[n] selector. One important note here is that XPath counts elements starting from 1, so the first column of a table row is td[1].

Example code using Nokogiri (and specs):

require "rspec"
require "nokogiri"

HTML = <<HTML
<table>
  <thead>
    <tr>
      <td>
        Timestamp
      </td>
      <td>
        Source
      </td>
      <td>
        Destination
      </td>
      <td>
        Duration
      </td>
      <td>
        Cost
      </td>
      <td class='centre'>
        Recording
      </td>
    </tr>
  </thead>
  <tbody>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        03 Jan 13:40
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:01:14
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        30 Dec 20:31
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:02:52
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        24 Dec 00:03
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:09
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        23 Dec 14:56
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:07
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        21 Dec 13:26
      </td>
      <td>
        07793770851
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:26
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
  </tbody>
</table>
HTML

class TableExtractor
  # Returns one hash per data row, i.e. per <tr> without a class attribute.
  def extract_data(html)
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      # XPath positions are 1-based, so td[1] is the first column.
      timestamp   = row.at_xpath("td[1]").text.strip
      source      = row.at_xpath("td[2]").text.strip
      destination = row.at_xpath("td[3]").text.strip
      duration    = row.at_xpath("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end

RSpec.describe TableExtractor do
  before(:all) do
    @html = HTML
  end

  it "should extract the timestamp properly" do
    expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40"
  end

  it "should extract the source properly" do
    expect(subject.extract_data(@html)[0][:source]).to eq "12345678"
  end

  it "should extract the destination properly" do
    expect(subject.extract_data(@html)[0][:destination]).to eq "12345679"
  end

  it "should extract the duration properly" do
    expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14"
  end

  it "should extract all informational rows" do
    expect(subject.extract_data(@html).count).to eq 5
  end
end

Answer 2

Your answer lies in this Railscast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

This may also help:

How do I parse an HTML table with Nokogiri?

Answer 3

You should be able to reach the exact node you require from the root (worst case) using XPath selectors. Using XPath with Nokogiri is listed here.

For details on how to reach all your elements using XPath, look here.

3 answers

Answer 1: score 10, accepted, 2012-01-06 20:03:52
Answer 2: score 2, 2012-01-09 13:06:11
Answer 3: score -1, 2012-01-06 06:50:46