Parsing an HTML table with Nokogiri and Mechanize

Question

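Using the following code, I am trying to scrape a call log from our phone provider's web application and import the information into my Ruby on Rails application.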

desc "Import incoming calls"
task :fetch_incomingcalls => :environment do

    # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
    require 'rubygems'
    require 'mechanize'
    require 'logger'

    # Create a new mechanize object
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }

    # Load the Phone Provider website
    page = agent.get("https://manage.phoneprovider.co.uk/login")

    # Select the first form
    form = agent.page.forms.first
    form.username = 'username'
    form.password = 'password'

    # Submit the form
    page = form.submit form.buttons.first

    # Click on link called Call Logs
    page = agent.page.link_with(:text => "Call Logs").click

    # Click on link called Incoming Calls
    page = agent.page.link_with(:text => "Incoming Calls").click

    # Prints out table rows
    # puts doc.css('table > tr')

    # Print out the body as a test
    # puts page.body

end

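As the last five lines show, I have tested that 'puts page.body' runs successfully, so the code above works. It logs in, navigates to Call Logs, and then to Incoming Calls. The incoming calls table looks like this: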

| Timestamp    |    Source    |    Destination    |    Duration    |
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |

It is generated from the following markup:

<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>

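I am trying to work out how to select only the cells I want (Timestamp, Source, Destination and Duration) and output them. After that I can worry about writing them to the database rather than the terminal.

I have tried using Selector Gadget, but when I select multiple cells it only shows 'td' or 'tr:nth-child(6) td, tr:nth-child(2) td'.

Any help or pointers would be appreciated!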

Answer 1

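There is a pattern in the table that is easy to exploit with XPath. The <tr> tags of the rows that hold the required information lack a class attribute. Fortunately, XPath provides some simple logical operations, including not(). That gives us the functionality we need.

Once we have reduced the set of rows we are dealing with, we can iterate over them and extract the text of the required columns with XPath's element[n] selector. One important note: XPath counts elements starting from 1, so the first column of a table row is td[1].

Sample code using Nokogiri (and RSpec):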

require "rspec"
require "nokogiri"

HTML = <<HTML
<table>
  <thead>
    <tr>
      <td>
        Timestamp
      </td>
      <td>
        Source
      </td>
      <td>
        Destination
      </td>
      <td>
        Duration
      </td>
      <td>
        Cost
      </td>
      <td class='centre'>
        Recording
      </td>
    </tr>
  </thead>
  <tbody>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        03 Jan 13:40
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:01:14
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        30 Dec 20:31
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:02:52
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        24 Dec 00:03
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:09
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        23 Dec 14:56
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:07
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        21 Dec 13:26
      </td>
      <td>
        07793770851
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:26
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
  </tbody>
</table>
HTML

class TableExtractor  
  def extract_data html
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      # at_xpath makes it explicit that td[n] is a 1-based XPath child step
      timestamp   = row.at_xpath("td[1]").text.strip
      source      = row.at_xpath("td[2]").text.strip
      destination = row.at_xpath("td[3]").text.strip
      duration    = row.at_xpath("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end

describe TableExtractor do
  before(:all) do
    @html = HTML
  end

  it "should extract the timestamp properly" do
    subject.extract_data(@html)[0][:timestamp].should eq "03 Jan 13:40"
  end

  it "should extract the source properly" do
    subject.extract_data(@html)[0][:source].should eq "12345678"
  end

  it "should extract the destination properly" do
    subject.extract_data(@html)[0][:destination].should eq "12345679"
  end

  it "should extract the duration properly" do
    subject.extract_data(@html)[0][:duration].should eq "00:01:14"
  end

  it "should extract all informational rows" do
    subject.extract_data(@html).count.should eq 5
  end
end

Answer 2

Your answer is in this Railscast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

This question may also help:

How do I parse an HTML table with Nokogiri?

Answer 3

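You should be able to reach the exact node you need with an XPath selector, starting from the root in the worst case. There are write-ups on using XPath with Nokogiri, and on how XPath can be used to address any element in a document.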
