Parse HTML table using Nokogiri and Mechanize
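Using the following code, I am trying to scrape a call log from our phone provider's web application and bring the information into my Ruby on Rails application.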
desc "Import incoming calls"
task :fetch_incomingcalls => :environment do
  # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls.
  require 'rubygems'
  require 'mechanize'
  require 'logger'
  # Create a new Mechanize object
  agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
  # Load the Phone Provider website
  page = agent.get("https://manage.phoneprovider.co.uk/login")
  # Select the first form
  form = agent.page.forms.first
  form.username = 'username'
  form.password = 'password'
  # Submit the form
  page = form.submit form.buttons.first
  # Click on link called Call Logs
  page = agent.page.link_with(:text => "Call Logs").click
  # Click on link called Incoming Calls
  page = agent.page.link_with(:text => "Incoming Calls").click
  # Prints out table rows
  # puts doc.css('table > tr')
  # Print out the body as a test
  # puts page.body
end
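As you can see from the last five lines, I have tested that 'puts page.body' runs successfully and that the code above works. It logs in, navigates to Call Logs, and then to Incoming Calls. The incoming calls table looks like this: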
| Timestamp | Source | Destination | Duration |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
It is generated from the following code:
<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
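I am trying to work out how to select only the cells I want (Timestamp, Source, Destination and Duration) and output them. After that I can worry about writing them to the database rather than the terminal.
I have tried using Selector Gadget, but when I select several cells it only shows 'td' or 'tr:nth-child(6) td, tr:nth-child(2) td'.
Any help or pointers would be appreciated!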
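There is a pattern in the table that is easy to exploit with XPath. The <tr> tag of each row holding the information you want lacks a class attribute. Fortunately, XPath provides some simple logical operations, including not(), which gives us exactly the filter we need.
Once we have reduced the set of rows we are dealing with, we can iterate over them and extract the text of the necessary columns using XPath's element[n] selector. One important note here is that XPath counts elements starting from 1, so the first column of a table row is td[1].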
Sample code using Nokogiri (and specs):
require "rspec"
require "nokogiri"
HTML = <<HTML
<table>
<thead>
<tr>
<td>
Timestamp
</td>
<td>
Source
</td>
<td>
Destination
</td>
<td>
Duration
</td>
<td>
Cost
</td>
<td class='centre'>
Recording
</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
03 Jan 13:40
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:01:14
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
30 Dec 20:31
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:02:52
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
24 Dec 00:03
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:09
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='e'>
<td></td>
</tr>
<tr>
<td>
23 Dec 14:56
</td>
<td>
12345678
</td>
<td>
12345679
</td>
<td>
00:00:07
</td>
<td></td>
<td class='opt recording'></td>
</tr>
<tr class='o'>
<td></td>
</tr>
<tr>
<td>
21 Dec 13:26
</td>
<td>
07793770851
</td>
<td>
12345679
</td>
<td>
00:00:26
</td>
<td></td>
<td class='opt recording'></td>
</tr>
</tbody>
</table>
HTML
class TableExtractor
  # Returns one hash per data row, i.e. every <tr> in the <tbody> without a class.
  def extract_data html
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      # at_xpath makes sure "td[n]" is evaluated as XPath (1-based), not guessed as CSS.
      timestamp = row.at_xpath("td[1]").text.strip
      source = row.at_xpath("td[2]").text.strip
      destination = row.at_xpath("td[3]").text.strip
      duration = row.at_xpath("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end
describe TableExtractor do
  before(:all) do
    @html = HTML
  end
  it "should extract the timestamp properly" do
    expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40"
  end
  it "should extract the source properly" do
    expect(subject.extract_data(@html)[0][:source]).to eq "12345678"
  end
  it "should extract the destination properly" do
    expect(subject.extract_data(@html)[0][:destination]).to eq "12345679"
  end
  it "should extract the duration properly" do
    expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14"
  end
  it "should extract all informational rows" do
    expect(subject.extract_data(@html).count).to eq 5
  end
end