Targeting elements of an a-tag with Nokogiri when classes don't work
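I'm trying to build a scraper and need help with the following:
I want to grab a bunch of data from an a-tag and from some divs/spans nested inside the same div. My code looks like this: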
page = Nokogiri::HTML(open(website))
page.search('.company').each { |e| companies << e.text.strip }
page.search('.jobtitle').each { |e| jobtitles << e.text.strip }
page.search('.location').each { |e| locations << e.text.strip }
page.xpath('//a[@class="turnstileLink"]').map{ |e| links << e['href'] }
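For the first three (company, title, and location) I get 15 or 16 results, but for the last search my array contains only 10 elements. Oddly, they don't even match the first 10 of the other arrays; they start lining up somewhere around the 3rd or 4th element of the other arrays.
The HTML of a typical card I'm targeting is here: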
<div class="row result clickcard" id="pj_81c3e09223cbc6b3" data-jk="81c3e09223cbc6b3" data-advn="4563763653116462" data-tu="">
<a target="_blank" id="sja1" data-tn-element="jobTitle" class="jobtitle turnstileLink" href="/pagead/clk?mo=r&ad=-6NYlbfkN0DhDTzlYIMy8YIuVE6IrMC_kH05KGZgoAT6LTrcTn8STrwXoiuruouegXiAvJy4qud6xIecRibm3b0Q5eOBkpCiV3R04sAyQbvP7gt6NKZVpCRp32eFzXudmk-TIABX3xEZGo90a47Vz9OofqZaLDh37545RNQ3sFjM6VzWNEWwKf_YoXxeGKcAICj9AADyBuYAY7p9UIUxoox7J5U9gO8Zo2dvRW-i5FJtaUr49Vjsl04W0Jp-CN2azbfp6rrfT6RYFbJ_YAc2iI-L37eeygDtI4KXQwv_elrV8ZLEKo9rkcfEzbE129kX7JKeEq5wJ1dj7GJ4ONH1lIPJQd1gJLoqNYJVQlLTKJiBP72Z0RBmgfZQ-69U8AoEyMT6pytz6iqykLCnO-SxClmvFPJsNV96oBGzpMWtWQeVgGQ49jZfBBRq9Ubw7N73iEjCv6oQ70hcW1P4d8DYK0pCI7vu2KfUh0P9vx8AKC6wY2QoAZeeP4OiBIJ8ikKSIUYJTbe3UwKcLYP7r_3_rx1gY_JO1ReG21ctCxfqGH9DnqTSjz3SYCMZ2ZekooXa&vjs=3&p=1&sk=&fvj=1" title="Private Care Jobs With Elder - Immediate Start - £550 to £750 pw" rel="noopener nofollow" onmousedown="sjomd('sja1'); clk('sja1');" onclick="setRefineByCookie([]); sjoc('sja1',0); convCtr('SJ')">Private Care Jobs With Elder - Immediate Start - £550 to £75...</a>
<br>
<div class="sjcl">
<span class="company">
Elder</span>
<span class="location">London</span>
</div>
<div class="">
<table cellpadding="0" cellspacing="0" border="0"><tbody><tr><td class="snip">
<span class="summary">
Pass a full DBS check or have a valid check already. Access to the internet and a smartphone. At Elder, we’re looking for caring individuals to join our...</span>
</td></tr></tbody></table>
</div>
<div class="sjCapt">
<div class="result-link-bar-container">
<div class="result-link-bar"><span class=" sponsoredGray ">Sponsored</span> - <span id="tt_set_10" class="tt_set"><a id="sj_81c3e09223cbc6b3" href="#" class="sl resultLink save-job-link " onclick="changeJobState('81c3e09223cbc6b3', 'save', 'linkbar', true, ''); return false;" title="Save this job to my.indeed">save job</a></span><div id="editsaved2_81c3e09223cbc6b3" class="edit_note_content" style="display:none;"></div><script>if (!window['sj_result_81c3e09223cbc6b3']) {window['sj_result_81c3e09223cbc6b3'] = {};}window['sj_result_81c3e09223cbc6b3']['showSource'] = false; window['sj_result_81c3e09223cbc6b3']['source'] = "Indeed"; window['sj_result_81c3e09223cbc6b3']['loggedIn'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsLinks'] = false;window['sj_result_81c3e09223cbc6b3']['undoAction'] = "unsave";window['sj_result_81c3e09223cbc6b3']['jobKey'] = "81c3e09223cbc6b3"; window['sj_result_81c3e09223cbc6b3']['myIndeedAvailable'] = true; window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] = window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] || false; window['sj_result_81c3e09223cbc6b3']['resultNumber'] = 10; window['sj_result_81c3e09223cbc6b3']['jobStateChangedToSaved'] = false; window['sj_result_81c3e09223cbc6b3']['searchState'] = "l=London&start=20"; window['sj_result_81c3e09223cbc6b3']['basicPermaLink'] = "https://www.indeed.co.uk"; window['sj_result_81c3e09223cbc6b3']['saveJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['removeJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['requestPending'] = false; window['sj_result_81c3e09223cbc6b3']['notesEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['currentPage'] = "serp"; window['sj_result_81c3e09223cbc6b3']['sponsored'] = true;window['sj_result_81c3e09223cbc6b3']['showSponsor'] = true;window['sj_result_81c3e09223cbc6b3']['reportJobButtonEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsHired'] = false; 
window['sj_result_81c3e09223cbc6b3']['showSaveForSponsored'] = true; window['sj_result_81c3e09223cbc6b3']['showJobAge'] = true;</script></div></div>
<div class="tab-container">
<div class="sign-in-container result-tab"></div>
<div class="tellafriend-container result-tab email_job_content"></div>
</div>
</div>
</div>
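All the cards share the class ".clickcard", and all the relevant links have the class ".turnstileLink", but when I run page.search or page.xpath on them I can't get consistent results: the data in each array is correct, but the arrays come back with different numbers of elements.
So my question is: if I want to scrape the company name, location, job title, the URL of the page, and possibly other values, how do I go about it?
Any help would be appreciated!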
Edit:
The contains() expression needs to be more complex:
contains(
concat(' ',normalize-space(@class),' '),
' turnstileLink '
)
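to keep a class such as turnstileLinkerCar from matching. That is too much trouble, so I would use doc.css() with a CSS selector such as a.turnstileLink. doc.css() takes care of matching exactly the specified class name within a string that may contain multiple class names.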
Try:
doc.xpath('//a[contains(@class, "turnstileLink")]').each{ |e| links << e['href'] }
Or:
doc.css('a.turnstileLink').each{ |e| links << e['href'] }
Here is the problem:
require 'nokogiri'
my_html = %q{
<html>
<body>
<a href="aaa" class="c1">A link</a>
<a href="bbb" class="c1 c2">B link</a>
<a href="ccc" class="c2 c1">C link</a>
<a href="ddd" class="c2 c1 c3">D link</a>
</body>
</html>
}
doc = Nokogiri::HTML(my_html)
links = doc.xpath('//a[@class="c1"]').map{ |e| e["href"] }
p links
--output:--
["aaa"]
The bbb link has the class "c1 c2", which is not equal to "c1".
Response to comment:
require 'nokogiri'
my_html = %q{
<html>
<body>
<div class="x">
<a href="aaa" class="c1">A link</a>
<a href="bbb" class="c1 c2">B link</a>
<a href="ccc" class="c2 c1">C link</a>
<div>
<a href="ddd" class="c2 c1 c3">D link</a>
</div>
</div>
<div class="y">
<a href="yyy" class="c1">Y link</a>
</div>
</body>
</html>
}
doc = Nokogiri::HTML(my_html)
links = doc.css('a.c1').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd", "yyy"]
But:
links = doc.css('div.x a.c1').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd"]
The same goes for XPath:
links = doc.xpath('//div[contains(@class, "x")]//a[contains(@class, "c1")]').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd"]