
Targeting elements of an a-tag with Nokogiri when classes don't work

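I'm trying to build a scraper and need help with the following:

I want to scrape a bunch of data from an a-tag and from some divs/spans nested in the same div. My code looks like this: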

  page = Nokogiri::HTML(open(website))

  page.search('.company').each { |e| companies << e.text.strip }
  page.search('.jobtitle').each { |e| jobtitles << e.text.strip }
  page.search('.location').each { |e| locations << e.text.strip }

  page.xpath('//a[@class="turnstileLink"]').map{ |e| links << e['href'] }

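For the first three (company, job title and location) I get 15 or 16 results, but for the last search my array contains only 10 elements. Strangely, they don't match the first 10 entries of the other arrays either; they start matching somewhere around the 3rd or 4th element of the other arrays.

The HTML of a typical card I'm targeting is here: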

<div class="row result clickcard" id="pj_81c3e09223cbc6b3" data-jk="81c3e09223cbc6b3" data-advn="4563763653116462" data-tu="">
        <a target="_blank" id="sja1" data-tn-element="jobTitle" class="jobtitle turnstileLink" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0DhDTzlYIMy8YIuVE6IrMC_kH05KGZgoAT6LTrcTn8STrwXoiuruouegXiAvJy4qud6xIecRibm3b0Q5eOBkpCiV3R04sAyQbvP7gt6NKZVpCRp32eFzXudmk-TIABX3xEZGo90a47Vz9OofqZaLDh37545RNQ3sFjM6VzWNEWwKf_YoXxeGKcAICj9AADyBuYAY7p9UIUxoox7J5U9gO8Zo2dvRW-i5FJtaUr49Vjsl04W0Jp-CN2azbfp6rrfT6RYFbJ_YAc2iI-L37eeygDtI4KXQwv_elrV8ZLEKo9rkcfEzbE129kX7JKeEq5wJ1dj7GJ4ONH1lIPJQd1gJLoqNYJVQlLTKJiBP72Z0RBmgfZQ-69U8AoEyMT6pytz6iqykLCnO-SxClmvFPJsNV96oBGzpMWtWQeVgGQ49jZfBBRq9Ubw7N73iEjCv6oQ70hcW1P4d8DYK0pCI7vu2KfUh0P9vx8AKC6wY2QoAZeeP4OiBIJ8ikKSIUYJTbe3UwKcLYP7r_3_rx1gY_JO1ReG21ctCxfqGH9DnqTSjz3SYCMZ2ZekooXa&amp;vjs=3&amp;p=1&amp;sk=&amp;fvj=1" title="Private Care Jobs With Elder - Immediate Start - £550 to £750 pw" rel="noopener nofollow" onmousedown="sjomd('sja1'); clk('sja1');" onclick="setRefineByCookie([]); sjoc('sja1',0); convCtr('SJ')">Private Care Jobs With Elder - Immediate Start - £550 to £75...</a>

        <br>
        <div class="sjcl">
        <span class="company">
Elder</span>

<span class="location">London</span>
        </div>
        <div class="">
            <table cellpadding="0" cellspacing="0" border="0"><tbody><tr><td class="snip">
                    <span class="summary">
                        Pass a full DBS check or have a valid check already. Access to the internet and a smartphone. At Elder, we’re looking for caring individuals to join our...</span>
                </td></tr></tbody></table>
            </div>

            <div class="sjCapt">
                <div class="result-link-bar-container">
                        <div class="result-link-bar"><span class=" sponsoredGray ">Sponsored</span> - <span id="tt_set_10" class="tt_set"><a id="sj_81c3e09223cbc6b3" href="#" class="sl resultLink save-job-link " onclick="changeJobState('81c3e09223cbc6b3', 'save', 'linkbar', true, ''); return false;" title="Save this job to my.indeed">save job</a></span><div id="editsaved2_81c3e09223cbc6b3" class="edit_note_content" style="display:none;"></div><script>if (!window['sj_result_81c3e09223cbc6b3']) {window['sj_result_81c3e09223cbc6b3'] = {};}window['sj_result_81c3e09223cbc6b3']['showSource'] = false; window['sj_result_81c3e09223cbc6b3']['source'] = "Indeed"; window['sj_result_81c3e09223cbc6b3']['loggedIn'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsLinks'] = false;window['sj_result_81c3e09223cbc6b3']['undoAction'] = "unsave";window['sj_result_81c3e09223cbc6b3']['jobKey'] = "81c3e09223cbc6b3"; window['sj_result_81c3e09223cbc6b3']['myIndeedAvailable'] = true; window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] = window['sj_result_81c3e09223cbc6b3']['showMoreActionsLink'] || false; window['sj_result_81c3e09223cbc6b3']['resultNumber'] = 10; window['sj_result_81c3e09223cbc6b3']['jobStateChangedToSaved'] = false; window['sj_result_81c3e09223cbc6b3']['searchState'] = "l=London&amp;start=20"; window['sj_result_81c3e09223cbc6b3']['basicPermaLink'] = "https://www.indeed.co.uk"; window['sj_result_81c3e09223cbc6b3']['saveJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['removeJobFailed'] = false; window['sj_result_81c3e09223cbc6b3']['requestPending'] = false; window['sj_result_81c3e09223cbc6b3']['notesEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['currentPage'] = "serp"; window['sj_result_81c3e09223cbc6b3']['sponsored'] = true;window['sj_result_81c3e09223cbc6b3']['showSponsor'] = true;window['sj_result_81c3e09223cbc6b3']['reportJobButtonEnabled'] = false; window['sj_result_81c3e09223cbc6b3']['showMyJobsHired'] = false; 
window['sj_result_81c3e09223cbc6b3']['showSaveForSponsored'] = true; window['sj_result_81c3e09223cbc6b3']['showJobAge'] = true;</script></div></div>
                    <div class="tab-container">
                        <div class="sign-in-container result-tab"></div>
                        <div class="tellafriend-container result-tab email_job_content"></div>
                    </div>
                </div>
        </div>

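All cards have the same class ".clickcard" and all relevant links have the class ".turnstileLink", but when I run page.search or page.xpath on them I can't get consistent results: the arrays come back with different numbers of elements, so their entries don't line up with one another.

So my question is: if I want to scrape the company name, location, job title, the URL of the job page and possibly other values, how do I go about it?

I'd appreciate any help!

Edit:

The contains() expression needs to be more complicated: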

contains(
      concat(' ',normalize-space(@class),' '),
      ' turnstileLink '  
)

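to keep a class such as turnstileLinkerCar from matching. That is too much trouble for me, so I use doc.css() with a CSS selector such as a.turnstileLink; doc.css() takes care of matching the exact class name within an attribute string that may contain several class names.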


Try:

doc.xpath('//a[contains(@class, "turnstileLink")]').each{ |e| links << e['href'] }

Or:

doc.css('a.turnstileLink').each{ |e| links << e['href'] }

Here's the problem:

require 'nokogiri'

my_html = %q{
<html>
  <body>
    <a href="aaa" class="c1">A link</a>
    <a href="bbb" class="c1 c2">B link</a>
    <a href="ccc" class="c2 c1">C link</a>
    <a href="ddd" class="c2 c1 c3">D link</a>
  </body>
</html>
}

doc = Nokogiri::HTML(my_html)
links = doc.xpath('//a[@class="c1"]').map{ |e| e["href"] }

p links

--output:--
["aaa"]

The bbb link's class attribute is "c1 c2", which is not equal to "c1". The same goes for ccc and ddd.

Response to comment

require 'nokogiri'

my_html = %q{
<html>
  <body>
  <div class="x">
    <a href="aaa" class="c1">A link</a>
    <a href="bbb" class="c1 c2">B link</a>
    <a href="ccc" class="c2 c1">C link</a>
    <div>
      <a href="ddd" class="c2 c1 c3">D link</a>
    </div>
  </div>
  <div class="y">
    <a href="yyy" class="c1">Y link</a>
  </div>
  </body>
</html>
}

doc = Nokogiri::HTML(my_html)
links = doc.css('a.c1').map{ |e| e["href"] }
p links

--output:--
["aaa", "bbb", "ccc", "ddd", "yyy"]

But:

links = doc.css('div.x  a.c1').map{ |e| e["href"] }
p links
--output:--
["aaa", "bbb", "ccc", "ddd"]

The same goes for XPath:

links = doc.xpath('//div[contains(@class, "x")]//a[contains(@class, "c1")]').map{ |e| e["href"] }
p links

--output:--
["aaa", "bbb", "ccc", "ddd"]
