
Selenium XPath unable to locate class

I am working on a project to scrape data. I have a for loop that runs through 50 URLs (all the same page layout, just with different information), and I extract different fields from each page to add to a CSV. The problem is that when I try to extract job_title, many of the entries come back as None even though the value is present on the page. The HTML appears to be the same for every URL, but 10 of the 50 URLs yield None for the following lines of code. I need the code to set job_title = 'Founder'.

This is the code I am currently using:

sel = Selector(text=driver.page_source) 
job_title = sel.xpath('//*[starts-with(@class, "t-16 t-black t-bold")]/text()').extract_first()

Here is the HTML from one of the URLs where I was unable to extract job_title, which is 'Founder' in this case. It is in the h3 element near the top of the snippet.

<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
  <h3 class="t-16 t-black t-bold">Founder</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
    Genamint
    <span class="pv-entity__secondary-title separator">Full-time</span>
  </p>
  <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>Mar 2020 – Present</span>
    </h4>
    <h4 class="t-14 t-black--light t-normal">
      <span class="visually-hidden">Employment Duration</span>
      <span class="pv-entity__bullet-item-v2">5 mos</span>
    </h4>
  </div>
  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Location</span>
    <span>New York, United States</span>
  </h4>
  <!---->
</div>

Any help would be appreciated.

Both of those lines grab this HTML: <h3 class="nav-settings__member-name t-16 t-black t-bold"> Ethan Roberti </h3>

There's no nav-settings__member-name in your sample data. Since you're using extract_first(), you get only the first match in document order. One way to fix it is to scope the search to the experience entry:

(//div[contains(@class,"pv-entity__summary")])[1]//h3/text()

Output: Founder

Assuming you're trying to scrape LinkedIn, to get a person's current or most recent job, use the following XPath:

(//section[@class="experience pp-section" or @id="experience-section"]//h3)[1]/text()

For example, for https://www.linkedin.com/in/ethan-roberti-322694174 , you'll get:

Output: Summer Analyst
