
How to remove a hyperlink tag under a header tag using BeautifulSoup

I am trying to scrape a web page. I want to extract only "Freelancer" from the h3 header, but when I run the code below I also get "(More Jobs)", which sits inside an 'a' tag under the header. How can I extract only "Freelancer" from the page linked below?

https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=work+from+home&txtLocation=

My code is:

company_name = job.find('h3', class_='joblist-comp-name').text

Result is: Freelancer (More Jobs)

Expected: Freelancer

You can simply split the string on spaces and take the first element:

from bs4 import BeautifulSoup
html="""<h3 class="joblist-comp-name">Freelancer <a class="jobs-frm-comp" href="/candidate/companySearchResult.html?from=submit&encid=V1VUNYG9OfxywnPTmYOKIg==&searchType=byCompany&luceneResultSize=25">(More Jobs)</h3>
"""
soup=BeautifulSoup(html,"lxml")
soup.find("h3",class_="joblist-comp-name").text.split(" ")[0]

Output:

'Freelancer'
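
Splitting on a space returns only the first word, so a multi-word company name would be cut short. As a minimal sketch of an alternative, assuming the markup resembles the snippet above, you could remove the 'a' tag from the parsed tree with decompose() and then read whatever text remains. The company name "Acme Staffing Services" and the helper name company_name_without_link are made up for illustration, and html.parser is used here only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the company name is invented.
html = """<h3 class="joblist-comp-name">Acme Staffing Services
<a class="jobs-frm-comp" href="/whatever">(More Jobs)</a></h3>"""


def company_name_without_link(markup):
    """Return the text of the h3 after removing any <a> tags inside it."""
    soup = BeautifulSoup(markup, "html.parser")
    h3 = soup.find("h3", class_="joblist-comp-name")
    # decompose() deletes the tag and everything inside it from the tree.
    for link in h3.find_all("a"):
        link.decompose()
    return h3.get_text(strip=True)


name = company_name_without_link(html)
print(name)
```

Note that decompose() modifies the parsed tree in place, so if you still need the link's href, read it before removing the tag.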

Update, using the URL given:

import requests
from bs4 import BeautifulSoup

res=requests.get("https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=work+from+home&txtLocation=")
soup=BeautifulSoup(res.text,"lxml")

This finds the main ul tag and then all of the li tags inside it, which are returned as a list. From that list we can take the first element and read the text of its h3 header:

all_li=soup.find("ul",class_="new-joblist").find_all("li")
all_li[0].find("h3",class_="joblist-comp-name").get_text(strip=True).split("(")[0]

Output:

'Freelancer'
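
To get the company name for every listing rather than just the first, the same idea can be put in a loop. The sketch below is only an illustration: sample_html is a simplified, invented stand-in for the real page (so it runs without a network request), extract_company_names is a hypothetical helper, and instead of splitting on "(" it keeps only the text nodes that are direct children of the h3, which skips anything inside the 'a' tag:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the results page (not the real markup).
sample_html = """
<ul class="new-joblist">
  <li><h3 class="joblist-comp-name">Freelancer
      <a class="jobs-frm-comp" href="/a">(More Jobs)</a></h3></li>
  <li><h3 class="joblist-comp-name">Example Corp Pvt Ltd</h3></li>
  <li><p>listing without a company header</p></li>
</ul>
"""


def extract_company_names(markup):
    """Collect the company name from every job listing in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    job_list = soup.find("ul", class_="new-joblist")
    if job_list is None:
        return []
    names = []
    for li in job_list.find_all("li"):
        h3 = li.find("h3", class_="joblist-comp-name")
        if h3 is None:
            continue  # skip entries that have no company header
        # Keep only text nodes that are direct children of the h3,
        # so text nested inside the <a> tag is ignored.
        direct_text = h3.find_all(string=True, recursive=False)
        name = "".join(direct_text).strip()
        if name:
            names.append(name)
    return names


names = extract_company_names(sample_html)
print(names)
```

With the live page you would pass res.text instead of sample_html. The None checks are there because find() returns None when the page layout differs from what is assumed here.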

Your html is not well formed, but if it's fixed like this:

<h3 class="joblist-comp-name"> Freelancer 
  <a class="jobs-frm-comp" href="/whatever">  More Jobs</a>
</h3>

something like the code below should get you there. It uses the lxml library and an XPath search to zero in on the target. Obviously, you'll have to modify it to fit your actual HTML:

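import lxml.html as lh
# the modified html string from above
company = """<h3 class="joblist-comp-name"> Freelancer 
  <a class="jobs-frm-comp" href="/whatever">  More Jobs</a>
</h3>"""
job = lh.fromstring(company)
# text() selects only the h3's own text nodes, not text inside the <a> tag
job.xpath('//h3[@class="joblist-comp-name"]/text()')[0].strip()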

Output:

'Freelancer'


 