How do I exclude the text of specific tags when scraping a website?

Question

I am trying to scrape a website.

import requests
from bs4 import BeautifulSoup

# Fetch the search results page and parse it
search = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=C%23&txtLocation=').text
soup = BeautifulSoup(search, 'lxml')
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")
for i in jobs:
    date_publishment = i.find("span", class_="sim-posted").span.text
    # Only keep jobs whose posting date contains "few"
    if "few" in date_publishment:
        company_name = i.find("h3", class_="joblist-comp-name").text.replace(" ", "")
        company_skills = i.find("span", class_="srp-skills").text.replace(" ", "")
        description = i.find("ul", class_="list-job-dtl clearfix").text
        # Print the extracted data
        print(f"Company Name:{company_name.strip()}")
        print(f"Skills:{company_skills.strip()}")
        print(f"Description:{description}")
        print("")

<li>
      <label>Job Description:</label>
Sophus is  looking for a Full stack developer with good experience in Dot net technologies for our product Talpal.What is the product you will work for?Talpal is a cloud based... <a href="https://www.timesjobs.com/job-detail/c-net-full-stack-developer-sophus-infotech-india-private-limited-chennai-2-to-4-yrs-jobid-w1ZrmDvR5__PLUS__1zpSvf__PLUS__uAgZw==&amp;source=srp" target="_blank">More Details</a>
      </li>

When I scrape the description, the text of the other tags nested inside the main (li) tag is included as well. Is there any way to scrape only "Sophus is looking for a Full stack developer with good experience in Dot net technologies for our product Talpal.What is the product you will work for?Talpal is a cloud based..."?

Answer 1

You can use :contains to target the right label tag, then next_sibling to move to the description. E.g., within the loop over jobs:

i.select_one('label:contains("Job Description:")').next_sibling.strip()
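The idea above can be tried offline on a trimmed copy of the HTML from the question. This is a minimal sketch, not the answerer's exact code: the helper name extract_description and the example.com link are hypothetical, the built-in html.parser is used instead of lxml so nothing extra needs installing, and :-soup-contains is used on the assumption that a recent soupsieve (2.1 or later) is installed, where :contains is deprecated in its favor.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the <li> from the question; the href is a placeholder.
html = """
<ul class="list-job-dtl clearfix">
  <li>
    <label>Job Description:</label>
Sophus is looking for a Full stack developer with good experience in Dot net technologies for our product Talpal.What is the product you will work for?Talpal is a cloud based... <a href="https://example.com/job" target="_blank">More Details</a>
  </li>
</ul>
"""


def extract_description(li):
    """Return only the bare text that directly follows the label (hypothetical helper)."""
    # Find the <label> whose text contains "Job Description:".
    label = li.select_one('label:-soup-contains("Job Description:")')
    if label is None:
        return ""
    # The node right after the label is the text node holding the description;
    # the <label> and the trailing <a> are separate nodes and are not included.
    sibling = label.next_sibling
    if not isinstance(sibling, NavigableString):
        return ""
    return str(sibling).strip()


soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")
description = extract_description(li)
print(description)
```

Inside the original loop, the same helper could be called as extract_description(i) in place of the .text lookup on the ul element. Returning an empty string when the label is missing is a design choice for this sketch; raising an error would be equally reasonable.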

Answer 1
0 accepted 2021-05-09 00:42:54
