Extracting the text from some HTML tags
I am using BeautifulSoup to scrape job listings from a career page. I am having trouble printing out just the information I need.
This is what the HTML looks like:
<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
Customer Experience Specialist </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>
What I want it to print out is:
Customer Experience Specialist
Chicago, IL
Operations
--------------
The code I tried is this:
section = soup.find_all('div', class_='col col-xs-7 jobs-list')
for elem in section:
    wrappers = elem.find('ul').get_text()
    print(wrappers)
But that prints the text with too many newlines and spaces, like so:
Customer Experience Specialist
Chicago, IL
Operations
Keep in mind there are also about 4 empty lines above the job title and another newline after 'Operations'.
Try this:
sections = soup.find_all('div', class_='col col-xs-7 jobs-list')
for section in sections:
    # find_all() returns a list of tags, so take each tag's text before splitting
    lines = [line.strip() for line in section.get_text().split("\n") if line.strip()]
    print("\n".join(lines))
Regards!
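As an alternative to splitting the text by hand, BeautifulSoup's `stripped_strings` generator yields each text node with surrounding whitespace removed and skips whitespace-only nodes. The following is a minimal sketch, not a definitive solution. It assumes the markup from the question, with the closing `</li>` and `</ul>` tags (missing from the posted snippet) added here so the sample is complete, and it selects on the `list-group-item` class rather than the outer `div`, which the question does not show.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the two final closing tags
# are an assumption, since the posted snippet was cut off.
html = """
<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
Customer Experience Specialist </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

jobs = []
for item in soup.find_all("li", class_="list-group-item"):
    # stripped_strings yields non-empty text nodes, already stripped
    lines = list(item.stripped_strings)
    jobs.append(lines)
    print("\n".join(lines))
    print("--------------")
```

Each entry in `jobs` is a list of the cleaned lines for one listing, so the title, location, and department can also be used individually instead of being printed.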
After the get_text() call, add rstrip() to remove all trailing newlines. This removes all trailing whitespace, not just a single newline. Note that rstrip() leaves leading blank lines in place; strip() removes whitespace at both ends.
Otherwise, if there is only one line in the string S, use S.splitlines()[0].
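To make the difference between these string methods concrete, here is a small sketch using only built-in `str` methods. The `raw` string is a hypothetical stand-in for what get_text() might return for the listing in the question. The exact whitespace is an assumption.

```python
# Hypothetical get_text() output: blank lines before, between, and after the text
raw = "\n\n\n\nCustomer Experience Specialist\n\nChicago, IL\nOperations\n\n"

# rstrip() removes trailing whitespace only; the leading blank lines remain
trailing_only = raw.rstrip()

# strip() removes whitespace at both ends, but not blank lines in the middle
both_ends = raw.strip()

# splitlines()[0] picks the first line once the leading blanks are gone
first_line = both_ends.splitlines()[0]

# Dropping every blank line requires filtering line by line
cleaned = [line.strip() for line in raw.splitlines() if line.strip()]

print("\n".join(cleaned))
```

Here rstrip() alone would still print the four leading blank lines, which is why the line-by-line filter is the closest match to the output the question asks for.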