
Extracting the text from some HTML tags

I am using BeautifulSoup to scrape job listings on a careers page. I am having trouble printing out just the information I need.

This is what the HTML looks like:

<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                                        Customer Experience Specialist                                    </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>

What I want it to print out is:

Customer Experience Specialist
Chicago, IL
Operations
--------------

The code I tried is this:

section = soup.find_all('div', class_='col col-xs-7 jobs-list')
for elem in section:
    wrappers = elem.find('ul').get_text()
    print(wrappers)

But that prints the text with too many newlines and spaces, like so:

                                        Customer Experience Specialist                                    


Chicago, IL
Operations

Keep in mind that there are also about 4 empty lines above the job title and another newline after 'Operations'.

Try this:

sections = soup.find_all('div', class_='col col-xs-7 jobs-list')
for section in sections:
    # find_all() returns tags, not a string, so call get_text() before splitting
    lines = [line.strip() for line in section.get_text().split("\n") if line.strip()]
    print("\n".join(lines))

Regards!
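Another way to get the same result is to let BeautifulSoup do the whitespace cleanup through the tag's stripped_strings generator. The following is only a minimal sketch: the sample markup is modeled on the question, the enclosing div is assumed from the class name used in the question's find_all() call, and job_lines is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question. The enclosing <div> is an assumption
# based on the class name passed to find_all() in the question.
html = """
<div class="col col-xs-7 jobs-list">
<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                    Customer Experience Specialist                </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>
</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def job_lines(item):
    """Return the non-empty text fragments of one listing, whitespace-stripped.

    Hypothetical helper: stripped_strings skips whitespace-only strings and
    strips the rest, so no manual splitting is needed.
    """
    return list(item.stripped_strings)


# Each job is one <li class="list-group-item">; the nested <li> tags carry no class.
for item in soup.find_all("li", class_="list-group-item"):
    print("\n".join(job_lines(item)))
    print("--------------")
```

With the sample markup this prints the job title, the location and the department on separate lines, followed by the dashed separator.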

After the get_text() call, add rstrip() to remove all trailing newlines. This removes all trailing whitespace, not just a single newline.

Otherwise, if the string S contains only one line, use S.splitlines()[0] to get it without the trailing newline.
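To see what these two suggestions do, here is a small sketch on a plain string. The sample text only approximates what get_text() returned in the question; its exact spacing is an assumption.

```python
# Stand-in for the get_text() result from the question: leading blank lines,
# a padded title and a trailing newline. The exact spacing is illustrative.
text = "\n\n\n\n      Customer Experience Specialist      \n\n\nChicago, IL\nOperations\n"

# rstrip() trims only the right-hand end: the trailing newline goes,
# but the leading blank lines and the padding around the title stay.
trailing_removed = text.rstrip()

# For a string that holds a single line, splitlines()[0] returns that line
# without its line ending.
single = "Operations\n"
first_line = single.splitlines()[0]

# For the multi-line case, strip each line and drop the empty ones.
lines = [line.strip() for line in text.splitlines() if line.strip()]

print(repr(trailing_removed))
print(first_line)
print("\n".join(lines))
```

Because rstrip() leaves the leading blank lines in place, the multi-line output from the question needs the per-line cleanup shown in the last step, or strip() instead of rstrip() if only the ends matter.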
