Extracting text from some HTML tags

Question

I am using BeautifulSoup to scrape job listings on a careers page. I am having trouble printing just the information I need.

This is what the HTML looks like:

<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                                        Customer Experience Specialist                                    </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>

What I want it to print out is:

Customer Experience Specialist
Chicago, IL
Operations
--------------

The code I tried is this:

section = soup.find_all('div', class_='col col-xs-7 jobs-list')
for elem in section:
    wrappers = elem.find('ul').get_text()
    print(wrappers)

But that prints the text with too many newlines and spaces, like so:

                                        Customer Experience Specialist                                    


Chicago, IL
Operations

Keep in mind that there are also about 4 empty lines above the job title and another newline after 'Operations'.

Answer 1

Try this:

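sections = soup.find_all('div', class_='col col-xs-7 jobs-list')
for section in sections:
    # find_all() returns tags, not a string, so take each tag's text first,
    # then keep only the non-blank lines with their padding stripped
    lines = [line.strip() for line in section.get_text().split("\n") if line.strip()]
    print("\n".join(lines))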
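A self-contained sketch of the line-filtering idea above, run against markup modelled on the question. The wrapping div and the closing tags are assumptions, since the question shows only a fragment, and job_lines is a hypothetical helper name used for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The outer <div> and the closing
# </li>, </ul>, </div> tags are assumptions; the question shows a fragment.
HTML = """
<div class="col col-xs-7 jobs-list">
<ul class="list-group">
<li class="list-group-item">
<h4 class="list-group-item-heading">
<a href="http://careers.steelseries.com/apply/3LXwyjYOrb/Customer-Experience-Specialist">
                                        Customer Experience Specialist                                    </a>
</h4>
<ul class="list-inline list-group-item-text">
<li><i class="fa fa-map-marker"></i>Chicago, IL</li>
<li><i class="fa fa-sitemap"></i>Operations</li>
</ul>
</li>
</ul>
</div>
"""


def job_lines(item):
    """Return the non-blank text lines of one listing, padding stripped."""
    text = item.get_text()
    return [line.strip() for line in text.split("\n") if line.strip()]


soup = BeautifulSoup(HTML, "html.parser")

listings = []
for section in soup.find_all("div", class_="col col-xs-7 jobs-list"):
    # Each job is one <li class="list-group-item"> inside the section
    for item in section.find_all("li", class_="list-group-item"):
        listings.append(job_lines(item))

for lines in listings:
    print("\n".join(lines))
    print("--------------")
```

Handling one listing at a time keeps the separator line between jobs. BeautifulSoup also provides the stripped_strings generator and get_text(separator, strip=True), which may be worth trying if the per-line filtering feels too manual.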

Regards!

Answer 2

After the get_text() call, add rstrip() to remove the trailing newlines. This removes all trailing whitespace, not just a single newline.

Otherwise, if the string S contains only one line, use S.splitlines()[0].
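A small sketch of what these calls do, using a plain string shaped like the output in the question (the exact amount of padding is an assumption). It also shows strip(), which this answer does not mention, for comparison with rstrip().

```python
# Text shaped like the get_text() output in the question: blank lines and
# padding before the title, and a newline after the last field.
raw = (
    "\n\n\n\n        Customer Experience Specialist        "
    "\n\n\nChicago, IL\nOperations\n"
)

trailing_only = raw.rstrip()  # removes whitespace at the end only
both_ends = raw.strip()       # removes whitespace at both ends

# Single-line case: splitlines() drops the line terminator
first_line = "Chicago, IL\n".splitlines()[0]

print(repr(trailing_only))
print(repr(both_ends))
print(first_line)
```

Note that rstrip() leaves the blank lines above the job title in place, and neither call touches the blank lines between the title and the location, so some per-line filtering is likely still needed for the output the question asks for.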

Answer 1
1 2020-02-21 17:09:12

Answer 2
0 2020-02-21 17:08:43
