Scraping with Python and Beautifulsoup

Question

Following on from "Beautifulsoup - trouble scraping datalist with links in it".

This is an example of the HTML I'm scraping with Python/Beautifulsoup:

<dl>
<dd>
    <strong>
        <a name="45933" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45933">TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS</a>
    </strong>
    <br>
    Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.
</dd>

<dd></dd>
<dd></dd>
<dd></dd>
</dl>

This is my program:

import bs4 as bs
import urllib.request


def chinaJobs():
    sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()

    soup = bs.BeautifulSoup(sauce, 'html.parser')

    ads = []

    for dd in soup.find_all('dd'):
        link = dd.a.get('href')
        link_text = dd.a.text
        link_text = link_text.lower()
        *_, dd_text = dd.stripped_strings

        if 'university' in link_text:
            ads.append([link, link_text, dd_text])

    for ad in ads:          
        for job in ad:
            print(job)
        print("")

chinaJobs()

I can get the information after the <br> tag, but it's the wrong information. This is what the information on the website looks like:

TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS

Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.

This is what I would like my output to look like:

http://www.eslcafe.com/jobs/china/index.cgi?read=45933
top ranking university seeks professional lecturers
Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.

This is what my output looks like:

http://www.eslcafe.com/jobs/china/index.cgi?read=45933
top ranking university seeks professional lecturers
EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 p.m.

This line is printed with every ad in the output:

EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 p.m.

Why is it doing this, and what can I do to fix it?

Answer 1

The string you are looking for is wrapped in the <br> tag, so one option is to use br to extract it:

soup.find("dd").a.text
# u'TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS'

soup.find('dd').a.get("href")
# u'http://www.eslcafe.com/jobs/china/index.cgi?read=45933'

soup.find('dd').br.text.strip()
# u'Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.'

You can try changing the dd_text line to dd_text = dd.br.text.strip().
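For reference, here is a minimal sketch of how that change could fit into the loop from the question. It runs against the HTML snippet from the question rather than the live site, and it rests on a few assumptions. The helper names text_after_br and university_ads are made up for illustration. Whether the text ends up inside <br> or right after it depends on the parser, so the helper tries dd.br's own text first and then falls back to the node that follows it. The repeated last line in the question would be consistent with the <dd> elements ending up nested on the live page, since dd.stripped_strings walks every descendant, but that is a guess about the page and not something verified here.

```python
from bs4 import BeautifulSoup, NavigableString

# The sample markup from the question, used here instead of fetching the site.
SAMPLE_HTML = """
<dl>
<dd>
    <strong>
        <a name="45933" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45933">TOP RANKING UNIVERSITY SEEKS PROFESSIONAL LECTURERS</a>
    </strong>
    <br>
    Chongqing University -- Tuesday, 14 March 2017, at 6:58 a.m.
</dd>

<dd></dd>
<dd></dd>
<dd></dd>
</dl>
"""


def text_after_br(dd):
    """Return the text attached to the first <br> in dd (hypothetical helper)."""
    br = dd.br
    if br is None:
        return None
    # Case 1: the parser put the text inside the <br> element.
    inside = br.get_text(strip=True)
    if inside:
        return inside
    # Case 2: the parser treated <br> as empty, so the text is its next sibling.
    sibling = br.next_sibling
    if isinstance(sibling, NavigableString):
        return sibling.strip() or None
    return None


def university_ads(html):
    """Collect [link, lowercased link text, line after <br>] for matching ads."""
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for dd in soup.find_all("dd"):
        if dd.a is None:
            # Skip empty <dd> elements, which have no link to read.
            continue
        link = dd.a.get("href")
        link_text = dd.a.get_text(strip=True).lower()
        if "university" in link_text:
            ads.append([link, link_text, text_after_br(dd)])
    return ads


if __name__ == "__main__":
    for ad in university_ads(SAMPLE_HTML):
        for field in ad:
            print(field)
        print("")
```

Reading only the first <br> of each <dd>, instead of taking the last of all stripped strings, keeps each ad tied to its own line even if later entries are nested inside it.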

Answer 1
0 2017-03-14 23:03:37
