Web scraping issue while scraping from one table using BeautifulSoup

import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.freejobalert.com/ap-govt-jobs/144586/')
c = page.content
soup = BeautifulSoup(c,"html5lib")
row = soup.find_all("table")[0].find_all('tr')
dict = {}
for i in row:
    for title in i.find_all('span', attrs={'style':'color: #008000;'}):
        dict['Title'] = title.text
    for link in i.find_all('a',title=True, href=True):
        dict['Link'] = link['href']
        print(dict)

With this code I get empty output.

I expect:

{'Link': 'http://www.freejobalert.com/wp-content/uploads/2018/08/Detailed-Notification-Directorate-of-Public-Health-Family-Welfare-Vijayawada-Civil-Assistant-Surgeon-Posts.pdf', 'Title': 'Detailed Notification'}
{'Link': 'http://www.freejobalert.com/wp-content/uploads/2018/08/Notification-Directorate-of-Public-Health-Family-Welfare-Vijayawada-Civil-Assistant-Surgeon-Posts.pdf', 'Title': 'Notification '}
{'Link': 'http://cfw.ap.nic.in/', 'Title': ' Official Website'}

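I am trying to scrape only the first table, but I get data from all the tables. I only want the important links from the first table, but I get both. Please take a look at my code.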

I tested your code and it runs fine for me. The only change I made was renaming dict to some_dict, as follows:

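import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with the html5lib parser.
page = requests.get('http://www.freejobalert.com/ap-govt-jobs/144586/')
c = page.content
soup = BeautifulSoup(c, "html5lib")

# Only look at the rows of the first table on the page.
row = soup.find_all("table")[0].find_all('tr')

# Named some_dict rather than dict so the built-in is not shadowed.
some_dict = {}
for i in row:
    # The green <span> holds the title of the link.
    for title in i.find_all('span', attrs={'style': 'color: #008000;'}):
        some_dict['Title'] = title.text
    # Print one entry per anchor that has both a title and an href.
    for link in i.find_all('a', title=True, href=True):
        some_dict['Link'] = link['href']
        print(some_dict)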

I renamed the variable because the name dict shadows Python's built-in dict class. My output is:

{'Title': 'Detailed Notification', 'Link': 'http://www.freejobalert.com/wp-content/uploads/2018/08/Detailed-Notification-Directorate-of-Public-Health-Family-Welfare-Vijayawada-Civil-Assistant-Surgeon-Posts.pdf'}
{'Title': 'Notification ', 'Link': 'http://www.freejobalert.com/wp-content/uploads/2018/08/Notification-Directorate-of-Public-Health-Family-Welfare-Vijayawada-Civil-Assistant-Surgeon-Posts.pdf'}
{'Title': ' Official Website', 'Link': 'http://cfw.ap.nic.in/'}
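
To illustrate what shadowing the built-in means in practice, here is a minimal sketch that needs no network access. The function name shadow_demo is made up for this example, and the URL is just the one from the output above. It shows that once a variable is named dict, calling dict(...) in that scope fails, while the built-in remains reachable through the builtins module.

```python
import builtins


def shadow_demo():
    """Return the error message raised when a shadowed `dict` is called."""
    dict = {}  # local variable that hides the built-in dict class
    dict['Title'] = 'Detailed Notification'
    try:
        # `dict` is now a dict instance, so calling it raises TypeError.
        dict(Link='http://cfw.ap.nic.in/')
    except TypeError as exc:
        return str(exc)
    return None


error_message = shadow_demo()
print(error_message)

# The real class is still available via the builtins module.
restored = builtins.dict(Link='http://cfw.ap.nic.in/')
print(restored)
```

Shadowing on its own does not stop item assignment such as dict['Title'] = ... from working, so it is mainly a hazard for any later code that expects dict to be the class.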

Does it work for you if you rename dict to something else?
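
If the goal is one entry per row of the first table only, a possible variation is to build a fresh dictionary for each row instead of reusing one. The sketch below is an assumption-laden illustration, not the page's real markup: SAMPLE_HTML, the example.com URLs, and the helper name first_table_links are all invented, and it uses the standard library's html.parser so it runs offline.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates two tables with green title spans and links.
SAMPLE_HTML = """
<table>
  <tr><td><span style="color: #008000;">Detailed Notification</span></td>
      <td><a title="Detailed" href="http://example.com/detailed.pdf">Click Here</a></td></tr>
  <tr><td><span style="color: #008000;">Official Website</span></td>
      <td><a title="Site" href="http://example.com/">Click Here</a></td></tr>
</table>
<table>
  <tr><td><span style="color: #008000;">Other</span></td>
      <td><a title="Other" href="http://example.com/other">Click Here</a></td></tr>
</table>
"""


def first_table_links(html):
    """Return a list of {'Title', 'Link'} dicts from the first table only."""
    soup = BeautifulSoup(html, "html.parser")
    first_table = soup.find("table")  # first <table> in document order
    if first_table is None:
        return []
    results = []
    for tr in first_table.find_all("tr"):
        title = tr.find("span", attrs={"style": "color: #008000;"})
        link = tr.find("a", title=True, href=True)
        # Skip rows that lack either the title span or a usable link.
        if title is not None and link is not None:
            results.append({
                "Title": title.get_text(strip=True),
                "Link": link["href"],
            })
    return results


links = first_table_links(SAMPLE_HTML)
for entry in links:
    print(entry)
```

Creating a new dictionary per row avoids carrying a stale Title over from an earlier row, and get_text(strip=True) removes the stray spaces seen in values such as ' Official Website'.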
