Crawl data from an HTML table in Python
I am a beginner in web crawling and I need help getting values from a table. I have all the required fields (LOCATION, DATE, SUMMARY, DEADLINE). The SUMMARY cell also links to another page, and I want that URL appended alongside the other fields, i.e. (LOCATION, DATE, SUMMARY, DEADLINE, URL).
This is my code so far, but it is not working.
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'
amount_of_pages = 2  # 5194
rows = []

for i in range(1, amount_of_pages):
    response = rq.get(url.format(i))
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'id': 'datatable'})
        headers = []
        for th in table.find("tr").find_all("th"):
            headers.append(th.text.strip())
        for tr in table.find_all("tr")[1:]:
            cells = []
            tds = tr.find_all("td")
            if len(tds) == 0:
                ths = tr.find_all("th")
                for th in ths:
                    cells.append(th.text.strip())
            else:
                for td in tds:
                    cells.append(td.text.strip())
                    cells.append('https://www.tendersinfo.com/' + td.find('a')['href'])
            rows.append(cells)
Here you go, I just re-coded the majority of it.
import requests as rq
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.tendersinfo.com/global-information-technology-tenders-{}.php'
amount_of_pages = 10  # Max is 5194 currently
rows = []
headers = []

for i in range(1, amount_of_pages + 1):
    response = rq.get(url.format(i))
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'id': 'datatable'})
        headers = [th.text.strip() for th in table.find('tr').find_all('th')]
        for tr in table.find_all('tr')[1:]:
            tds = tr.find_all('td')
            if not tds:  # skip repeated header rows
                continue
            cells = [td.text.strip() for td in tds]
            # The SUMMARY cell links to the tender's detail page; append that URL.
            link = tr.find('a')
            cells.append('https://www.tendersinfo.com/' + link['href'] if link else '')
            rows.append(cells)

pd.DataFrame(rows, columns=headers + ['URL']).to_csv(
    r"C:\Users\HP\Desktop\Web Scraping (RFP's)\RFP_SCRAPED_DATA.csv", index=False)
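One caveat worth noting: concatenating 'https://www.tendersinfo.com/' with the href only works if every href is relative. The standard library's urljoin handles relative and absolute hrefs alike, so it is a safer way to build the full URL (the detail-page paths below are invented for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.tendersinfo.com/'

# urljoin resolves a relative href against the base URL...
print(urljoin(base, 'tender_details.php?id=1'))
# → https://www.tendersinfo.com/tender_details.php?id=1

# ...and leaves an already-absolute href untouched, instead of
# producing a doubled-up URL as string concatenation would.
print(urljoin(base, 'https://www.tendersinfo.com/abs.php'))
# → https://www.tendersinfo.com/abs.php
```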
Since you're using pandas, why not use read_html, which returns the extracted tables as a list of DataFrames.
>>> tables = pd.read_html("https://www.tendersinfo.com/global-information-technology-tenders.php")
>>> tables[1]
LOCATION DATE SUMMARY DEADLINE
0 India 21-May-2020 Liquid Crystal Display Lcd Panel Or Monitors. 01-Jun-2020
1 India 21-May-2020 Random Access Memory. 01-Jun-2020
2 India 21-May-2020 Supply Of Analog Transceiver-handheld. 01-Jun-2020
3 India 21-May-2020 Supply Of Computer Printers. 01-Jun-2020
4 India 21-May-2020 All In One Pc. 01-Jun-2020
You get the table easily using pd.read_html and can save the data to a CSV file using df.to_csv().
import pandas as pd
url = "https://www.tendersinfo.com/ajax_all_new_search.php?country=information-technology&increment=1&%20select=500&%20total=259655&%20search_id=19906&%20order=id&%20imagevalue=1"
df = pd.read_html(url)[0]
df.to_csv("RFP_SCRAPED_DATA.csv", index=False)
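Note that read_html on its own keeps only the cell text, so the URL column the asker wanted would still be missing. If pandas ≥ 1.5 is available, the extract_links argument preserves the hrefs during the parse. A minimal sketch on an inline table shaped like the tender listing (the href value here is made up for illustration):

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for the tender table: the SUMMARY cell carries the link.
html = """
<table id="datatable">
  <tr><th>LOCATION</th><th>SUMMARY</th></tr>
  <tr><td>India</td><td><a href="detail.php?id=1">Supply Of Computer Printers.</a></td></tr>
</table>
"""

# extract_links='body' turns every body cell into a (text, href) tuple;
# href is None for cells that contain no <a> tag.
df = pd.read_html(StringIO(html), extract_links='body')[0]
df['URL'] = df['SUMMARY'].str[1]       # href half of the tuple
df['SUMMARY'] = df['SUMMARY'].str[0]   # text half
df['LOCATION'] = df['LOCATION'].str[0]
print(df)
```

The same pattern applied to the real listing pages yields the (LOCATION, DATE, SUMMARY, DEADLINE, URL) rows in one pass, without hand-rolling the BeautifulSoup loop.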