How to capture a table in a structured format from this website using Beautiful Soup and Pandas?
I want to scrape the table from this website "" and, as it keeps getting updated hourly, I want to track changes as well. I tried scraping the data using Selenium, but it all ended up in one column without any table structure. How can I use pandas and Beautiful Soup to scrape the table in a structured format and also track changes? This is the code I'm trying to figure out:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'id': 'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Notice No", "Subject", "Segment Name", "Category Name", "Department", "PDF"])
print(df)
It would be a great help if you could show me how to get the data and how to keep track of new data whenever I run the script again.
Be informed that you don't need to include params, as the desired information is presented on the main page; I've left it in for you in case you want to scrape a different id.

Also be informed that I skipped the PDF column, as it would only show NaN values: the pdf links are not hyperlinks, just a logo icon stored on the server. Once you click on the pdf logo, the page makes a POST request to the target to download the file.

Based on that, and without clearer information from you, here's an answer for your requirements:
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

params = {
    'id': 0,
    'txtscripcd': '',
    'pagecont': '',
    'subject': ''
}

def main(url):
    r = requests.get(url, params=params, headers=headers)
    # the notices table is the last table on the page; drop the PDF column
    df = pd.read_html(r.content)[-1].iloc[:, :-1]
    print(df)

main("https://www.bseindia.com/markets/MarketInfo/NoticesCirculars.aspx")
Output:
Notice No Subject Segment Name Category Name Department
0 20200923-2 Offer to Buy – Acquisition Window (Delisting) ... Equity Trading Trading Operations
1 20200923-1 Change in Name of the Company. Debt Company related Listing Operations
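As for tracking changes between runs: one way (a sketch only; the snapshot filename and the new_rows helper are my own, not part of the site or of pandas) is to persist each run to a CSV and, on the next run, use DataFrame.merge with indicator=True to find rows that were absent from the previous snapshot:

```python
import os
import pandas as pd

def new_rows(current, snapshot_path):
    """Return rows of `current` absent from the last CSV snapshot, then refresh it."""
    current = current.astype(str)  # align dtypes with what read_csv returns
    if os.path.exists(snapshot_path):
        previous = pd.read_csv(snapshot_path, dtype=str)
        # left merge on all columns; "left_only" marks rows not seen last time
        merged = current.merge(previous, how="left", indicator=True)
        fresh = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    else:
        fresh = current  # first run: every row counts as new
    current.to_csv(snapshot_path, index=False)
    return fresh

# Demo with synthetic notices instead of a live request (filename is hypothetical):
run1 = pd.DataFrame({"Notice No": ["20200923-1"], "Subject": ["Change in Name"]})
run2 = pd.DataFrame({"Notice No": ["20200923-1", "20200923-2"],
                     "Subject": ["Change in Name", "Offer to Buy"]})

print(new_rows(run1, "notices_snapshot.csv"))  # whole first run reported as new
print(new_rows(run2, "notices_snapshot.csv"))  # only the 20200923-2 row is new
os.remove("notices_snapshot.csv")
```

In your script you would pass the DataFrame returned by pd.read_html into new_rows instead of the demo frames; anything it returns is a notice added since the last hourly run.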