How to capture a table in a structured format from this website using Beautiful Soup and Pandas?
I want to scrape the table from this website "" and, as it keeps getting updated hourly, I want to track changes as well. I tried scraping the data using Selenium, but it all ended up in one column without any table structure. How can I use pandas and Beautiful Soup to scrape the table in a structured format and also track changes? This is the code I'm trying to figure out:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'id': 'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Notice No", "Subject", "Segment Name", "Category Name", "Department", "PDF"])
print(df)
It would be a great help if you could show me how to get the data and how to keep track of new data whenever I run the script again.
Be informed that you don't need to include params, as the desired information is presented on the main page; I've left it in for you in case you want to scrape a different id.

Also be informed that I skipped the PDF column, as it would only show NaN values: the pdf links are not hyperlinks, just a logo icon stored on the server. Once you click on the pdf logo, the page makes a POST request to the target to download the file.

Based on that, and without clearer information from you, here's an answer for your requirements:
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

params = {
    'id': 0,
    'txtscripcd': '',
    'pagecont': '',
    'subject': ''
}

def main(url):
    r = requests.get(url, params=params, headers=headers)
    # the notices table is the last table on the page; drop the PDF column
    df = pd.read_html(r.content)[-1].iloc[:, :-1]
    print(df)

main("https://www.bseindia.com/markets/MarketInfo/NoticesCirculars.aspx")
Output:
Notice No Subject Segment Name Category Name Department
0 20200923-2 Offer to Buy – Acquisition Window (Delisting) ... Equity Trading Trading Operations
1 20200923-1 Change in Name of the Company. Debt Company related Listing Operations
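As for tracking changes between runs: one way (a sketch only; the snapshot filename and the new_rows helper are my own, not part of the site or of pandas) is to persist each run to a CSV and, on the next run, use DataFrame.merge with indicator=True to find rows that were absent from the previous snapshot:

```python
import os
import pandas as pd

def new_rows(current, snapshot_path):
    """Return rows of `current` absent from the last CSV snapshot, then refresh it."""
    current = current.astype(str)  # align dtypes with what read_csv returns
    if os.path.exists(snapshot_path):
        previous = pd.read_csv(snapshot_path, dtype=str)
        # left merge on all columns; "left_only" marks rows not seen last time
        merged = current.merge(previous, how="left", indicator=True)
        fresh = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    else:
        fresh = current  # first run: every row counts as new
    current.to_csv(snapshot_path, index=False)
    return fresh

# Demo with synthetic notices instead of a live request (filename is hypothetical):
run1 = pd.DataFrame({"Notice No": ["20200923-1"], "Subject": ["Change in Name"]})
run2 = pd.DataFrame({"Notice No": ["20200923-1", "20200923-2"],
                     "Subject": ["Change in Name", "Offer to Buy"]})

print(new_rows(run1, "notices_snapshot.csv"))  # whole first run reported as new
print(new_rows(run2, "notices_snapshot.csv"))  # only the 20200923-2 row is new
os.remove("notices_snapshot.csv")
```

In your script you would pass the DataFrame returned by pd.read_html into new_rows instead of the demo frames; anything it returns is a notice added since the last hourly run.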