
How to capture table in a structured format from this website using Beautiful Soup and Pandas?

I want to scrape the table from this website "" and, since it is updated hourly, I also want to track changes. I tried scraping the data with Selenium, but everything ended up in a single column without any table structure. How can I use pandas and Beautiful Soup to scrape the table in a structured format and track changes as well? This is the code I'm trying to figure out:

import pandas as pd
from bs4 import BeautifulSoup

# `html` is assumed to already hold the page source (e.g. fetched with Selenium or requests)
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'id': 'subs noBorders evenRows'})
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)

df = pd.DataFrame(res, columns=["Notice No", "Subject", "Segment Name", "Category Name", "Department", "PDF"])
print(df)

It would be a help if you could show me how to get the data and how to keep track of new data whenever I run the script again.

Note that you don't need to include params, as the desired information is presented on the main page. I've left it in, in case you want to scrape a different id.

Also note that I skipped the PDF column, since it would only show NaN values: the PDF link is not a hyperlink, just a logo icon stored on the server. Once you click the PDF logo, the page makes a POST request to the target to download the file. Since you didn't give more details about what you need from it, here's an answer covering the rest of your requirements.

import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

params = {
    'id': 0,
    'txtscripcd': '',
    'pagecont': '',
    'subject': ''
}


def main(url):
    r = requests.get(url, params=params, headers=headers)
    df = pd.read_html(r.content)[-1].iloc[:, :-1]
    print(df)


main("https://www.bseindia.com/markets/MarketInfo/NoticesCirculars.aspx")

Output:


   Notice No                                            Subject Segment Name    Category Name          Department
0  20200923-2  Offer to Buy – Acquisition Window (Delisting) ...       Equity          Trading  Trading Operations
1  20200923-1                     Change in Name of the Company.         Debt  Company related  Listing Operations
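As for tracking changes between runs: one simple approach is to keep a local snapshot of the last scrape and diff against it on every run. The sketch below is an assumption on my part (the snapshot file name `notices.csv` and the use of "Notice No" as a unique key are not part of the site's documentation, just a reasonable guess given that notice numbers look unique):

```python
# Hypothetical change-tracking sketch: diff the freshly scraped DataFrame
# against a CSV snapshot saved by the previous run, then update the snapshot.
# Assumes "Notice No" uniquely identifies a row (an assumption, not verified).
import os

import pandas as pd

SNAPSHOT = "notices.csv"  # local file holding the previous scrape


def new_rows(current: pd.DataFrame, snapshot_path: str = SNAPSHOT) -> pd.DataFrame:
    """Return rows of `current` whose "Notice No" was not seen in the
    previous run, then overwrite the snapshot with the latest data."""
    if os.path.exists(snapshot_path):
        seen = set(pd.read_csv(snapshot_path)["Notice No"].astype(str))
        fresh = current[~current["Notice No"].astype(str).isin(seen)]
    else:
        fresh = current  # first run: every row counts as new
    current.to_csv(snapshot_path, index=False)
    return fresh
```

You would call `new_rows(df)` right after building the DataFrame in `main`; scheduling the script hourly (cron, Task Scheduler) then gives you the new notices each run.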
