
How to capture table in a structured format from the website using Beautiful soup and Pandas or any other method?

I want to scrape the table 'Summary statement holding of specified securities' from this website: https://www.bseindia.com/stock-share-price/infosys-ltd/infy/500209/shareholding-pattern/ . I tried scraping the data with Selenium, but it all came back in a single column with no table structure, and the table has no unique identifier. How can I use Pandas and Beautiful Soup (or any other method) to scrape the table in a structured format? This is the code I'm trying to figure out, but it didn't work:

import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"
}

params = {
    'id': 0,
    'txtscripcd': '',
    'pagecont': '',
    'subject': ''
}

def main(url):
    r = requests.get(url, params=params, headers=headers)
    df = pd.read_html(r.content)[-1].iloc[:, :-1]
    print(df)

main("")

To load the table into a DataFrame and a CSV file, you can use this example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
api_url = 'https://api.bseindia.com/BseIndiaAPI/api/shpSecSummery_New/w?qtrid=&scripcode=500209'

# The API returns JSON; its 'Data' field contains the HTML of the shareholding-pattern page
soup = BeautifulSoup(requests.get(api_url, headers=headers).json()['Data'], 'lxml')

# Find the bold heading of the required table, then take the <table> that follows it
table = soup.select_one('b:contains("Summary statement holding of specified securities")').find_next('table')

# Parse the table with pandas and skip the first two rows
df = pd.read_html(str(table))[0].iloc[2:, :]

df.to_csv('data.csv')

Saves data.csv (screenshot from LibreOffice):

[screenshot: the resulting data.csv opened in LibreOffice]

The data you are looking for is served by the following API endpoint:

https://api.bseindia.com/BseIndiaAPI/api/shpSecSummery_New/w?qtrid=&scripcode=500209

Where scripcode is the unique identifier.

The API does not check for cookies or a session, so a direct call to this endpoint will return the data you are looking for.
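
As a minimal sketch of that direct call (assuming, as in the answer above, that the JSON response carries the table HTML in its 'Data' field), you can let pandas parse the returned markup and pick out the table you need:

import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
# scripcode identifies the company; 500209 is Infosys on BSE
api_url = 'https://api.bseindia.com/BseIndiaAPI/api/shpSecSummery_New/w?qtrid=&scripcode=500209'

# The endpoint answers with JSON; the 'Data' field holds the HTML of the shareholding page
html = requests.get(api_url, headers=headers).json()['Data']

# pandas extracts every <table> in that HTML fragment; inspect the shapes and keep the one you need
tables = pd.read_html(html)
for i, t in enumerate(tables):
    print(i, t.shape)

Once you know the index of the summary table, select it from the list and clean it as in the example above (e.g. dropping the repeated header rows before writing it to CSV).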

