
Unable to scrape data from a website with Python

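I want to extract the tables under "Bonds traded on the exchange" and "OTC trades" and save them to an Excel sheet. I am trying to scrape the data with Python (BeautifulSoup and requests), but I cannot get it to work (I do not want to use Selenium). Can anyone guide me? I do not get any error; nothing is processed in the Python terminal. I think the terminal hangs, because I do not even get an error message.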


import requests
import pandas as pd
import os
from bs4 import BeautifulSoup as bs



url = "https://www1.nseindia.com/products/content/debt/corp_bonds/cbm_reporting_homepage.htm"

#condition  True
#while condition:

html = requests.get(url).content
page= requests.get(url)
soup= bs(page.text, 'lxml')
df_list = pd.read_html(html)
df = df_list[0]     # can change 0 to other number 
print(df)

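If you look at the Network tab in your browser's developer tools, you will see a request for cbm_reporting_cbricsL.htm; that is the document you need to scrape. You should also add headers to the request for it to work properly. A detailed explanation is given in a separate thread.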

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
    )

soup = BeautifulSoup(res.text, 'lxml')

raw_columns = [row.find_all('td') for row in soup.find_all('tr')]

# the first 3 rows are dummy rows, so skip them
df = pd.DataFrame.from_records(raw_columns[3:])

The result will look like this:

0   [INE001A07TA7]  [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S...   [ 100.0030] [ 4.7082]   [ 16]   [[ 168000.00]]  [ 100.0000] [ 4.7091]
1   [INE134E07AP6]  [POWER FINANCE CORPORATION LTD. TRI SRV CATIII...   [ 100.8500] [ 6.6934]   [ 1]    [ 1000.00 ] [ 100.8500] [ 6.6934]
2   [INE020B08963]  [RURAL ELECTRIFICATION CORPORATION LIMITED SR-...   [ 107.6835] [ 5.9200]   [ 1]    [ 1500.00 ] [ 107.6835] [ 5.9200]
3   [INE163N08131]  [-] [ 104.2195] [ 6.6200]   [ 1]    [ 780.00 ]  [ 104.2195] [ 6.6200]
4   [INE540P07343]  [-] [ 104.3408] [ 9.3603]   [ 6]    [[ 1110.00]]    [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93  [INE377Y07250]  [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ...   [ 100.0300] [ 5.6845]   [ 1]    [ 9000.00 ] [ 100.0300] [ 5.6845]
94  [INE115A07ML7]  [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC...   [ 105.0991] [ 5.5000]   [ 1]    [ 1000.00 ] [ 105.0991] [ 5.5000]
95  [INE020B07HN3]  [RURAL ELECTRIFICATION CORPORATION LIMITED SR-...   [ 123.6000] [ 4.4400]   [ 1]    [ 10.00 ]   [ 123.6000] [ 4.4400]
96  [INE101A08070]  [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63...   [ 125.5000] [ 7.5218]   [ 1]    [ 820.00 ]  [ 125.5000] [ 7.5218]
97  [INE062A08215]  [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA...   [ 104.5304] [ 7.0000]   [ 1]    [ 10.00 ]   [ 104.5304] [ 7.0000]
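The brackets in the output above suggest that each cell still holds a BeautifulSoup element rather than a plain string. A minimal sketch of one way to get clean text is to call get_text() on every td before building the DataFrame. SAMPLE_HTML and rows_to_frame are hypothetical names used only for illustration, and the sample markup is an assumption that stands in for res.text; the real page layout may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>ISIN</th><th>Price</th></tr>
  <tr><td>INE001A07TA7</td><td> 100.0030</td></tr>
  <tr><td>INE134E07AP6</td><td> 100.8500</td></tr>
</table>
"""


def rows_to_frame(html, skip_rows=0):
    """Build a DataFrame of plain strings from every <tr> that has <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]
    # Header rows contain only <th> cells, so they come back empty; drop them.
    rows = [row for row in rows if row]
    return pd.DataFrame.from_records(rows[skip_rows:])


df = rows_to_frame(SAMPLE_HTML)
print(df)
```

With the real response, the call would be rows_to_frame(res.text, skip_rows=...), where the number of leading dummy rows to skip has to be checked against the page, since empty rows are already filtered out here.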

This is what I ended up with. I used pandas to extract the first table from the web page.

import requests
import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

html = requests.get(
    'abcd',
    headers=headers).content


df_list = pd.read_html(html)
df = df_list[0]
print(df)
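The question also asks for the tables to be saved to an Excel sheet. A rough sketch of that step, assuming the two tables have already been scraped into DataFrames, could look like the following. The helper names name_sheets and save_tables, the sheet names, the sample rows, and the output file name are all hypothetical; writing .xlsx files also requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def name_sheets(tables, names):
    """Pair each DataFrame with a sheet name (Excel caps names at 31 characters)."""
    return {name[:31]: table for name, table in zip(names, tables)}


def save_tables(sheets, path):
    """Write each DataFrame to its own sheet of a single workbook."""
    # Requires an Excel engine such as openpyxl.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, table in sheets.items():
            table.to_excel(writer, sheet_name=sheet_name, index=False)


# Hypothetical stand-ins for the two scraped tables.
exchange = pd.DataFrame({"ISIN": ["INE001A07TA7"], "Price": [100.0030]})
otc = pd.DataFrame({"ISIN": ["INE134E07AP6"], "Price": [100.8500]})

sheets = name_sheets([exchange, otc], ["Traded on Exchange", "OTC Trades"])
# save_tables(sheets, "corp_bonds.xlsx")  # uncomment to write the file
```

Keeping one sheet per table means the exchange and OTC data stay separate in the workbook while still being written in a single pass.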
