Unable to scrape data from a website with Python
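I want to extract the tables under "Bonds traded on the exchange" and "OTC trades" and save them to an Excel sheet. I am trying to scrape the data with Python (BeautifulSoup and requests), but it isn't working (I don't want to use Selenium). Can anyone guide me? I don't get any error: the script simply never finishes in the Python terminal, so I think the terminal is hanging, since no error message ever appears.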
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

url = "https://www1.nseindia.com/products/content/debt/corp_bonds/cbm_reporting_homepage.htm"

# Fetch the page once and reuse the response for both parsers
page = requests.get(url)
soup = bs(page.text, 'lxml')
df_list = pd.read_html(page.content)
df = df_list[0]  # change 0 to select a different table
print(df)
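If you look at the Network tab in your browser's developer tools, you will see a request for cbm_reporting_cbricsL.htm.
That is the document you need to scrape. You should also add headers to the request for it to work properly. See the detailed explanation in this thread: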
import requests
import pandas as pd
from bs4 import BeautifulSoup
res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
)
soup = BeautifulSoup(res.text, 'lxml')
# Collect the <td> cells of every row (each cell is still a bs4 Tag, not plain text)
raw_columns = [row.find_all('td') for row in soup.find_all('tr')]
# The first 3 rows are dummy rows, so skip them
df = pd.DataFrame.from_records(raw_columns[3:])
The result will look like this:
0 [INE001A07TA7] [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S... [ 100.0030] [ 4.7082] [ 16] [[ 168000.00]] [ 100.0000] [ 4.7091]
1 [INE134E07AP6] [POWER FINANCE CORPORATION LTD. TRI SRV CATIII... [ 100.8500] [ 6.6934] [ 1] [ 1000.00 ] [ 100.8500] [ 6.6934]
2 [INE020B08963] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 107.6835] [ 5.9200] [ 1] [ 1500.00 ] [ 107.6835] [ 5.9200]
3 [INE163N08131] [-] [ 104.2195] [ 6.6200] [ 1] [ 780.00 ] [ 104.2195] [ 6.6200]
4 [INE540P07343] [-] [ 104.3408] [ 9.3603] [ 6] [[ 1110.00]] [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93 [INE377Y07250] [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ... [ 100.0300] [ 5.6845] [ 1] [ 9000.00 ] [ 100.0300] [ 5.6845]
94 [INE115A07ML7] [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC... [ 105.0991] [ 5.5000] [ 1] [ 1000.00 ] [ 105.0991] [ 5.5000]
95 [INE020B07HN3] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 123.6000] [ 4.4400] [ 1] [ 10.00 ] [ 123.6000] [ 4.4400]
96 [INE101A08070] [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63... [ 125.5000] [ 7.5218] [ 1] [ 820.00 ] [ 125.5000] [ 7.5218]
97 [INE062A08215] [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA... [ 104.5304] [ 7.0000] [ 1] [ 10.00 ] [ 104.5304] [ 7.0000]
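The square brackets in the output above appear because each cell is still a BeautifulSoup Tag, not a string. A minimal sketch of one way to get plain text instead is to call get_text(strip=True) on every cell. The SAMPLE_HTML markup and the rows_as_text helper are made up for illustration and only stand in for the real response body:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real response body.
SAMPLE_HTML = """
<table>
  <tr><th>ISIN</th><th>Price</th></tr>
  <tr><td> INE001A07TA7 </td><td> 100.0030 </td></tr>
  <tr><td>INE134E07AP6</td><td> 100.8500 </td></tr>
</table>
"""


def rows_as_text(html):
    """Return one list of stripped cell strings per row that has <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows without <td> cells (e.g. header rows) are skipped
            rows.append(cells)
    return rows


rows = rows_as_text(SAMPLE_HTML)
print(rows)
```

The resulting list of lists can be passed to pd.DataFrame(rows), and DataFrame.to_excel can then write it to a spreadsheet, provided an Excel writer such as openpyxl is installed.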
This is what I ended up with. I used pandas to extract the first table from the page:
import requests
import pandas as pd
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
# 'abcd' is a placeholder for the page URL
html = requests.get('abcd', headers=headers).content
df_list = pd.read_html(html)  # one DataFrame per <table> found
df = df_list[0]               # keep the first table
print(df)
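Since the original symptom was a script that appeared to hang without any error, it may also help to pass a timeout so a stalled request raises an exception instead of waiting indefinitely. This is only a sketch: the fetch_html helper, its session parameter and the 10-second default are assumptions for illustration, not part of the answers above.

```python
import requests

# Same browser-like User-Agent string used above.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}


def fetch_html(url, session=None, timeout=10):
    """Fetch url and return the body text, failing loudly instead of hanging."""
    http = session or requests.Session()
    # timeout makes requests raise an exception if the server stops responding
    response = http.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

The returned text can then be parsed with BeautifulSoup or pd.read_html as shown earlier. Wrapping the call in try/except requests.exceptions.RequestException gives an error message to inspect when the request fails.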