![](/img/trans.png)
[英]How to extract Table contents from an HTML page using BeautifulSoup in Python?
[英]Extract Columns from html using Python (Beautifulsoup)
我需要从此页面中提取信息-http: //www.investing.com/currencies/usd-brl-historical-data 。 我需要日期,价格,开盘价,最高价,最低价,更改%。 我是Python的新手,所以在这一步遇到了麻烦:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
Table_Values = item.find_all('tr')
N=len(Table_Values)-1
for n in range(N):
k = (item.find_all('td', {'class':'first left bold noWrap'})[n].text)
print(item.find_all('td', {'class':'first left bold noWrap'})[n].text)
在这里我有几个问题:
价格列可以标记为或。 如何指定需要class ='redFont'或/和'greenfont'标签的商品? 。 同样,Change%还可以具有redFont和greenFont类。 其他列用标记。 如何提取它们?
有没有办法从表中提取列?
理想情况下,我希望有一个带有日期,价格,开盘价,高价,低价,变动%列的日期框架。
谢谢
这是将html表转换为嵌套列表的方法
解决方案是找到特定的表,然后遍历表中的每个tr,创建该tr内所有项目的文本的子列表。 执行此操作的代码是嵌套列表理解。
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
这将从表中获取所有数据
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
如果要将其转换为熊猫数据框,则只需获取表标题并将其添加
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
然后,您将获得一个如下所示的数据框:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
如何从该站点解析表我已经在这里回答了,但是由于您想要一个DataFrame ,只需使用pandas.read_html
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
import pandas as pd
df = pd.read_html(r.content,attrs = {'id': 'curr_table'})[0]
这会给你:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
通常,您通常可以直接传递url,但是使用urllib2 (对于read_html使用的lib),我们会在此特定网站上收到403错误 ,因此我们需要使用请求来获取该html。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.