Scraping a data table from a Chinese stock website using Python
I am trying to scrape part of the data from a Chinese website. The website I want to scrape is: http://data.10jqka.com.cn/market/yybzjsl/HTZQGFYXGSCDSJLZQYYB
I want to get the whole data table shown on that page. There are 86 pages.
The code below does not succeed. Can someone give me a hand?
import urllib2, pandas, json

baseurl = "http://data.10jqka.com.cn/interface/market/yybzjsl/desc/%s/20/"

def getdata(pgnum):
    cururl = baseurl % str(pgnum)
    ##print cururl
    cn = urllib2.urlopen(cururl)
    jstbl = json.load(cn, encoding='gbk')['data']
    return pandas.read_html('<table>' + jstbl + '</table>')[0]

dataout = pandas.DataFrame()
for pgnum in range(86):
    print pgnum
    totaltry = 0
    while True:
        try:
            curdata = getdata(pgnum + 1)
            curdata['pgnum'] = pgnum + 1
            break
        except:
            totaltry += 1
            print 'failed: %s' % totaltry
    dataout = dataout.append(curdata, ignore_index=True)
dataout.to_csv('~/Desktop/dataout.csv')
I would suggest using Beautiful Soup to do the scraping.
UPDATE: something like this (but I would recommend looking at the BS4 documentation for how to really use it)...
# Python 2 (urllib.urlopen and the print statement)
import urllib
from bs4 import BeautifulSoup, SoupStrainer

baseurl = "http://data.10jqka.com.cn/market/yybzjsl/HTZQGFYXGSCDSJLZQYYB"
page = urllib.urlopen(baseurl)

# Parse only the <table> elements instead of the whole page
getonly = SoupStrainer('table')
table = BeautifulSoup(page, parse_only=getonly)

# Print the text content of each table row
for row in table("tr"):
    text = ''.join(row.findAll(text=True))
    data = text.strip()
    print data
Gets you:
...
2015-02-04
吴通通讯
日换手率达20%的证券
10.02%
买入
3489.00
1084.38
7.43%
通信设备
2015-02-03
赢时胜
日换手率达20%的证券
5.57%
卖出
1065.53
646.77
6.07%
计算机应用
2015-01-30
京天利
日涨幅偏离值达7%的证券
10.00%
买入
2363.95
1698.03
13.68%
计算机应用
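The output above prints every cell on its own line, so the row boundaries are lost. If you want one record per table row, a rough sketch along the following lines may help. It assumes Python 3 and deliberately swaps Beautiful Soup for the standard library's html.parser so that it runs without third-party packages. The names TableRowParser, parse_rows and SAMPLE_HTML are hypothetical, and the sample markup and its column headers are invented for illustration; the real page's HTML may be structured differently.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a>) still belongs to the open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that contained no cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return the table rows found in `html` as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup and headers, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>stock</th><th>direction</th></tr>
  <tr><td>2015-02-04</td><td>吴通通讯</td><td>买入</td></tr>
  <tr><td>2015-02-03</td><td><a href="#">赢时胜</a></td><td>卖出</td></tr>
</table>
"""

rows = parse_rows(SAMPLE_HTML)
header, records = rows[0], rows[1:]
for record in records:
    print(dict(zip(header, record)))
```

Keeping each row as a list of cells makes it straightforward to write the result out with the csv module or to pass it to pandas.DataFrame(records, columns=header) afterwards.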