Scraping data from multiple html tables within one website in python

I am trying to get a time series from this website into Python: http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data#page=1

I've gotten pretty far, but I don't know how to get all the data, not just the first 50 rows that are visible on the page. To view the rest online, you have to click through the results at the bottom of the table. I would like to be able to specify a start and end date in Python and get all the corresponding dates and prices in a list. Here is what I have so far:

 from bs4 import BeautifulSoup
 import requests
 import re

 url = 'http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data'
 # the 'lxml' parser requires the lxml package; BeautifulSoup's default parser also works
 soup = BeautifulSoup(requests.get(url).text, 'lxml')

 # strip all whitespace (newlines, tabs, spaces) around each cell's text
 dates  = soup.findAll('td', class_='column-date')
 dates  = [re.sub(r'\s', '', d.string) for d in dates]
 prices = soup.findAll('td', class_='column-price')
 prices = [re.sub(r'\s', '', p.string) for p in prices]
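
For the start/end-date requirement, one possible approach (a sketch added for illustration, not part of the original question) is to parse the scraped strings with datetime and keep only the rows inside the window. The filter_by_date helper and the '%d.%m.%y' date format are assumptions; check how the site actually prints its dates:

 from datetime import datetime

 def filter_by_date(dates, prices, start, end, fmt='%d.%m.%y'):
     # Hypothetical helper: pair each date with its price and keep only
     # the rows whose parsed date falls inside [start, end].
     kept = []
     for d, p in zip(dates, prices):
         day = datetime.strptime(d, fmt)
         if start <= day <= end:
             kept.append((d, p))
     return kept

 selected = filter_by_date(dates, prices,
                           datetime(2014, 6, 1), datetime(2014, 9, 1))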

You need to loop through the rest of the pages. You can use a POST request to do that: the server expects to receive a structure in each POST request, defined below in values. The page number is the 'page' parameter of that structure. The structure has several parameters I have not tested but that could be interesting to try, such as items_per_page, max_time and min_time. Here below is an example code:

 from bs4 import BeautifulSoup
 import urllib  # Python 2: provides urlencode and urlopen
 import re

 url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m'
 values = {'COMPONENT_ID': 'PREeb7da7a4f4654f818494b6189b755e76',
           'ag': '103708549',
           'boerse_id': '12',
           'include_url': '/parts/boxes/history/_histdata_full.m',
           'item_count': '96',
           'items_per_page': '50',
           'lang': 'en',
           'link_id': '',
           'max_time': '2014-09-20',
           'min_time': '2014-05-09',
           'page': 1,
           'page_size': '50',
           'pages_total': '2',
           'secu': '103708549',
           'template': '0',
           'titel': '',
           'title': '',
           'title_link': '',
           'use_external_secu': '1'}

 dates = []
 prices = []
 while True:
     data = urllib.urlencode(values)
     request = urllib.urlopen(url, data)  # passing data makes this a POST
     soup = BeautifulSoup(request.read())
     temp_dates = soup.findAll('td', class_='column-date')
     temp_dates = [re.sub(r'\s', '', d.string) for d in temp_dates]
     temp_prices = soup.findAll('td', class_='column-price')
     temp_prices = [re.sub(r'\s', '', p.string) for p in temp_prices]
     if not temp_prices:  # no more rows: we have run out of pages
         break
     else:
         dates = dates + temp_dates
         prices = prices + temp_prices
         values['page'] += 1
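
The code above is Python 2 (urllib.urlencode and urllib.urlopen). A rough Python 3 equivalent using requests, assuming the endpoint still accepts the same form fields, might look like the sketch below; requests.post form-encodes the values dict, playing the same role as urllib.urlencode plus urlopen:

 import re
 import requests
 from bs4 import BeautifulSoup

 url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m'
 # same form-field structure as in the answer above
 values = {
     'COMPONENT_ID': 'PREeb7da7a4f4654f818494b6189b755e76',
     'ag': '103708549', 'boerse_id': '12',
     'include_url': '/parts/boxes/history/_histdata_full.m',
     'item_count': '96', 'items_per_page': '50', 'lang': 'en',
     'link_id': '', 'max_time': '2014-09-20', 'min_time': '2014-05-09',
     'page': 1, 'page_size': '50', 'pages_total': '2',
     'secu': '103708549', 'template': '0', 'titel': '', 'title': '',
     'title_link': '', 'use_external_secu': '1',
 }

 dates, prices = [], []
 while True:
     resp = requests.post(url, data=values)  # form-encoded POST
     soup = BeautifulSoup(resp.text, 'lxml')
     temp_dates = [re.sub(r'\s', '', d.string)
                   for d in soup.find_all('td', class_='column-date')]
     temp_prices = [re.sub(r'\s', '', p.string)
                    for p in soup.find_all('td', class_='column-price')]
     if not temp_prices:  # an empty page means we are past the last one
         break
     dates += temp_dates
     prices += temp_prices
     values['page'] += 1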
