How to scrape charts from a website with Python?
Edit:
So I saved the script source below to a text file, but extracting the data with re still doesn't return anything. My code is:
import json
import re
from bs4 import BeautifulSoup

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)
scripts = soup.find("script", text=pattern)
profile_text = pattern.search(scripts.text).group(1)
profile = json.loads(profile_text)
print(profile["data"], profile["categories"])
I want to extract the chart data from a website. Below is the source code for one of the charts.
<script type="text/javascript">
jQuery(function() {
    var chart1 = new Highcharts.Chart({
        chart: {
            renderTo: 'chart1',
            defaultSeriesType: 'column',
            borderWidth: 2
        },
        title: {
            text: 'Productions'
        },
        legend: {
            enabled: false
        },
        xAxis: [{
            categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],
        }],
        yAxis: {
            min: 0,
            title: {
                text: 'Productions'
            }
        },
        series: [{
            name: 'Productions',
            data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
        }]
    });
});
</script>
The website has several similar charts, named "chart1", "chart2", and so on. I want to extract the following data, the categories line and the data line, for each chart:
categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]
data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
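For the simple case where each bracketed list sits on its own line, the two fields can be pulled with nothing but the standard library. A minimal sketch; the `extract` helper and the abbreviated sample are illustrative, not from the page itself:

```python
import json
import re

# Abbreviated copy of the <script> body shown above (illustrative).
script = """
var chart1 = new Highcharts.Chart({
    xAxis: [{
        categories: [1999,2000,2001],
    }],
    series: [{
        name: 'Productions',
        data: [1,1,0]
    }]
});
"""

def extract(field, text):
    # The bracketed lists are valid JSON on their own, so once the
    # regex isolates one, json.loads can parse it directly.
    m = re.search(field + r":\s*(\[[^\]]*\])", text)
    return json.loads(m.group(1)) if m else None

print(extract("categories", script))  # [1999, 2000, 2001]
print(extract("data", script))        # [1, 1, 0]
```

This only works while every list stays on a single line and contains no nested brackets, which holds for the sample above.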
Another approach is to use the Highcharts JavaScript API, just as you would in the browser console, and pull the data out with Selenium.
import time
from selenium import webdriver
website = ""
driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)
temp = driver.execute_script('return window.Highcharts.charts[0]'
'.series[0].options.data')
data = [item[1] for item in temp]
print(data)
This may differ slightly depending on which series and which chart you are trying to pull.
I would use a combination of a regular expression and a YAML parser. The following is quick and dirty; you may need to tweak the regex, but it works with your example:
import re
import sys
import yaml

chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$',
                           re.MULTILINE | re.DOTALL)
script = sys.stdin.read()
m = chart_matcher.findall(script)
for name, data in m:
    print(name)
    try:
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)
It requires the yaml library (pip install PyYAML), and you should use BeautifulSoup to extract the right <script> tag before passing it to the regex.
Edit - complete example
Sorry, I wasn't clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, then use PyYAML to parse the javascript object declarations. You can't use the built-in json library because the declarations are not valid JSON, but a plain javascript object declaration (i.e. one without functions) is a subset of YAML.
from bs4 import BeautifulSoup
import yaml
import re

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)
charts = {}
# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):
    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')
    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()
Output:
## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]
Note that I had to tweak the regex to cope with BeautifulSoup's unicode output and the extra whitespace; in my original example I simply piped your source straight into the regex.
Edit 2 - without yaml
Given that the javascript appears to be partially generated, the best you can hope for is to grab what you can. It's not elegant, but it may work for you.
from bs4 import BeautifulSoup
import json
import re

file_object = open('citec.repec.org_p_c_pcl20.html', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)
charts = {}
for tag in soup.find_all('script'):
    script = tag.text
    values = {}
    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        for line in obj_declaration.split('\n'):
            # strip surrounding whitespace and javascript punctuation
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        # the bracketed lists are valid JSON on their own
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))
        charts[name] = values
print(charts)
Note that it fails for chart7, because that chart references another variable instead of a literal list.
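The chart7 failure mode is easy to reproduce: when the generated value is a variable reference rather than a literal, json.loads has nothing to parse. A tiny illustration (the `yearLabels` line is hypothetical, standing in for whatever chart7 actually references):

```python
import json

# A hypothetical generated line where the value is a javascript
# variable reference instead of a literal list.
line = "categories: yearLabels"
value = line[len("categories:"):].strip()

try:
    json.loads(value)
    print("parsed")
except ValueError:
    print("cannot parse %r: not a JSON literal" % value)
```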