
How to scrape charts from a website with Python?

EDIT:

So I have saved the script code below to a text file, but using re to extract the data still doesn't return anything. My code is:

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)
scripts = soup.find("script", text=pattern)
profile_text = pattern.search(scripts.text).group(1)
profile = json.loads(profile_text)

print(profile["data"], profile["categories"])

I would like to extract a chart's data from a website. The following is the source code of the chart.

  <script type="text/javascript">
    jQuery(function() {

    var chart1 = new Highcharts.Chart({

          chart: {
             renderTo: 'chart1',
              defaultSeriesType: 'column',
            borderWidth: 2
          },
          title: {
             text: 'Productions'
          },
          legend: {
            enabled: false
          },
          xAxis: [{
             categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],

          }],
          yAxis: {
             min: 0,
             title: {
             text: 'Productions'
          }
          },

          series: [{
               name: 'Productions',
               data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
               }]
       });
    });

    </script>

There are several charts like that on the website, called "chart1", "chart2", etc. For each chart I would like to extract the following data: the categories line and the data line:

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]

Another way is to query the Highcharts JavaScript library as one would in the browser console, and pull the result using Selenium.

import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
data = [item[1] for item in temp]
print(data)

Depending on which chart and series you are trying to pull, your case might be slightly different.
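One way the case can differ: the list comprehension above assumes every point is an [x, y] pair, but Highcharts series data can also be a flat list of numbers (as in the chart from the question) or a list of objects. A minimal sketch of a helper that copes with all three shapes follows. The name y_values is made up for illustration, and it assumes the y value sits at index 1 of a pair or under the "y" key of an object.

```python
def y_values(series_data):
    """Return the y value of each point in a Highcharts-style data list."""
    values = []
    for point in series_data:
        if isinstance(point, dict):
            # object form, e.g. {"x": 1999, "y": 6} (assumed key name "y")
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # pair form, e.g. [1999, 6] (assumed to be [x, y])
            values.append(point[1])
        else:
            # plain number, as in the chart from the question
            values.append(point)
    return values


print(y_values([1, 1, 0, 1]))              # [1, 1, 0, 1]
print(y_values([[1999, 1], [2000, 1]]))    # [1, 1]
print(y_values([{"x": 1999, "y": 6}]))     # [6]
```

With Selenium you would pass temp to it instead of indexing item[1] directly.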

I'd go with a combination of regex and a YAML parser. Quick and dirty below - you may need to tweak the regex, but it works with the example:

import re
import sys
import yaml

chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$',
        re.MULTILINE | re.DOTALL)

script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print(name)
    try:
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)

Requires the yaml library (pip install PyYAML), and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.

EDIT - full example

Sorry I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because the declaration is not valid JSON, but plain JavaScript object declarations (i.e. with no functions) are close enough to YAML flow syntax that PyYAML can usually parse them.
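To illustrate the JSON point on a toy string (not the page's real source): the json module rejects unquoted keys and single-quoted strings, which is exactly what a Highcharts options object contains. The helper name parses_as_json is only for this sketch.

```python
import json

# a fragment in the style of the chart source: unquoted keys, single quotes
js_object = "{ name: 'Productions', data: [1, 1, 0] }"

# the same content written as strict JSON
strict_json = '{ "name": "Productions", "data": [1, 1, 0] }'


def parses_as_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(js_object))    # False
print(parses_as_json(strict_json))  # True
```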

from bs4 import BeautifulSoup
import yaml
import re

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()

Output:

## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

Note I had to tweak the regex to make it handle the unicode output and whitespace from BeautifulSoup - in my original example I just piped your source directly to the regex.
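The whitespace part is easy to reproduce. In the sketch below (the source string is a cut-down stand-in for the real script, not the actual page), the pattern anchored with ^ and $ finds nothing because "var" is indented, while the unanchored one matches.

```python
import re

# cut-down stand-in for the script body; note the leading spaces before "var"
source = """    jQuery(function() {
    var chart1 = new Highcharts.Chart({
        series: [{ data: [1, 2, 3] }]
    });
    });
"""

# anchored version: "var" must sit at the very start of a line
anchored = re.compile(
    r"^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$",
    re.MULTILINE | re.DOTALL)

# relaxed version: no anchors, so indentation does not matter
relaxed = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);",
    re.DOTALL)

anchored_names = [name for name, _ in anchored.findall(source)]
relaxed_names = [name for name, _ in relaxed.findall(source)]

print(anchored_names)  # []
print(relaxed_names)   # ['chart1']
```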

EDIT 2 - no yaml

Given that the javascript looks to be partially generated, the best you can hope for is to grab the lines - not elegant, but it will probably work for you.

from bs4 import BeautifulSoup
import json
import re

file_object = open('citec.repec.org_p_c_pcl20.html', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # collect the fields for this chart only
        values = {}
        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that it fails for chart7 because that references another variable.
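If splitting on lines turns out to be too fragile, a variation on the same idea is to let a second regex pick out the two fields directly. This is only a sketch: FIELD and extract_fields are names invented here, the sample string is a cut-down stand-in, and the pattern only handles flat array literals (no nested brackets). A field whose value is a variable reference, as with chart7, is simply left out of the result instead of raising.

```python
import json
import re

# "categories" or "data", a colon, then a flat [...] literal (no nesting)
FIELD = re.compile(r"\b(categories|data)\s*:\s*(\[[^\[\]]*\])")


def extract_fields(obj_declaration):
    """Return the categories/data arrays found in one chart declaration."""
    found = {}
    for field, raw in FIELD.findall(obj_declaration):
        try:
            found[field] = json.loads(raw)
        except ValueError:
            # e.g. single-quoted strings inside the array are not valid JSON
            found[field] = None
    return found


sample = """
    xAxis: [{
        categories: [1999,2000,2001],
    }],
    series: [{
        name: 'Productions',
        data: [1,1,0]
    }]
"""

print(extract_fields(sample))
# {'categories': [1999, 2000, 2001], 'data': [1, 1, 0]}
```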
