
Get the parameters of a JavaScript function with Scrapy

I was wondering whether it is possible to extract the parameters of a JavaScript function call with Scrapy, from code similar to this:

<script type="text/javascript">
  var map;
  function initialize() {
    var fenway = new google.maps.LatLng(43.2640611,2.9388228);
  }
</script>

I would like to extract the coordinates 43.2640611 and 2.9388228.

This is where the re() method helps.

The idea is to locate the script tag via xpath() and then use re() to extract the lat and lng from the script tag's contents.

Demo from the scrapy shell:

$ scrapy shell index.html
>>> response.xpath('//script').re(r'new google\.maps\.LatLng\(([0-9.]+),([0-9.]+)\);')
[u'43.2640611', u'2.9388228']

where index.html contains:

<script type="text/javascript">
  var map;
  function initialize() {
    var fenway = new google.maps.LatLng(43.2640611,2.9388228);
  }
</script>

Of course, in your case the XPath would not be just //script, since a real page usually contains several script tags.

FYI, the regular expression new google\.maps\.LatLng\(([0-9.]+),([0-9.]+)\); uses the capturing groups ([0-9.]+) to extract the coordinate values.
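To see what the capturing groups do outside of Scrapy, here is a small sketch using only Python's re module. The pattern is a slightly more tolerant variant that I am assuming for illustration (it also accepts a leading minus sign and spaces around the comma); it is not the exact expression from the demo.

```python
import re

# Assumed variant of the pattern: allows negative values and optional whitespace.
LATLNG_RE = re.compile(
    r"new google\.maps\.LatLng\(\s*(-?[0-9.]+)\s*,\s*(-?[0-9.]+)\s*\)"
)

script_text = "var fenway = new google.maps.LatLng(43.2640611,2.9388228);"

match = LATLNG_RE.search(script_text)
# Each pair of parentheses in the pattern becomes one captured group.
lat, lng = (float(group) for group in match.groups())

print(lat, lng)
```

Scrapy's re() returns the captured groups as strings, so a conversion such as float() is still needed if numeric values are wanted.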

Also see "Using selectors with regular expressions" in the Scrapy documentation.

Disclaimer: I haven't tried this approach, but here's how I would think about it if I were constrained to using Scrapy and didn't want to parse the JavaScript the way alecxe suggested above. This is a finicky, fragile hack :-)

You can try using scrapyjs to execute the JavaScript code from your Scrapy crawler. In order to capture those parameters, you'd need to do the following:

  1. Load the original page and save it to disk.
  2. Modify the page to replace the google.maps.LatLng function with your own (see below). Make sure your script runs AFTER the Google JS is loaded.
  3. Load the modified page using scrapyjs (or the instance of WebKit created by it).
  4. Parse the page and look for the two special divs created by your fake LatLng function, which contain the extracted lat and lng values.

More on step 2: make your fake LatLng function modify the HTML page to expose the lat and lng values, so that you can parse them out with Scrapy. Here is some crude code to illustrate:

// Fake LatLng: writes each argument into its own div so it can be scraped later.
var LatLng = function LatLng(lat, lng) {
  var latDiv = document.createElement("div");
  latDiv.id = "extractedLat";
  latDiv.innerHTML = lat;
  document.body.appendChild(latDiv);

  var lngDiv = document.createElement("div");
  lngDiv.id = "extractedLng";
  lngDiv.innerHTML = lng;
  document.body.appendChild(lngDiv);
};

// Replace the google.maps namespace so the page calls the fake function.
google = {
  maps: {
    LatLng: LatLng
  }
};
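Steps 1 and 2 could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the helper name inject_stub, the shortened STUB_JS string, and the choice to insert the script just before the closing body tag. Where the stub really has to go depends on when the page loads the Google script and when it calls initialize().

```python
import re

# Hypothetical stub: overrides google.maps.LatLng and records its arguments in divs.
STUB_JS = """
window.google = window.google || {};
google.maps = google.maps || {};
google.maps.LatLng = function (lat, lng) {
  var latDiv = document.createElement("div");
  latDiv.id = "extractedLat";
  latDiv.textContent = lat;
  document.body.appendChild(latDiv);
  var lngDiv = document.createElement("div");
  lngDiv.id = "extractedLng";
  lngDiv.textContent = lng;
  document.body.appendChild(lngDiv);
};
"""


def inject_stub(html, stub_js=STUB_JS):
    """Return html with the stub wrapped in a <script> tag before </body>.

    If no closing body tag is found, the script is appended at the end.
    """
    script_tag = "<script>" + stub_js + "</script>"
    closing_body = re.search(r"</body\s*>", html, flags=re.IGNORECASE)
    if closing_body is None:
        return html + script_tag
    position = closing_body.start()
    return html[:position] + script_tag + html[position:]


saved_page = "<html><body><p>map goes here</p></body></html>"
modified_page = inject_stub(saved_page)
print(modified_page)
```

The modified page would then be loaded in a JavaScript-capable renderer (step 3), and the two divs could be read back with ordinary selectors such as //div[@id="extractedLat"]/text() (step 4).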

Overall, this approach sounds a bit painful, but it could be fun to try.
