NodeJS and Cheerio web scraping

I built an application that scrapes a page. That page contains a script like this:

<script>
var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };
</script>

With the cheerio and request Node modules I can get the script, but I need the values of car1, car2 and car3.

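var request = require('request');
var cheerio = require('cheerio');

request('http://my-url.com', function(error, response, body) {
    // Load the fetched HTML into cheerio
    var $ = cheerio.load(body);

    // Contents of the element just before the last <script> in <body>
    var htmlData = $('body script').last().prev().html();
    console.log(htmlData);
});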

I tried JSON.parse(htmlData), but I get the following error: SyntaxError: Unexpected token T.

Is there any way to parse the JavaScript from the script, or can someone explain how to get the values of car1 and car2 with a regex?

I would recommend doing a series of string replacements and then calling JSON.parse to get the JavaScript object, like this:

var data = "{ Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };";
var obj = JSON.parse(data
  .replace(/((?:[A-Za-z_][\w\d])+):/g, '"$1":')
  .replace(/'/g, '"')
  .replace(/;\s*$/, ''));
console.log(obj.car1, obj.car2, obj.car3);
// Volvo Ferarri VW

Here,

.replace(/((?:[A-Za-z_][\w\d])+):/g, '"$1":')

will replace every match of (?:[A-Za-z_][\w\d])+ that is followed by : with the same matched text wrapped in double quotes and followed by :, that is, with "$1":. This turns a bare key such as Time: into "Time":.
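One thing worth checking about this pattern: the group [A-Za-z_][\w\d] always consumes two characters, so a match always has an even length. That works for Time and car1, but a key with an odd number of characters would only be partly quoted. The sketch below compares it with a more general pattern; generalPattern and the sample string are assumptions for illustration, not part of the original answer.

```javascript
// Pattern from the answer: consumes characters two at a time.
const pairPattern = /((?:[A-Za-z_][\w\d])+):/g;
// Assumed alternative: a leading letter or underscore, then any word characters.
const generalPattern = /([A-Za-z_]\w*):/g;

// Hypothetical input with a three-character key ("car").
const sample = "{ car: 'Volvo', car1: 'VW' }";

const withPairs = sample.replace(pairPattern, '"$1":');
const withGeneral = sample.replace(generalPattern, '"$1":');

console.log(withPairs);   // { c"ar": 'Volvo', "car1": 'VW' }
console.log(withGeneral); // { "car": 'Volvo', "car1": 'VW' }
```

If the scraped keys can have any length, the second pattern is probably the safer choice.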

And then

.replace(/'/g, '"')

will replace every ' with " (assuming your data does not contain ' inside the values).
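To see why that assumption matters, here is a small sketch with a made-up value that contains an apostrophe. The input string and the quoteSwap helper are hypothetical and only illustrate the failure mode.

```javascript
// Hypothetical helper wrapping the quote replacement from the answer.
function quoteSwap(text) {
  return text.replace(/'/g, '"');
}

// Valid JSON before the swap; the value contains an apostrophe.
const risky = '{ "owner": "Jim\'s garage" }';
const swapped = quoteSwap(risky);

let failed = false;
try {
  JSON.parse(swapped);
} catch (err) {
  // The apostrophe became a stray double quote, so parsing fails.
  failed = err instanceof SyntaxError;
}

console.log(swapped); // { "owner": "Jim"s garage" }
console.log(failed);  // true
```

If the scraped values may contain apostrophes, the blanket replacement is not safe and the quotes would need more careful handling.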

And then

.replace(/;\s*$/, '')

will replace the ; at the end of the string, along with any whitespace after it, with an empty string (basically, we remove them).

At this point, the string will look like this:

{ "Time": "10:46:29 am", "car1": "Volvo", "car2": "Ferarri", "car3": "VW" }

and now we simply parse it as a JSON string with JSON.parse to get the JavaScript object.
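Putting the steps together, a minimal sketch might look like the following. It assumes the script text has the form shown in the question (a single var statement assigning one flat object literal) and that the values contain no ' characters and no word followed by a colon. The function name parseScriptObject and the more general key pattern are illustrative choices, not part of the original answer.

```javascript
// Hypothetical helper: turns "var x = { key: 'value', ... };" into an object.
function parseScriptObject(scriptText) {
  // Keep only the text from the first "{" to the last "}".
  const start = scriptText.indexOf('{');
  const end = scriptText.lastIndexOf('}');
  if (start === -1 || end < start) {
    throw new Error('No object literal found in script text');
  }
  const literal = scriptText.slice(start, end + 1);

  const json = literal
    .replace(/([A-Za-z_]\w*)\s*:/g, '"$1":') // quote bare keys
    .replace(/'/g, '"');                      // single quotes to double quotes

  return JSON.parse(json);
}

// Script text as it appears in the question (what .html() would return).
const scriptText =
  "var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };";

const myData = parseScriptObject(scriptText);
console.log(myData.car1, myData.car2, myData.car3);
// Volvo Ferarri VW
```

Slicing between the braces removes both the var myData = prefix and the trailing semicolon, so the separate ; replacement is not needed in this variant.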
