NodeJS and Cheerio web scraping
I made an application where I scrape a page. On that page I have a script like this:
<script>
var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };
</script>
With the cheerio and request node modules I get the script, but I need to get the values of car1, car2 and car3.
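var request = require('request');
var cheerio = require('cheerio');

request('http://my-url.com', function(error, response, body) {
    var $ = cheerio.load(body);
    // Grab the contents of the second-to-last <script> inside <body>
    var htmlData = $('body script').last().prev().html();
    console.log(htmlData);
});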
I've tried to use JSON.parse(htmlData), but I get the following error: SyntaxError: Unexpected token T.
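For context, the error most likely comes from the fact that a JavaScript object literal with unquoted keys and single-quoted strings is not valid JSON. The minimal sketch below illustrates this with a shortened copy of the data; `tryParse` is a hypothetical helper introduced only for this example, and the exact error message varies between Node versions.

```javascript
// A JavaScript object literal is not automatically valid JSON:
// JSON requires double-quoted keys and double-quoted strings.
const literal = "{ Time: '10:46:29 am', car1: 'Volvo' }";
const strictJson = '{ "Time": "10:46:29 am", "car1": "Volvo" }';

// Hypothetical helper: reports whether JSON.parse accepted the text.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

const literalResult = tryParse(literal);
const jsonResult = tryParse(strictJson);

console.log(literalResult.ok, literalResult.error); // false SyntaxError
console.log(jsonResult.ok, jsonResult.value.car1);  // true Volvo
```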
Is there any way to parse the JavaScript from the script, or can someone explain to me how to get the values of car1 and car2 via regex?
I would recommend doing a series of string replacements and then calling JSON.parse to get the JavaScript object, like this:
var data = "{ Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };";
var obj = JSON.parse(data
    .replace(/([A-Za-z_]\w*):/g, '"$1":')
    .replace(/'/g, '"')
    .replace(/;\s*$/, ''));
console.log(obj.car1, obj.car2, obj.car3);
// Volvo Ferarri VW
Here,
.replace(/([A-Za-z_]\w*):/g, '"$1":')
will replace every key of the form [A-Za-z_]\w* (a letter or underscore followed by any number of word characters) that is immediately followed by : with the same key surrounded by " and followed by :, that is, with "$1":. The numbers in the time value are left alone, because 10, 46 and 29 do not start with a letter or underscore.
And then
.replace(/'/g, '"')
will replace every ' with " (assuming your data does not contain ' itself).
And then
.replace(/;\s*$/, '')
will replace the ; followed by any whitespace characters at the end with an empty string (basically, we remove them).
At this point, the string will look like this
{ "Time": "10:46:29 am", "car1": "Volvo", "car2": "Ferarri", "car3": "VW" }
and now we simply parse it as a JSON string with JSON.parse to get the JavaScript object.
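One step the example above skips: the text returned by .html() for the script tag still begins with var myData = , so the object literal has to be cut out before the replacements are applied. The following is a minimal sketch of that, not a general JavaScript parser. `parseScriptObject` is a hypothetical helper name, the key pattern is slightly more permissive than the one above (it tolerates whitespace before the colon), and it assumes a flat object with no nested braces, no } inside string values, and a plain identifier as the variable name.

```javascript
// Hypothetical helper (not part of cheerio or request): extracts the object
// literal assigned to `varName` from raw script text and parses it.
// Assumptions: flat object, no nested braces, no "}" or "'" inside values,
// and `varName` contains no regex special characters.
function parseScriptObject(scriptText, varName) {
  // Capture from the "{" after "var <name> =" up to the first "}".
  const pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*(\\{[^}]*\\})');
  const match = scriptText.match(pattern);
  if (!match) {
    return null; // variable not found in this script
  }
  const jsonText = match[1]
    .replace(/([A-Za-z_]\w*)\s*:/g, '"$1":') // wrap keys in double quotes
    .replace(/'/g, '"');                      // single quotes to double quotes
  return JSON.parse(jsonText);
}

// Usage with the script content from the question:
const scriptText =
  "var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };";
const parsed = parseScriptObject(scriptText, 'myData');
console.log(parsed.car1, parsed.car2, parsed.car3);
// Volvo Ferarri VW
```

Because the capture group stops at the closing brace, the trailing ; never reaches JSON.parse, so the third replacement is not needed in this variant.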