
Scraping a webpage with JavaScript?

I'm looking for a way to get all the sentences in 'lead' tags from this webpage: http://taz.de/!p4633/c.xml and put them into an array. Can a JavaScript program get information from the web like that?

For example, where it says

<lead>Sentence1 blablabla. sentence2 bla bla bla.</lead>
<headline>something else</headline>
<lead>sentence3 blablabla. sentence4 bla bla.</lead>

I'd like to get the strings like so:

var sentences = ["Sentence1 blablabla.", "sentence2 bla bla bla.", "sentence3 blablabla.", "sentence4 bla bla."];

The reason is that I want to make a Twitter bot that answers with random sentences from this newspaper's website. I searched for web scraping tutorials, but I'm not familiar with Node.js and couldn't get any of the other tools to work either, because I know so little about programming.

Can a javascript program get information from the web like that?

Yes.

You'll need to know about Node's HTTP module, particularly http.get. Then you'll need an XML parser. There should be a bunch floating around in npm; choose any one. Get the XML, parse the XML, pick the pieces of data you want, and put them in an array.
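The steps above might be sketched roughly as follows. This is only an illustration under a few assumptions: `fetchText` and `extractLeads` are hypothetical helper names, the download uses Node's built-in `http.get` without following redirects, and a simple regular expression stands in for a real XML parser from npm (it ignores CDATA sections, entities, and nested markup). The demonstration runs offline on the sample from the question.

```javascript
const http = require("http");

// Hypothetical helper: download a URL and resolve with the body as a string.
// Assumption: plain http, no redirects; a real bot would likely need both handled.
function fetchText(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error("Unexpected status code: " + res.statusCode));
        return;
      }
      res.setEncoding("utf8");
      let body = "";
      res.on("data", (chunk) => { body += chunk; });
      res.on("end", () => resolve(body));
    }).on("error", reject);
  });
}

// Hypothetical helper: return the text inside every <lead> element.
// A regex stand-in for an XML parser, good enough for flat text content only.
function extractLeads(xml) {
  const leads = [];
  const pattern = /<lead\b[^>]*>([\s\S]*?)<\/lead>/g;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    leads.push(match[1].trim());
  }
  return leads;
}

// Offline demonstration using the sample markup from the question.
const sampleXml =
  "<lead>Sentence1 blablabla. sentence2 bla bla bla.</lead>\n" +
  "<headline>something else</headline>\n" +
  "<lead>sentence3 blablabla. sentence4 bla bla.</lead>";
const leads = extractLeads(sampleXml);
console.log(leads);
```

Against the live feed, the same pieces would be chained as `fetchText("http://taz.de/!p4633/c.xml").then(extractLeads)`, assuming the server answers that URL directly with status 200.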

This should work for you. I couldn't get a real response from your webpage because of a firewall, so the example uses a fake response. Try this code and tell us how it goes:

var fakeResponse = "<xml><lead>Sentence1 blablabla. sentence2 bla bla bla.</lead></xml>";

function processResponse(response) {
    var parser = new DOMParser();
    var xmlDoc = parser.parseFromString(response, "text/xml"); // important to use "text/xml"
    var leads = xmlDoc.getElementsByTagName("lead");
    for (var i = 0; i < leads.length; i++) {
        var html = leads[i].innerHTML;
        console.log("item " + i + "=>" + html);
    }
}

var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function () {
    if (this.readyState == 4 && this.status == 200) {
        var responseData = this.responseText;
        responseData = fakeResponse; // delete this line, just for testing
        processResponse(responseData);
    }
};

var your_url = "https://jsonplaceholder.typicode.com/posts/1"; // update with the url of your webservice
xhttp.open("GET", your_url, true);
xhttp.send();
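To get from the contents of each lead tag to the array of single sentences the question asks for, the text still has to be split. A minimal sketch, assuming a sentence simply ends at ".", "!" or "?" (so abbreviations would be split incorrectly); `splitSentences` and `pickRandom` are hypothetical helper names, and `leadTexts` stands in for whatever the parsing step produced:

```javascript
// Hypothetical helper: split one block of text into trimmed sentences.
// Naive rule: a sentence runs up to and including ".", "!" or "?".
function splitSentences(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

// Hypothetical helper: pick one random element from a non-empty array.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

// Stand-in for the text extracted from the <lead> elements.
const leadTexts = [
  "Sentence1 blablabla. sentence2 bla bla bla.",
  "sentence3 blablabla. sentence4 bla bla.",
];

// Flatten the per-lead sentence lists into one array.
const sentences = leadTexts.flatMap(splitSentences);
console.log(sentences);

// One sentence chosen at random, e.g. as the bot's reply text.
console.log(pickRandom(sentences));
```

For the sample input this yields the four strings listed in the question, and `pickRandom(sentences)` returns one of them for the bot to post.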
