
Scrape web page data generated by JavaScript

My question is: how can I scrape data from the website http://vtis.vn/index.aspx? The data is not shown until you click on, for example, "Danh sách chậm". I have looked into this carefully: clicking "Danh sách chậm" fires an onclick event that triggers some JavaScript functions. One of them fetches the data from the server and inserts it into a placeholder tag. At that point you can inspect the data with something like Firefox, and it is displayed to visitors on the page. So again, how can this data be scraped programmatically?

I wrote a scraping function, but of course it does not get the data I want, because the data is not available until I click the "Danh sách chậm" button.

<?php
// Fetch the static HTML only; content injected by JavaScript after a click is not included.
$Page = file_get_contents('http://vtis.vn/index.aspx');
$dom_document = new DOMDocument();
// Suppress warnings caused by malformed real-world markup.
@$dom_document->loadHTML($Page);
$dom_xpath = new DOMXPath($dom_document);
$elements = $dom_xpath->query("//td[@class='IconMenuColumn']");
foreach ($elements as $element) {
    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
        // DOM methods return UTF-8, so that is the source encoding here.
        echo mb_convert_encoding($node->C14N(), 'ISO-8859-1', 'UTF-8');
    }
}

You need to look at PhantomJS.

From their site:

PhantomJS is a headless WebKit with JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Using the API you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it, including passing it to a PHP script if necessary.


That being said, if at all possible, try not to "scrape" the data. If the page is making an ajax call, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.
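If the browser's network tab shows that the click simply issues an HTTP request, that request can sometimes be replayed without any browser at all. The following is a minimal sketch in PHP, assuming a form-encoded POST endpoint that returns JSON. The helper names `postForm` and `decodeJsonBody` are made up for illustration, and the real URL, parameters, and response format for a given site have to be read off the network tab.

```php
<?php
// Sketch: replay the request that the click triggers instead of driving a browser.
// Any URL or field name passed to these helpers is a placeholder, not a known API.

/**
 * Send a form-encoded POST request and return the raw response body.
 */
function postForm(string $url, array $fields): string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => 15,
        ],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Request to $url failed");
    }
    return $body;
}

/**
 * Decode a JSON response body into an array, failing loudly on anything else.
 */
function decodeJsonBody(string $body): array
{
    $data = json_decode($body, true);
    if (!is_array($data)) {
        throw new UnexpectedValueException('Response is not a JSON array or object');
    }
    return $data;
}
```

A call would then look something like `decodeJsonBody(postForm('https://example.com/hypothetical-endpoint', ['list' => 'late']))`, where both the URL and the `list` field are invented for illustration. If the server answers with an HTML fragment rather than JSON, the body can be handed to DOMDocument instead.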

First, you need PhantomJS. Suggested install method on Linux:

wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvf phantomjs-2.1.1-linux-x86_64.tar.bz2
cp phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin

Second, you need the php-phantomjs package. Assuming you have installed Composer:

composer require jonnyw/php-phantomjs

Or follow the package's installation documentation.

Third, load the package in your script and, instead of file_get_contents, load the page via PhantomJS:

<?php
require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;

$client = Client::getInstance();
// Point the client at the PhantomJS binary installed above.
$client->getEngine()->setPath('/usr/local/bin/phantomjs');

$request  = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();

$request->setMethod('GET');
$request->setUrl('https://www.your_page_embeded_ajax_request');

// PhantomJS loads the page, runs its JavaScript, and fills $response.
$client->send($request, $response);

if ($response->getStatus() === 200) {
    echo "Do something here";
}
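The "Do something here" step usually means parsing the rendered markup. As a rough sketch, assuming the rendered HTML is available as a string (with php-phantomjs that would be `$response->getContent()`), a small helper can pull text out with XPath. The function name `extractText` is made up for illustration, and the `IconMenuColumn` class is borrowed from the question only as an example selector.

```php
<?php
/**
 * Return the trimmed text of every node matched by an XPath expression.
 */
function extractText(string $html, string $xpathQuery): array
{
    $document = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The leading XML declaration is a common workaround that hints UTF-8 to the parser,
    // so non-ASCII text (such as Vietnamese) is less likely to be mangled.
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $texts = [];
    foreach ($xpath->query($xpathQuery) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}
```

Inside the status check this could be called as `extractText($response->getContent(), "//td[@class='IconMenuColumn']")`. The XPath expression has to be adapted to whatever element the page's JavaScript actually fills.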
