
Dom Document - Scrape data

I have a jQuery script embedded in a webpage that I am scraping with Tampermonkey. It works well, but it posts the entire body of the HTML back to my server.

The script embedded in the HTML page I am scraping contains this code:

jQuery(document.body).append("<iframe id='somenewtab' name='somenewtab' />");
jQuery(document.body).append(
    "<form action='https://example.com/test.php' target='somenewtab' id='form_submit_data' method='post'>" +
    "<input type='hidden' name='data' id='submit_data'><input type='submit' value=''></form>"
);
jQuery("#submit_data").val( btoa(unescape(encodeURIComponent(document.body.innerHTML) )));
jQuery("#form_submit_data").submit();

The script grabs all of the HTML and posts it to a PHP script, which parses the data.

test.php

$data = base64_decode($_POST['data']);
$dom = new DOMDocument();
$dom->loadHTML($data);
$select = $dom->getElementById('portfolio');

My question is: is there a way to post only the body of the HTML without all of the head information, or better yet, to post back only what is inside the element returned by getElementById('portfolio')? The data in that element is the only data I need to parse.

Currently it posts everything in the HTML page, and the server is getting bogged down by the POST size limit.

You can use a wrapper based on the "simplehtmldom" project available on SourceForge to get the text/HTML of the DOM element, and then post it.

https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php
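
The idea of taking the HTML of a single element and posting only that can also be applied on the client side, before the form is submitted, which is where the POST size is decided. The following is a minimal sketch, not a tested drop-in: it assumes the page really contains an element with id 'portfolio', that the form and hidden input from the question have already been appended, and that the runtime provides TextEncoder and btoa. The helper names encodeUtf8Base64 and buildPortfolioPayload are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper: Base64-encode a string over its UTF-8 bytes.
// Equivalent in effect to btoa(unescape(encodeURIComponent(text))).
function encodeUtf8Base64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b);
  }
  return btoa(binary);
}

// Hypothetical helper: return the encoded HTML of #portfolio only,
// or null when the element is not present in the given document.
function buildPortfolioPayload(doc) {
  const el = doc.getElementById("portfolio");
  return el ? encodeUtf8Base64(el.outerHTML) : null;
}

// Browser-only part: replaces the line that posted document.body.innerHTML.
if (typeof document !== "undefined" && typeof jQuery !== "undefined") {
  const payload = buildPortfolioPayload(document);
  if (payload !== null) {
    jQuery("#submit_data").val(payload);
    jQuery("#form_submit_data").submit();
  }
}
```

Because outerHTML includes the element's own opening tag, the posted fragment still carries id='portfolio', so the getElementById('portfolio') call in test.php should keep working on the smaller input. Using innerHTML instead would shrink the payload slightly further, but the server would then need to parse the fragment without that lookup.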
