DOMDocument - Scrape data

I have a jQuery script embedded in a webpage that I am scraping with Tampermonkey. It works well, but it is posting the entire body of the HTML back to my server.

The HTML page that I am scraping has this code embedded in it:

// Hidden iframe so the form post does not navigate away from the page
jQuery(document.body).append("<iframe id='somenewtab' name='somenewtab' />");
// Form that posts the captured markup into the iframe
jQuery(document.body).append("<form action='https://example.com/test.php' target='somenewtab' id='form_submit_data' method='post'><input type='hidden' name='data' id='submit_data'><input type='submit' value=''></form>");
// UTF-8-safe Base64 encoding of the whole <body> markup
jQuery("#submit_data").val(btoa(unescape(encodeURIComponent(document.body.innerHTML))));
jQuery("#form_submit_data").submit();

The script grabs all of the HTML and then posts it to a PHP script, which parses the data.
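
As an aside, the btoa(unescape(encodeURIComponent(...))) chain is the usual browser workaround for Base64-encoding Unicode text, since btoa() by itself only accepts Latin-1 characters (unescape() is deprecated but still widely supported). A quick illustration:

// btoa() alone throws InvalidCharacterError on characters above U+00FF:
// btoa("日本語");
// Percent-encoding to UTF-8 first, then mapping each %XX escape to a raw byte, works:
var b64 = btoa(unescape(encodeURIComponent("日本語")));
// PHP's base64_decode() then returns the original UTF-8 bytes.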

test.php

libxml_use_internal_errors(true);            // silence warnings from imperfect real-world markup
$data = base64_decode($_POST['data']);       // recover the raw HTML string
$dom = new DOMDocument();
$dom->loadHTML($data);
$select = $dom->getElementById('portfolio'); // the element whose contents get parsed

My question is: is there a way to post only the body of the HTML, without all of the head information, or better yet, to post back only what is inside the element that getElementById('portfolio') returns? The contents of that element are the only data I need to parse.

Currently it posts everything on the HTML webpage, and the requests are bumping up against the server's POST size limit.
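
For reference, a minimal client-side sketch of what is being asked, assuming the scraped page really does contain an element with id='portfolio', would encode that element's markup instead of document.body.innerHTML:

// Post only the #portfolio element's markup instead of the whole body
var el = document.getElementById('portfolio');
var html = el ? el.outerHTML : '';  // send an empty payload if the element is missing
jQuery("#submit_data").val(btoa(unescape(encodeURIComponent(html))));
jQuery("#form_submit_data").submit();

The test.php side would need no change: loadHTML() wraps the fragment in <html><body>, and getElementById('portfolio') still finds the element.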

You can use a wrapper based on the "simplehtmldom" project (available on SourceForge), get the text/HTML of the DOM element, and post that.

https://github.com/sachinsinghshekhawat/simple-html-dom-parser-php
