As the question states, I use cURL for web scraping, and the response I get includes all the HTML elements but without proper indentation.
curl somewebsite.com/somepage > scrape.html
After this command the data is saved in scrape.html (or scrape.txt), but the contents look very messy and are mostly on a single line.
The content of the file looks like this:
<!DOCTYPE html><html lang="en"><head><script src="/cdn-cgi/apps/head/a2ff1ftsK3yTu21p1BeEN2BZsnA.js"></script><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&family=DM+Sans:wght@400&display=swap" rel="stylesheet" media="print" onload="if(!window._isAppPrerendering)this.removeAttribute("media");"><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&family=DM+Sans:wght@400&display=swap" rel="preload" as="style"><link href="https://fonts.gstatic.com" rel="preconnect" crossorigin="true"><meta charset="utf-8">
As you can see above, it's all on one line, and it continues like that until </html>.
Is there any option in curl, or any other easy way, to get the output of a scraped web page properly indented?
I am OK with a solution in PHP, JavaScript, or Node.js.
Thank you in advance.
I couldn't find a solution to the problem, and no one answered either.
My solution is to use a beautifying tool such as
https://beautifytools.com/html-beautifier.php#
This tool actually works well for large websites with large scripts and styles.
curl -s somewebsite.com/somepage | php -r '$d=new DOMDocument();$d->preserveWhiteSpace=false;$d->formatOutput=true;@$d->loadHTML(stream_get_contents(STDIN), LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS);echo $d->saveXML();' > scrape.html
You can use tidy, the granddaddy of HTML tools. Install it, then pipe the curl
output to it.
sudo apt install tidy
then
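curl -s http://www.example.com | tidy -indent
This should give you tidy HTML code. The -indent option, which is off by default, indents the nested block-level tags, and -s keeps curl's progress meter out of the terminal.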
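Since the question also allows a Node.js solution, here is a minimal, dependency-free sketch of what an indenter does. It is not a real HTML parser: the function name indentHtml, the two-space indent, and the sample string are my own assumptions, and it will be confused by a ">" inside an attribute value or by unclosed tags. It also puts every text run on its own line, which can add whitespace around inline tags such as <b>.

```javascript
// Naive HTML indenter: a sketch for readability, not a full HTML parser.
// Elements that never take a closing tag, so they must not increase the depth.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

function indentHtml(html, indentUnit = "  ") {
  // Tokens: comments, whole <script>/<style> elements, single tags, text runs.
  const tokenPattern =
    /<!--[\s\S]*?-->|<(script|style)\b[\s\S]*?<\/\1\s*>|<[^>]+>|[^<]+/gi;
  const tokens = html.match(tokenPattern) || [];
  const lines = [];
  let depth = 0;

  for (const raw of tokens) {
    const token = raw.trim();
    if (token === "") continue; // skip whitespace between tags

    const closing = /^<\/([a-zA-Z][\w-]*)/.exec(token);
    const opening = /^<([a-zA-Z][\w-]*)/.exec(token);

    if (closing) {
      // A closing tag lines up with its opening tag.
      depth = Math.max(depth - 1, 0);
      lines.push(indentUnit.repeat(depth) + token);
    } else if (opening) {
      lines.push(indentUnit.repeat(depth) + token);
      const name = opening[1].toLowerCase();
      // <script>/<style> tokens already contain their own closing tag.
      const selfContained =
        VOID_TAGS.has(name) ||
        token.endsWith("/>") ||
        name === "script" ||
        name === "style";
      if (!selfContained) depth += 1;
    } else {
      // Text, comments, and <!DOCTYPE ...> stay at the current depth.
      lines.push(indentUnit.repeat(depth) + token);
    }
  }
  return lines.join("\n");
}

// Hypothetical sample input, standing in for the one-line curl output.
const sample =
  '<!DOCTYPE html><html><head><meta charset="utf-8"></head>' +
  "<body><p>Hi</p></body></html>";
console.log(indentHtml(sample));
```

To run it on real curl output you would read standard input instead of the sample string and pipe curl into the script. For anything beyond eyeballing the markup, tidy or an established formatter is the safer choice.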