
How to scrape a website using curl and get properly indented HTML elements

As the question states, I use curl for web scraping, and I get a response that includes all the HTML elements, but they are not properly indented.

curl somewebsite.com/somepage > scrape.html

After this command, the data is saved in the scrape.txt or scrape.html file, but the contents look very messy and are mostly on a single line.

The content of the file looks like this:

<!DOCTYPE html><html lang="en"><head><script src="/cdn-cgi/apps/head/a2ff1ftsK3yTu21p1BeEN2BZsnA.js"></script><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="stylesheet" media="print" onload="if(!window._isAppPrerendering)this.removeAttribute(&quot;media&quot;);"><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="preload" as="style"><link href="https://fonts.gstatic.com" rel="preconnect" crossorigin="true"><meta charset="utf-8">

As you can see above, it is all on one line, and it continues like that until </html>.

Is there any option in curl, or any other easy way, to get the output of a scraped web page with proper indentation?

I am OK with a solution in PHP, JavaScript, or Node.js.

Thank you in advance.

I couldn't find a solution to this problem, and no one answered either.

My solution is to use a beautifying tool such as

https://beautifytools.com/html-beautifier.php#

This tool actually works well for large websites with large scripts and styles.

curl somewebsite.com/somepage | php -r '$d=new DOMDocument();$d->preserveWhiteSpace=false;$d->formatOutput=true;@$d->loadHTML(stream_get_contents(STDIN), LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS);echo $d->saveXML();' > scrape.html

You can use tidy, the granddaddy of HTML tools. Install it, then pipe the curl output to it.

sudo apt install tidy

then

curl http://www.example.com | tidy

This should give you cleaned-up HTML, with inline tags left inline.
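Note that tidy does not indent by default, so to get the indentation the question asks for you would likely want to turn it on explicitly. Below is a minimal sketch, assuming tidy is installed; the function name pretty_html and the URL in the usage comment are placeholders of mine, not anything from the answers above.

```shell
# Hypothetical helper: reads HTML on stdin, writes indented HTML on stdout.
pretty_html() {
  # -indent                indent nested elements
  # -wrap 0                do not wrap long lines
  # -quiet                 suppress the summary banner
  # --show-warnings no     hide markup warnings
  # --tidy-mark no         do not add a "generator" meta tag
  # tidy exits with 1 when it only found warnings, so treat that as success;
  # an exit status of 2 (errors) still makes the function fail.
  tidy -indent -wrap 0 -quiet --show-warnings no --tidy-mark no || [ "$?" -eq 1 ]
}

# Usage (placeholder URL):
#   curl -fsSL https://www.example.com | pretty_html > scrape.html
```

The curl flags in the usage comment are optional: -f fails on HTTP errors, -sS hides the progress meter but keeps error messages, and -L follows redirects.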

