How to scrape a website using curl with proper indentation of HTML elements

Question

As the title states, I use curl for web scraping, and the response I get includes all the HTML elements but without proper indentation.

curl somewebsite.com/somepage > scrape.html

After this command, the data is saved to the output file (scrape.html or scrape.txt), but the contents look very messy and are mostly on a single line.

The content of the file looks like this:

<!DOCTYPE html><html lang="en"><head><script src="/cdn-cgi/apps/head/a2ff1ftsK3yTu21p1BeEN2BZsnA.js"></script><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="stylesheet" media="print" onload="if(!window._isAppPrerendering)this.removeAttribute(&quot;media&quot;);"><link href="https://fonts.googleapis.com/css2?family=DM+Sans:wght@400;700&amp;family=DM+Sans:wght@400&amp;display=swap" rel="preload" as="style"><link href="https://fonts.gstatic.com" rel="preconnect" crossorigin="true"><meta charset="utf-8">

As you can see above, it is all on one line, and it continues like that until </html>.

Is there any option in curl, or any other easy way, to get the output of a scraped web page properly indented?

I am OK with a solution in PHP, JavaScript, or Node.js.

Thank you in advance.

Answer 1

I couldn't find a solution to the problem, and no one had answered either.

My workaround is to use a beautifying tool such as

https://beautifytools.com/html-beautifier.php#

This tool works well even for large websites with large scripts and styles.

Answer 2

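Pipe the curl output through PHP's DOMDocument, which can re-serialize the markup with indentation when formatOutput is enabled:

curl somewebsite.com/somepage | php -r '$d=new DOMDocument();$d->preserveWhiteSpace=false;$d->formatOutput=true;@$d->loadHTML(stream_get_contents(STDIN), LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS);echo $d->saveXML();' > scrape.html

Note that saveXML() writes XML-style output, so the file starts with an XML declaration.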

Answer 3

You can use tidy, the granddaddy of HTML tools. Install it, then pipe the curl output to it.

sudo apt install tidy

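Then run:

curl -s http://www.example.com | tidy -i

This prints a tidied version of the HTML. The -i option turns on indentation, which tidy leaves off by default, and curl's -s option hides the progress meter.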
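Since the question also allows JavaScript or Node.js, here is a minimal, dependency-free sketch of the same idea: split the single-line markup into tags and text, then print one token per line at the current nesting depth. The function name indentHtml and the sample string are hypothetical, and the regex tokenizer is an assumption made for brevity. It is not a real HTML parser, so inline scripts, comments, or attribute values containing "<" or ">" will confuse it. For real pages, tidy or a proper parser is the safer choice.

```javascript
// Naive HTML indenter: a sketch, not a full HTML parser.
// Void elements never get a closing tag, so they must not increase the depth.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'source', 'track', 'wbr',
]);

// indentHtml is a hypothetical helper name, not part of any library.
function indentHtml(html, indent = '  ') {
  // Split the input into tags (<...>) and the text runs between them.
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  const lines = [];
  let depth = 0;

  for (const token of tokens) {
    const text = token.trim();
    if (text === '') continue; // skip whitespace-only text between tags

    const closing = /^<\//.test(text);       // e.g. </div>
    const opening = /^<[a-zA-Z]/.test(text); // e.g. <div class="x">

    // A closing tag sits at the same depth as its opening tag.
    if (closing) depth = Math.max(depth - 1, 0);
    lines.push(indent.repeat(depth) + text);

    // Only non-void, non-self-closing opening tags nest their children.
    if (opening && !text.endsWith('/>')) {
      const name = text.match(/^<([a-zA-Z0-9-]+)/)[1].toLowerCase();
      if (!VOID_TAGS.has(name)) depth += 1;
    }
  }
  return lines.join('\n');
}

// Made-up sample input in the same single-line shape as the scraped page.
const sample = '<!DOCTYPE html><html lang="en"><head><meta charset="utf-8">' +
  '</head><body><p>Hello</p></body></html>';
console.log(indentHtml(sample));
```

To use it on real curl output, the script could read the page from standard input or from the saved file and pass the text to indentHtml in place of the sample string.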
