
How do I detect in Python whether a web page is rendered dynamically with JavaScript?

I am building a web scraper that has to quickly retrieve the text of a web page from the HTML only. I'm using Python, requests and BeautifulSoup. I would like to detect whether the page content is plain HTML or is rendered by JavaScript. In the latter case, I would just return an error message saying that the page cannot be scraped.

I know that headless browsers can render the JavaScript, but in this case I only need to detect it as fast as possible, without rendering anything.

Simply looking for script tags doesn't work: almost every web page has many of them, and their presence doesn't necessarily mean the text content is rendered by JavaScript.

Is there something I could check in the HTML that tells me accurately that the body content will be rendered by JavaScript?

Thank you

Nothing in the initial HTML tells you reliably in advance that the site is rendered with JS, so you can only use heuristics. Here are some things you could try:

  • Analyze several websites and guess whether a site is rendered with JS based on the size of the page's content in the raw HTML.
  • Fetch the HTML of several different pages of the site and compare their content (for a JS-rendered site, the contents of different pages are likely to be the same or very similar before any code is executed).
  • Check the size of the scripts, or look for script names of well-known frameworks such as React, Vue and Angular.
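
The three ideas above could be sketched roughly as follows. This is only a heuristic under stated assumptions, not a reliable detector: the marker list, the 200-character threshold and the function names are all hypothetical choices made for illustration, and they would need tuning against real sites. The sketch uses only the standard library so it runs on its own; the same text extraction could be done with BeautifulSoup on the HTML returned by requests.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical marker list: substrings that often show up in pages built with
# client-side frameworks. Not exhaustive, and no single marker is proof.
FRAMEWORK_MARKERS = ("react", "vue", "angular", "ng-app",
                     "data-reactroot", "__next_data__", "__nuxt__")


class _PageMeter(HTMLParser):
    """Collects visible text and measures the scripts in an HTML document."""

    def __init__(self):
        super().__init__()
        self._skip = None       # tag whose content is currently being skipped
        self.text_parts = []    # text outside script/style/noscript
        self.script_chars = 0   # total length of inline script code
        self.script_srcs = []   # URLs of external scripts

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "noscript"):
            self._skip = tag
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._skip == "script":
            self.script_chars += len(data)
        elif self._skip is None:
            self.text_parts.append(data)


def measure(html):
    """Return (visible_text, inline_script_chars, external_script_urls)."""
    meter = _PageMeter()
    meter.feed(html)
    meter.close()
    text = " ".join(" ".join(meter.text_parts).split())  # collapse whitespace
    return text, meter.script_chars, meter.script_srcs


def looks_js_rendered(html, min_text_chars=200):
    """Guess that a page is JS-rendered when the raw HTML carries very little
    visible text AND shows some sign of client-side code."""
    text, script_chars, script_srcs = measure(html)
    lowered = html.lower()
    has_marker = any(marker in lowered for marker in FRAMEWORK_MARKERS)
    too_little_text = len(text) < min_text_chars
    return too_little_text and (
        has_marker or script_chars > len(text) or bool(script_srcs)
    )


def page_similarity(html_a, html_b):
    """Similarity (0.0 to 1.0) of the visible text of two pages. Values near
    1.0 for different URLs of one site suggest an empty JS shell."""
    text_a, _, _ = measure(html_a)
    text_b, _, _ = measure(html_b)
    return SequenceMatcher(None, text_a, text_b).ratio()
```

With requests, one would pass `requests.get(url).text` to `looks_js_rendered`, and return the error message when it is true. Expect both false positives (short static pages that load an analytics script) and false negatives (server-rendered pages that still fill in part of their content with JS), so treat the result as a guess.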


 