Extract html sourcecode from a javascript generated output

Original 2020-02-23 14:47:52 3 1 javascript/ html/ beautifulsoup/ text-extraction

I am currently working on a project to find empty classrooms in our school in real time. For that purpose, I need to extract the substitutions published on our school page (https://ssnovohradska.edupage.org/substitution/), since there might be additional changes.

But when I fetch the HTML source code and parse it with bs4, it cannot find the divs (class: "section print-nobreak") that contain the substitution text. When I looked at the page source (Ctrl+U), I found that there is only JavaScript that prints it all directly.

Is there any way to extract the HTML after the JavaScript output has already been rendered?

Thanks for help!

1 answer

Parsing HTML is unfortunately necessary to solve your problem. But I will explain how to find ways to avoid that in your future projects (not based on this website).

1. You've correctly noticed that the text is created by JavaScript code running on the page. This could also indicate that the data is either loaded from another resource (an XHR/fetch call getting a response from an API) or stored as JSON/JS inside the website's code. (Or it is generated by an algorithm, but that is unlikely for this kind of website.)
2. The website actually uses both methods (the initial render uses data stored inside the website's code, but when you switch dates on the calendar it makes AJAX requests). You can see this by searching for ReactDOM.render(React.createElement( in the code. They're passing an HTML string to the createElement call, so I would suggest looking into the AJAX way of doing things.
3. Now, to check where the resource is located, all you need to do is open Developer Tools in your favorite browser (usually Control+Shift+I) and navigate to the Network tab. With the Network tab open, you need to cause the website to load external data, for example by clicking a date on the "calendar bar".
4. Here you will notice many external requests, but we're only looking for XHR calls. Click the XHR button next to the "Filter" text field. That should leave only one request in the list.

Unfortunately for us, the response only contains HTML. Also, the API calls are protected: they require a PHP session ID and some sort of token (__gsh) to succeed. So, going back to step 1, it seems our only solution is to use regular expressions to find the text between "report_html":"<div class and </div></div></div> in the source code, if you're interested in today's date only. If you want the contents for tomorrow or any other date, you will need to either fetch the page, save the cookies, find the token, and supply both with the request, or use something like puppeteer or pyppeteer (since you've mentioned BS4) and load the webpage in that. If you aren't fetching the data that often, you should be fine overall.
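For the "today only" case, the regular-expression idea could look roughly like the sketch below. It is a variant of the approach described above: instead of stopping at the closing </div></div></div>, it captures the whole JSON string literal that follows "report_html" and decodes it with json.loads. The function name, the sample string, and the assumption that the value is a valid JSON string (possibly with escaped quotes and slashes) are mine and have not been checked against the live page.

```python
import json
import re

# Matches the JSON string literal that follows the "report_html" key.
# Inside the quotes: any character except " or \, or a backslash escape.
REPORT_RE = re.compile(r'"report_html"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_report_html(page_source):
    """Return the decoded report_html value, or None if the key is absent.

    Hypothetical helper: assumes the page embeds the HTML as a JSON string.
    """
    match = REPORT_RE.search(page_source)
    if match is None:
        return None
    # json.loads undoes JSON escaping such as \" and \/
    return json.loads(match.group(1))


# Made-up sample that imitates embedded data; the real page will differ.
sample = (
    'ReactDOM.render(React.createElement(X, {"report_html":'
    '"<div class=\\"section print-nobreak\\"><div>Room 12<\\/div><\\/div>",'
    '"other":1}), el);'
)

print(extract_report_html(sample))
```

The returned string is ordinary HTML, so it can then be handed to BeautifulSoup to look up the "section print-nobreak" divs. Downloading the page source is left out here and would have to be added.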

Save full html page sourcecode in a Javascript variable

Trying to understand this function in javascript from a website sourcecode

How to extract the dynamically generated HTML from a website

Ajax For HTML Generated From Javascript

How to extract user-visible HTML from a TWebBrowser that's generated by javascript

How to extract input tag value from a javascript generated HTML page, using PHP?

How to extract content from source generated by javascript

Output html from Javascript

Extract text from HTML with Javascript

How to get html with javascript rendered sourcecode by using selenium

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 0 2020-02-23 15:10:55
