
Python 3.7 urllib.request returns &nbsp; instead of content

I wrote some code that reads and prints everything between specified pieces of text in HTML source. For example, it reads everything between paragraph tags and prints it. This is from a sentdex lesson - here

The code itself works; the problem is with what it returns. I filtered with a very specific pattern:

paragraphs = re.findall(r'<div style="font-size: 23px; margin-top: 20px;" class="jsdfx-sentiment-present">(.*?)</div>',str(respData))

As already mentioned, it works. The captured content is then printed, and what gets printed is &nbsp;. As I understand it, this is a non-breaking space in HTML. Instead of a space I expected to see numbers. On this website, the numbers in this location update every few seconds.

How can I get these numbers instead of receiving &nbsp;?

Regards!

It depends on how exactly you're downloading the page, and from where. But because you say the value changes constantly when you look at it in a web browser, I'd wager that when you download the page, &nbsp; is exactly what's inside that div, and the page changes it on the fly via JavaScript or something similar while you're actually viewing it. Your tutorial uses a static tag, one that's the same every time you load the page, rather than one that gets set dynamically after the page is already active.

It's fairly common to do this in web development for dynamic values: put a placeholder value in a div, and then dynamically edit the content as appropriate. Of course, if you just take a snapshot of the page (and even more so if you take that snapshot before the JavaScript that would have filled in the value has had a chance to run), you're not going to see the change. You get only the default value, without the number filled in.
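To make the snapshot point concrete, here is a minimal sketch. It assumes the served HTML looks like the hypothetical `SNAPSHOT` string below (the real page source was not shown), and the helper names `extract` and `is_placeholder` are made up for illustration. It applies the regex from the question to that static source and checks whether the match is just the placeholder.

```python
import html
import re

# Hypothetical snapshot of the served HTML, i.e. what a plain download
# would see before any JavaScript has run. The real page may differ.
SNAPSHOT = (
    '<div style="font-size: 23px; margin-top: 20px;" '
    'class="jsdfx-sentiment-present">&nbsp;</div>'
)

# The same pattern as in the question, split over two lines for readability.
PATTERN = re.compile(
    r'<div style="font-size: 23px; margin-top: 20px;" '
    r'class="jsdfx-sentiment-present">(.*?)</div>'
)


def extract(page_source):
    """Return every captured div body found in the given HTML text."""
    return PATTERN.findall(page_source)


def is_placeholder(text):
    """True if the captured text is empty once entities are decoded.

    html.unescape turns "&nbsp;" into U+00A0, which str.strip() treats
    as whitespace, so a lone non-breaking space counts as "no content".
    """
    return html.unescape(text).strip() == ""


matches = extract(SNAPSHOT)
print(matches)
print([is_placeholder(m) for m in matches])
```

If `is_placeholder` reports True for the text you downloaded, the regex is doing its job and the number simply is not in the static source.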

Based on the tutorial you linked, you're probably using urllib . If you want to get dynamic content from an HTML page, that's probably not the best tool to use. You should look into selenium and BeautifulSoup . This StackOverflow Answer goes into a lot more detail on effective solutions to this problem.
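A rough sketch of the selenium plus BeautifulSoup approach might look like the following. It is not a definitive implementation: it assumes the `selenium` and `beautifulsoup4` packages and a Chrome driver are installed, the function names are hypothetical, and the CSS selector is only a guess based on the class name in the question. The idea is to let a real browser run the page's JavaScript, wait until the div holds something other than whitespace, and only then parse the rendered source.

```python
def looks_filled(text):
    """True once the text is more than whitespace (NBSP counts as whitespace)."""
    return text.strip() != ""


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load the page in a browser and return its HTML after the value appears."""
    # Third-party imports are kept inside the function so that the rest of
    # this module can be loaded even where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: a working Chrome driver is available
    try:
        driver.get(url)
        # Poll until the element exists and holds real text. WebDriverWait
        # ignores NoSuchElementException by default, so an element that has
        # not been created yet just means "keep waiting".
        WebDriverWait(driver, timeout).until(
            lambda d: looks_filled(
                d.find_element(By.CSS_SELECTOR, css_selector).text
            )
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_values(page_source, css_selector):
    """Return the stripped text of every element matching the selector."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_source, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(css_selector)]


# Example usage, with page_url standing in for the address of the page:
#   selector = "div.jsdfx-sentiment-present"
#   rendered = fetch_rendered_html(page_url, selector)
#   print(extract_values(rendered, selector))
```

Waiting on the element's text rather than sleeping for a fixed time avoids grabbing the snapshot before the script has run, which is the failure described above. If the page fetches its numbers from a separate endpoint, requesting that endpoint directly may be simpler, but that depends on how the site is built.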

