
Python 3.7 urllib.request returns &nbsp; instead of content

I wrote some code that reads and prints everything between specified pieces of text in HTML, for example everything between paragraph tags. It is based on a sentdex lesson.

There is no problem with the code itself, but rather with what comes out. I filtered with a very specific pattern:

paragraphs = re.findall(r'<div style="font-size: 23px; margin-top: 20px;" class="jsdfx-sentiment-present">(.*?)</div>',str(respData))

As already mentioned, it works: the content is printed, but what it prints is &nbsp;. As I understand it, this is a non-breaking space in HTML. Instead of a space I expected to see numbers. On the website, the numbers in this location update every few seconds.

How can I get these numbers instead of receiving &nbsp;?

Regards!

It depends on how exactly you're downloading the page, and from where, but because you say the value changes constantly when you look at it in a web browser, I'd wager that when you download the page, &nbsp; is exactly what's inside that div, and the page changes it on the fly via JavaScript while you're actually viewing it. Your tutorial uses a static tag, one that's the same every time you load the page, rather than one that gets set dynamically after the page is already active.

It's fairly common in web development to do this for dynamic values: put a placeholder value in a div, and then edit the content dynamically as appropriate. Of course, if you just take a snapshot of the page (and even more so if you take that snapshot before the JavaScript that would have filled in the value has had a chance to run), you're not going to see the change, and you get only the default value, without the number filled in.
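To make the placeholder idea concrete, here is a minimal sketch using only the standard library. The STATIC_HTML string is a made-up stand-in for what the server might send before any script runs (it is not the real page), and extract_values and is_placeholder are hypothetical helper names. The regex is the one from the question.

```python
import html
import re

# Hypothetical snapshot of the page as downloaded, before any JavaScript has run.
# The real page will differ; this only mimics the div from the question.
STATIC_HTML = (
    '<div style="font-size: 23px; margin-top: 20px;" '
    'class="jsdfx-sentiment-present">&nbsp;</div>'
    '<script>/* in a browser, a script would fill in the div later */</script>'
)

# Same pattern as in the question, split over two lines for readability.
PATTERN = re.compile(
    r'<div style="font-size: 23px; margin-top: 20px;" '
    r'class="jsdfx-sentiment-present">(.*?)</div>'
)


def extract_values(page):
    """Return the raw inner text of every matching div in the page source."""
    return PATTERN.findall(page)


def is_placeholder(value):
    """Return True if the value is only a non-breaking space or other whitespace."""
    decoded = html.unescape(value)  # turns "&nbsp;" into "\xa0"
    return decoded.replace("\xa0", " ").strip() == ""


for value in extract_values(STATIC_HTML):
    print(repr(value), "placeholder" if is_placeholder(value) else "real content")
```

A plain download never executes the script, so the regex can only ever see the placeholder. A check like is_placeholder is a cheap way to notice that you are looking at the unfilled default rather than real data.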

Based on the tutorial you linked, you're probably using urllib. If you want to get dynamic content from an HTML page, that's probably not the best tool to use; you should look into selenium and BeautifulSoup. This StackOverflow answer goes into a lot more detail on effective solutions to this problem.
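As a rough sketch of the selenium route, something like the following might work. It assumes selenium and a Chrome driver are installed, and that the div can be located by its class. The function names, the CSS selector, and the timeout are my own choices for illustration, not something taken from the site or the linked answer.

```python
def has_real_content(text):
    """Return True if text holds something other than (non-breaking) whitespace."""
    return text.replace("\xa0", " ").strip() != ""


def fetch_rendered_value(url, css_selector, timeout=10):
    """Load url in a real browser and return the element's text once it is filled in.

    Raises selenium's TimeoutException if the text is still blank after `timeout` seconds.
    """
    # Imported inside the function so has_real_content stays usable without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)

        def filled_in(drv):
            # WebDriverWait keeps polling until this returns something truthy.
            text = drv.find_element(By.CSS_SELECTOR, css_selector).text
            return text if has_real_content(text) else False

        return WebDriverWait(driver, timeout).until(filled_in)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

You would call it along the lines of fetch_rendered_value(your_url, "div.jsdfx-sentiment-present"). Because the browser actually runs the page's JavaScript, the wait gives the script time to replace the placeholder. If you prefer to keep parsing separate, you could instead hand driver.page_source to BeautifulSoup after the wait.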
