
How to get a string from a web page's source that's not there in BeautifulSoup

I'm a beginner at web scraping and I'm attempting to scrape this website. When I try to get the information in the following td element, a text field is missing, even though it is there on the website when I look at its source.

Below is the code returned by the BeautifulSoup parser. On the web page, however, there is a string right after the script tag closes. I would like to be able to scrape this string. How would I do that?

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGL3Ywx5YwR1YwR2AN==")))</script></td>

Here is what is on the web page:

<td style="text-align:left; font-weight:bold;"><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script>140.205.222.3</td>

My question is: why does this appear in the web page's source but not in the BeautifulSoup text, and how would I go about obtaining this information?

You don't see the text because BeautifulSoup doesn't run JavaScript; it only parses the HTML text it is given. The string is written into the cell by document.write when a browser executes the script, so it exists in the rendered page but not in the raw HTML. In general you would need Selenium or another headless browser to execute the JavaScript on that page and read the result. This particular function is simple enough to emulate in Python, though (with the help of "Short rot13 function - Python"):

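from bs4 import BeautifulSoup
import base64
import re

# The <td> cell as BeautifulSoup receives it: only the script, no rendered text.
data = '''
<td style="text-align:left; font-weight:bold;">
    <script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script>
</td>'''

# ROT13 translation table: letters shift by 13 places; digits and '=' are left alone.
rot13 = str.maketrans(
    "ABCDEFGHIJKLMabcdefghijklmNOPQRSTUVWXYZnopqrstuvwxyz",
    "NOPQRSTUVWXYZnopqrstuvwxyzABCDEFGHIJKLMabcdefghijklm")

soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser needs the lxml package installed

# Pull the quoted argument of str_rot13(...) out of the script source.
encoded_string = re.search(r'str_rot13\("(.*?)"\)', str(soup.find('script')))[1]

# Undo the obfuscation in the order the page applies it: ROT13 first, then Base64.
decoded_string = base64.b64decode(encoded_string.translate(rot13)).decode('utf-8')

print(decoded_string)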

This prints the decoded string:

140.205.222.3
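
If the page has many such cells, the same idea can be applied to every script at once. The following is only a sketch using the standard library: the helper name decode_obfuscated and the rows sample are made up for illustration (the sample reuses the two encoded strings from the question), and it assumes every obfuscated value follows the exact str_rot13("...") pattern shown above. It uses the built-in rot_13 codec instead of a hand-written translation table.

```python
import base64
import codecs
import re

# Matches the quoted argument of every str_rot13("...") call in the markup.
PATTERN = re.compile(r'str_rot13\("([^"]+)"\)')

def decode_obfuscated(html):
    """Return the decoded text for each str_rot13("...") call found in html.

    Hypothetical helper: it mirrors Base64.decode(str_rot13(x)) from the page,
    i.e. ROT13 first, then Base64.
    """
    return [
        base64.b64decode(codecs.decode(encoded, "rot_13")).decode("utf-8")
        for encoded in PATTERN.findall(html)
    ]

# Sample markup built from the two cells quoted in the question.
rows = '''
<td><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGL3Ywx5YwR1YwR2AN==")))</script></td>
<td><script type="text/javascript">document.write(Base64.decode(str_rot13("ZGDjYwVjAF4lZwVhZj==")))</script></td>
'''

for address in decode_obfuscated(rows):
    print(address)
# prints:
# 167.99.15.164
# 140.205.222.3
```

Because this works directly on the response text, it can be fed the raw HTML string without parsing it first. If the site changes its obfuscation scheme, the regular expression will simply find nothing and the function returns an empty list.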
