
How to take table data from a website using bs4

I'm trying to scrape a website that contains a table using bs4, but the content I get back is less complete than what I see in the browser's inspector. I cannot find any <tr> or <td> tags in it. How can I get the full content of the site, especially the tags for the table?

Here's my code:

from bs4 import BeautifulSoup
import requests

link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)

I expect the content to contain <tr> and <td> tags because they do exist when I inspect the page, but I found none in the output.

Here's a screenshot of the page, showing the <tr> and <td> tags.

You should dump the content you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:

from bs4 import BeautifulSoup
import requests

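link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content  # raw response body, as bytes
# Open in binary mode: writing bytes to a text-mode file raises TypeError in Python 3.
with open("/tmp/content.html", "wb") as f:
    f.write(src)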
soup = BeautifulSoup(src, "html.parser")
print(soup)

Run this code, then open "/tmp/content.html" (use a different path if you're on Windows) and look at what is actually in the file. You could probably do this with your browser, but this is the surest way to know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
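
Once the content has been dumped, a quick way to confirm what is and isn't there is to count the table-related tags. The following is only a sketch: the helper name count_table_tags and both sample strings are made up for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup


def count_table_tags(html):
    """Return how many <table>, <tr> and <td> tags the given HTML contains."""
    soup = BeautifulSoup(html, "html.parser")
    return {name: len(soup.find_all(name)) for name in ("table", "tr", "td")}


# Hypothetical sample: the kind of nearly empty shell a JavaScript app often serves.
shell = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical sample: HTML that really does contain a table.
full = "<table><tr><td>a</td><td>b</td></tr></table>"

print(count_table_tags(shell))  # {'table': 0, 'tr': 0, 'td': 0}
print(count_table_tags(full))   # {'table': 1, 'tr': 1, 'td': 2}
```

With the real page you would pass in the contents of the dumped file instead of the samples. All zeros would mean the table is not part of the initial response.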

If the HTML you're looking for is not in the initial HTML you get back, then it is coming from somewhere else. The table could be built dynamically by JavaScript, or it could come from another URL, possibly one that calls an HTTP API and returns the table's HTML based on parameters passed to the API endpoint.
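
One hint that JavaScript may be involved: everything after the # in a URL is a fragment, which stays in the browser and is not sent to the server. A small sketch using the standard library's urldefrag shows what the request in the question actually asks the server for. That the site uses the fragment for client-side routing is an assumption based on the shape of the URL, not something confirmed here.

```python
from urllib.parse import urldefrag

url = "https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/"

# Split the URL into the part that is sent to the server and the fragment.
requested, fragment = urldefrag(url)

print(requested)  # https://pemilu2019.kpu.go.id/
print(fragment)   # /ppwp/hitung-suara/
```

So the server only ever sees a request for the site root, and whatever "/ppwp/hitung-suara/" means is left for code running in the browser to work out.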

You will have to reverse engineer the site's design to find where that HTML comes from. If it is generated by JavaScript, you may have no option short of scripting a real browser so that you can programmatically access the DOM in the browser's memory.
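
Scripting a browser could look roughly like the sketch below, which uses Selenium with headless Chrome. Selenium is my choice for illustration, not something this answer prescribes. The sketch assumes Selenium and Chrome are installed, and both the function name fetch_rendered_html and the "table tr" selector it waits for are guesses, not values verified against the site.

```python
def fetch_rendered_html(url, wait_css="table tr", timeout=20):
    """Load url in headless Chrome and return the DOM as HTML once wait_css appears."""
    # Imported here so the function can be defined even where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has added at least one matching element;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        # page_source reflects the current DOM, not the original response body.
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can be passed to BeautifulSoup(html, "html.parser") exactly like src in the code above, so the rest of a bs4 script would not need to change.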

I would recommend running a debugging proxy that shows you each HTTP request your browser makes. You'll be able to see the contents of each request and response, and from that you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an https endpoint, but debugging proxies usually make that pretty easy. We use Charles. The standard browser developer tools can do this too: they let you see each request and response generated by a particular page load.

If you can discover the URL that actually returns the table HTML, you can fetch that URL directly and parse the response with BS.
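
As a rough sketch of that last step, the helper below turns table HTML into a list of rows. The name table_rows and the sample markup are invented for illustration; the real response may be structured quite differently.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return each <tr> as a list of the stripped text of its <th>/<td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that have no cells at all
            rows.append(cells)
    return rows


# Made-up stand-in for the response body; with a real URL this would be
# something like requests.get(discovered_url).text
sample = """
<table>
  <tr><th>Region</th><th>Votes</th></tr>
  <tr><td> North </td><td>120</td></tr>
  <tr><td>South</td><td>95</td></tr>
</table>
"""

for row in table_rows(sample):
    print(row)
```

If the URL turns out to return JSON rather than HTML, the response's json() method in requests would take the place of BeautifulSoup here.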
