How to extract data from HTML

I was trying to use beautifulsoup4 with Python to scrape a certain website. However, when I print the contents fetched from the URL, I only get the header part and not the body part that I want to use.

import requests

URL = "url"
URL_page = requests.get(URL)
print(URL_page.text)

This gives me:

<!DOCTYPE html>
<html>
 <head>
"Contents of Header"
 </head>
  <body>
   <div id='root'></div>
  </body>
</html>

There should be content inside the body tag, but it shows nothing. The original HTML of this web page, as seen in the browser, looks like this:

<html xmlns:wb="http://open.weibo.com/wb" style> 
 ▶<head...</head>                     ← ONLY GIVES ME THIS
 ▶<body data-loaded="true">...</body> ← I NEED THIS PART
</html>

It's hard to provide a working answer without a working URL, but your question does provide some clues.

For one, you say you receive this in the response from a GET:

<body>

But then you see this in a web browser:

<body data-loaded="true">

This suggests that the page runs JavaScript code that continues loading and constructing the page after the initial HTML has been loaded.
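One rough way to confirm this from Python is to check whether the fetched HTML has any visible text inside the body at all. The sketch below is only a heuristic and uses the standard library's html.parser so that it runs without extra packages; the names BodyTextCollector and looks_js_rendered are made up for illustration, and the sample string is modelled on the response shown in the question.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collects visible text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a reader would actually see
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html):
    """Heuristic: True when the <body> carries no visible text at all."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return not parser.chunks


# Modelled on the response in the question: an empty mount point and nothing else
static_html = """<!DOCTYPE html>
<html>
 <head><title>Example</title></head>
  <body>
   <div id='root'></div>
  </body>
</html>"""

print(looks_js_rendered(static_html))
```

If this reports an empty body for URL_page.text, the content is most likely being filled in by scripts after the initial load, and parsing the raw response will not help.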

requests and bs4 do not execute JavaScript, so on their own they cannot produce the rendered page. You could check which request after the initial page load carries the actual content (it may be another piece of HTML, some JSON, etc.) and use that request to get the content instead. To try that, open the developer tools in your browser and watch the Network tab while the page is loading; you'll see all the requests, and one of them may contain the content you're after.

But if you need the HTML after rendering, as produced by the script, you can try using a JavaScript-capable browser from Python, like Chrome driven through the Selenium Chrome webdriver:

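from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://your.url/here")
# Selenium 4 removed find_element_by_tag_name; use find_element with a By locator
elem = driver.find_element(By.TAG_NAME, 'body')
print(elem.text)
driver.quit()  # close the browser when done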

Note that you'll need to install Selenium and get a copy of the appropriate driver, such as chromedriver.exe, and make both available in your virtual environment.

I think you should send a 'User-Agent' header. You can try this:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent; some sites reject the default one from requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0'}
url = "https://www.pixiv.net/en/users/14792128"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')  # the 'lxml' parser requires the lxml package
print(soup.prettify())

I have no idea what exactly you are after or want as output, but you can access the JSON response from the AJAX endpoint:

import pandas as pd
import requests

url='https://www.pixiv.net/ajax/user/14792128/profile/all?lang=en'

jsonData = requests.get(url).json()
data = jsonData['body']['mangaSeries']

df = pd.DataFrame(data)
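Before building the DataFrame, it may be worth checking that the response has the expected shape, since an endpoint like this can return an error object instead of data. The following is a minimal sketch that works on an already-parsed payload; the helper name extract_manga_series is hypothetical, the "error" and "message" fields are assumptions about the response format, and the sample payload is invented for illustration rather than taken from the real site.

```python
def extract_manga_series(payload):
    """Return the list under body.mangaSeries, or raise ValueError if absent."""
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    if payload.get("error"):  # assumed error flag, not confirmed by the source
        raise ValueError(payload.get("message") or "endpoint reported an error")
    body = payload.get("body")
    if not isinstance(body, dict) or "mangaSeries" not in body:
        raise ValueError("response has no body.mangaSeries field")
    return body["mangaSeries"]


# Hypothetical payload standing in for requests.get(url).json()
sample = {
    "error": False,
    "message": "",
    "body": {"mangaSeries": [{"id": "1", "title": "Example series"}]},
}

series = extract_manga_series(sample)
print(len(series), series[0]["title"])
```

The returned list can then be passed to pd.DataFrame in the same way as data above.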
