
How to extract data from HTML

I was trying to use beautifulsoup4 with Python to scrape a certain website. However, when I tried to view the content from the URL, it only gave me the header part and not the body part that I want to use.

import requests

URL = "url"
URL_page = requests.get(URL)
print(URL_page.text)

This gives me:

<!DOCTYPE html>
<html>
 <head>
"Contents of Header"
 </head>
  <body>
   <div id='root'></div>
  </body>
</html>

There should be content inside the body tag, but it shows nothing. The original HTML of this web page looks like:

<html xmlns:wb="http://open.weibo.com/wb" style>
 <head>...</head>                     ← ONLY GIVES ME THIS
 <body data-loaded="true">...</body>  ← I NEED THIS PART
</html>

It's hard to provide a working answer without a working URL, but your question does provide some clues.

For one, you say you receive this in the response from a GET:

<body>

But then you see this in a web browser:

<body data-loaded="true">

This suggests that the page has JavaScript code running that continues loading and constructing the page after the initial page has been loaded.

There's no way to get around that with requests or bs4 or anything of the sort. You could check which request made after the initial page load actually carries the content (it may be another piece of HTML, some JSON, etc.) and use that request to get the content instead. If you want to try that, open the developer tools in a good browser and watch the Network tab while the page loads; you'll see all the requests, and one of them may contain the content you're after.
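For example, if the Network tab shows the page fetching its content from a JSON endpoint, you can call that endpoint directly with requests. A minimal sketch (the endpoint URL below is purely hypothetical, substitute whatever request you actually find):

import requests

# Hypothetical endpoint spotted in the Network tab; replace with the real one
api_url = "https://example.com/api/content"
response = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()

payload = response.json()   # many such endpoints return JSON
print(payload)              # inspect the structure to find the fields you need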

But if you need the HTML as rendered by the script, you can try driving a JavaScript-capable browser from Python, like Chrome through the Selenium Chrome webdriver:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                       # launches a real Chrome instance
driver.get("https://your.url/here")               # Chrome executes the page's JavaScript
elem = driver.find_element(By.TAG_NAME, 'body')   # Selenium 4 replacement for find_element_by_tag_name
print(elem.text)

Note that you'll need to install Selenium and get a copy of the appropriate driver, like chromedriver.exe, and make it available to your virtual environment (or put it on your PATH).
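Because the content is injected by JavaScript, the body may still be empty the moment get() returns. A minimal sketch, assuming the real content ends up inside the #root div shown in the question, is to wait for it explicitly before reading the text:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://your.url/here")

# Wait up to 10 seconds for the script to render something inside #root
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#root *"))
)

print(driver.find_element(By.TAG_NAME, "body").text)
driver.quit()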

I think you should use a 'User-Agent' header. You can try it:

from bs4 import BeautifulSoup
import requests

# Some sites return a different (or empty) page to clients without a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0'}
url = "https://www.pixiv.net/en/users/14792128"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())
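As a quick check, continuing from the snippet above (and assuming the content would land inside the #root div from the question), you can test whether the User-Agent header alone was enough; if #root is still empty, the page is rendered client-side and you'll need the Selenium or AJAX approaches instead:

root = soup.select_one('#root')
if root is None or not root.get_text(strip=True):
    print("#root is still empty, so the content is rendered by JavaScript")
else:
    print(root.get_text(strip=True)[:200])   # first 200 characters of the content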

No idea what exactly you're after or what you want as output, but you can access the JSON response from the AJAX call:

import pandas as pd
import requests

# The ajax endpoint the page itself calls to populate the profile
url = 'https://www.pixiv.net/ajax/user/14792128/profile/all?lang=en'

jsonData = requests.get(url).json()
data = jsonData['body']['mangaSeries']   # one section of the profile data

df = pd.DataFrame(data)
print(df)
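If mangaSeries isn't the section you need, a quick way to explore the rest of the response (continuing from the snippet above) is to list the keys under 'body' and preview the DataFrame:

# See which other sections the ajax response contains
print(jsonData['body'].keys())

# Preview the first few rows of the chosen section
print(df.head())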
