Web scraping request with Python returns garbled output

Question

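I'm very new to Python and thought I'd try my hand at a practical application.

I'm trying to put together a basic web price scraper using the requests library. I picked this page: https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy

This is the basic structure I'm using: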

import requests

page = requests.get("my url from above")
page

page.content

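But for some reason, the HTML printed via .content or .text looks very wrong. Instead of the HTML structure, I see a lot of carriage returns, and data is definitely missing.

I tried parsing it with Beautiful Soup (html.parser, html5lib, etc.), which cuts out even more data.

Is this page encoded in a way that blocks scraping, or am I doing something wrong?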

Answer 1

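Problem:

The problem you're facing is that the page's content is produced by JavaScript embedded in the HTML, so data appears to be missing from the HTML you download. requests_html, a library by kennethreitz for requesting and working with HTML, is a very nice fit here.

Sample code: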

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy')

# Print the response body line by line (each line is a bytes object)
for line in r.iter_lines():
    print(line)

Sample output

Because of the comment size limit I can't post the full HTML; here is a snippet of the HTML printed by the code above:

b'<!doctype html>'
b'<html>'
b'<head>'
b'<meta charset="utf-8">'
b'<title>Self Storage Units at 15555 West Dixie Highway, North Miami Beach, FL 33162 | US Storage Centers</title>'
b'<base href="/">'
b'<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />'
b'<meta name="description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:type" content="website" />'
b'<meta property="og:locale" content="en_US" />'
b'<meta property="og:site_name" content="US Storage Centers" />'
b'<meta property="og:title" content="Self Storage North Miami Beach" />'
b'<meta property="og:url" content="https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy" />'
b'<meta property="og:description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:image" content="https://www.usstoragecenters.com/www/images/ussc_facility_photos/168/2017-06-15_00-37-08_Self%20Storage%20Building%20Exterior%20Front%20-%20North%20Miami%20Beach%20West%20Dixie%20IMG_5237%208.jpg" />'
b'<script type="application/ld+json">'
b' {'
b' "@context": "http://schema.org",'
b' "@type": "WebPage"'
b' ,"breadcrumb": {'
b' "@context": "http://schema.org",'
b' "@type": "BreadcrumbList",'
b' "itemListElement": [{'
b' "@type": "ListItem",'
b' "name": "US Storage Centers",'
b' "url": "https://www.usstoragecenters.com/",'
b' "position": 0'
b' }, {'
b' "@type": "ListItem",'
b' "name": "Storage Units",'
b' "url": "https://www.usstoragecenters.com/storage-units",'
b' "position": 1'
b' }, {'
b' "@type": "ListItem",'
b' "name": "FL",'
b' "url": "https://www.usstoragecenters.com/storage-units/fl",'
b' "position": 2'
b' }, {'
...... truncated .....

Answer 2

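Call print(page.content). In Python 3, where page.content is a bytes object, print page.text instead.

print renders control characters such as carriage returns, newlines and tabs as real line breaks and spacing, instead of showing them as escape sequences.

A test: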

s = """
     Hey
    \r\r\r\r\r Look
    \t\t\t\t\t\t Here"""
print(s)

Output:

 Hey





 Look
                             Here
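
The same effect can be reproduced without any network access. The following is a minimal sketch, assuming Python 3; the `raw` bytes are a made-up stand-in for `page.content`, not the real response from the site:

```python
# Hypothetical stand-in for page.content (bytes); not the real response body.
raw = b"<html>\r\n\t<body>Hey</body>\r\n</html>"

# repr() is what an interactive session shows for a bare `page.content`:
# control characters appear as escape sequences such as \r, \n and \t.
as_repr = repr(raw)

# Decoding gives a str; print() then renders the control characters
# as real line breaks and tabs instead of escape sequences.
text = raw.decode("utf-8")

print(as_repr)
print(text)
```

So a wall of `\r\n` in the interactive output does not by itself mean the response is corrupted; it may only be the repr of the bytes.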

Answer 3

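What you see in your browser's developer tools is not what is in the HTML returned by the web server. If you view the page source in your browser, you'll see that all of the page's content is generated by JavaScript from JSON contained in a <script> tag.

This makes your job much easier, because you don't have to worry about parsing the HTML and only need to extract the data from the JSON: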

import json
from bs4 import BeautifulSoup

...

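# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page.text, 'html.parser')

# Find the `script` tag with no `src` and 'window.jsonData' in its text
# (the `text and` guard avoids a TypeError for a tag that has no string)
script = soup.find('script', src=None, text=lambda text: text and 'window.jsonData' in text).get_text()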


# The JSON is part of script, so just remove the extra stuff
script = script.strip().replace('window.jsonData = ', '').rstrip(';')

# Now parse it
data = json.loads(script)
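
For trying the idea offline, here is a rough sketch that swaps Beautiful Soup for the standard library's `html.parser`, so it runs without a network call or extra packages. `SAMPLE_HTML`, `ScriptCollector` and `extract_json_data` are hypothetical names introduced for illustration, and the JSON in the sample is invented; the real page's data will look different:

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; the JSON here is invented for illustration.
SAMPLE_HTML = """<!doctype html>
<html><head>
<script src="/app.js"></script>
<script>
  window.jsonData = {"units": [{"size": "5x5", "price": 25}]};
</script>
</head><body></body></html>"""


class ScriptCollector(HTMLParser):
    """Collect the text of inline <script> tags (those without a src)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script:
            self.scripts[-1] += data


def extract_json_data(html, marker="window.jsonData = "):
    """Return the parsed JSON assigned after `marker` in an inline script."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if marker in script:
            # Drop the assignment prefix and the trailing semicolon
            payload = script.strip().replace(marker, "", 1).rstrip(";")
            return json.loads(payload)
    raise ValueError("no inline script containing the marker was found")


data = extract_json_data(SAMPLE_HTML)
print(data["units"][0]["price"])
```

With a real response you would pass `page.text` instead of `SAMPLE_HTML`. This assumes the script contains only the single `window.jsonData = ...;` assignment, as in the answer above.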

3 solutions

Solution 1
1 2018-04-04 21:00:09

Solution 2
0 2018-04-04 20:06:38

Solution 3
0 2018-04-04 20:11:44
