Web Scraping via Python Requests returning gibberish
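I'm very new to Python and figured I'd try some practical applications.
I'm trying to put together a basic web price scraper using the requests library. I picked this page: https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy
This is the basic structure I'm using: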
import requests
page = requests.get("my url from above")
page
page.content
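For some reason, though, the HTML printed via .content or .text looks very wrong. Instead of the HTML structure I see a lot of carriage returns, and data is definitely missing.
I tried parsing it with Beautiful Soup (html.parser, html5lib, etc.), which cut out even more data.
Is this encoded in a way that prevents scraping, or am I doing something wrong?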
Answers:
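The problem you're facing is that the page's data is embedded in JavaScript inside the HTML, which is why data appears to be missing from the HTML you get back. requests_html, a very good library by kennethreitz, is designed for this case. Sample code: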
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy')
# Print the raw response line by line (each line is a bytes object)
for line in r.iter_lines():
    print(line)
Sample output
Because of the comment size limit I can't post the full HTML; this is a fragment of the HTML printed above:
b'<!doctype html>'
b'<html>'
b'<head>'
b'<meta charset="utf-8">'
b'<title>Self Storage Units at 15555 West Dixie Highway, North Miami Beach, FL 33162 | US Storage Centers</title>'
b'<base href="/">'
b'<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />'
b'<meta name="description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:type" content="website" />'
b'<meta property="og:locale" content="en_US" />'
b'<meta property="og:site_name" content="US Storage Centers" />'
b'<meta property="og:title" content="Self Storage North Miami Beach" />'
b'<meta property="og:url" content="https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy" />'
b'<meta property="og:description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:image" content="https://www.usstoragecenters.com/www/images/ussc_facility_photos/168/2017-06-15_00-37-08_Self%20Storage%20Building%20Exterior%20Front%20-%20North%20Miami%20Beach%20West%20Dixie%20IMG_5237%208.jpg" />'
b'<script type="application/ld+json">'
b' {'
b' "@context": "http://schema.org",'
b' "@type": "WebPage"'
b' ,"breadcrumb": {'
b' "@context": "http://schema.org",'
b' "@type": "BreadcrumbList",'
b' "itemListElement": [{'
b' "@type": "ListItem",'
b' "name": "US Storage Centers",'
b' "url": "https://www.usstoragecenters.com/",'
b' "position": 0'
b' }, {'
b' "@type": "ListItem",'
b' "name": "Storage Units",'
b' "url": "https://www.usstoragecenters.com/storage-units",'
b' "position": 1'
b' }, {'
b' "@type": "ListItem",'
b' "name": "FL",'
b' "url": "https://www.usstoragecenters.com/storage-units/fl",'
b' "position": 2'
b' }, {'
**...... truncated .....**
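Call print(page.content).
Printing renders the carriage returns and similar characters (newlines, tabs, etc.) instead of showing them as literal escape sequences. Note that in Python 3 page.content is a bytes object, so print(page.text) is what gives this effect there.
A test: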
s = """
Hey
\r\r\r\r\r Look
\t\t\t\t\t\t Here"""
print(s)
Output:
Hey
Look
Here
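To make the difference concrete, here is a minimal sketch that needs no network access. The `raw` bytes are a made-up stand-in for `page.content`, not the real response; the point is only that the repr of a bytes object shows `\r`, `\n` and `\t` as escape sequences, while the decoded text prints them as real whitespace.

```python
# Hypothetical sample bytes standing in for page.content (not the real response).
raw = b"<html>\r\n\t<body>Hey</body>\r\n</html>"

# What the interactive prompt shows when you evaluate `page.content`:
as_repr = repr(raw)

# Roughly what `page.text` holds: the bytes decoded into a str.
as_text = raw.decode("utf-8")

print(as_repr)  # escape sequences appear literally, e.g. \r\n\t
print(as_text)  # the same characters are rendered as line breaks and a tab
```

So a screen full of `\r\n` is usually just the bytes repr and does not by itself mean the response is damaged.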
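What you see in the browser's developer tools does not match the HTML returned by the web server. View the page source in your web browser and you will see that all of the page's content is generated by JavaScript from JSON contained in a <script> tag.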
This makes your job easier, because you don't have to worry about parsing the HTML and only need to extract the data from the JSON:
import json
from bs4 import BeautifulSoup
...
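# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(page.text, 'html.parser')
# Find the `script` tag with no `src` and 'window.jsonData' in its text
# (`text and ...` guards against empty script tags, whose string is None)
script = soup.find('script', src=None, string=lambda text: text and 'window.jsonData' in text).get_text()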
# The JSON is part of script, so just remove the extra stuff
script = script.strip().replace('window.jsonData = ', '').rstrip(';')
# Now parse it
data = json.loads(script)
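As an alternative sketch using only the standard library, the same extraction can be done with `re` and `json.JSONDecoder.raw_decode`. This swaps Beautiful Soup for a regex. The `sample_html` string, the `extract_json_data` helper, and the JSON keys (`facility`, `units`, `price`) are hypothetical placeholders for illustration; the real page's JSON structure is not shown in this thread and will differ.

```python
import json
import re

# Hypothetical HTML standing in for page.text; the keys below are invented.
sample_html = """<!doctype html>
<html><head>
<script src="/app.js"></script>
<script>
    window.jsonData = {"facility": {"city": "North Miami Beach", "units": [{"size": "5x5", "price": 49.0}]}};
</script>
</head><body></body></html>"""


def extract_json_data(html, marker="window.jsonData"):
    """Return the value assigned to `marker` in an inline script, or None."""
    # Match "window.jsonData =" including any whitespace around the "=".
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ";" and the closing tag).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


data = extract_json_data(sample_html)
print(data["facility"]["units"][0]["price"])
```

Because `raw_decode` stops at the end of the first JSON value, there is no need to strip the trailing semicolon by hand. With real data you would pass `page.text` instead of `sample_html` and inspect the resulting dictionary to find where the prices live.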