How to scrape data from JSON/JavaScript of a web page?
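I am new to Python and only started using it today.
My system environment is Windows 10 with Python 3.5 and some libraries installed.
I want to extract football player data from the site requested in the code below into a CSV file.
Problem: I cannot get the data from soup.find_all('script')[17] into the CSV format I expect. How can I extract this data the way I need?
My code is shown below.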
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] # My target data is in the script tag at index 17
My expected output would look similar to this:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
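So, as I understand it, BeautifulSoup is better suited to parsing HTML, but you are trying to parse JavaScript nested inside the HTML.
So you have two options
As @josiah Swain said, it is not going to be pretty. For this kind of thing it is more advisable to use JS, since it can understand what you have.
That said, Python is awesome, and here is your solution!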
#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
#And one more
import json
# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')
# Store the script
script = soup.find_all('script')[17]
# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]
# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
    .replace('squad.register_players($.parseJSON(\'', '') \
    .replace('\'));','')
# Extract out that useful info
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]
print('position,slot_position,slug')
for line in data:
    print(','.join(line))
The result of copying and pasting this into Python is:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
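The question asks for a CSV file, while the code above only prints the rows. A minimal sketch of writing them with Python's csv module follows. It assumes the JSON has the same shape the answer relies on (position, player, and data with slot_position and slug). The sample_json string, its placeholder player values, and the write_players_csv name are made up for illustration and are not taken from the real page.

```python
import csv
import io
import json

# Hypothetical sample shaped like the JSON parsed above; the "player"
# values are placeholders and the real page data may hold more fields.
sample_json = '''[
  {"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}},
  {"position": "LM", "player": 2, "data": {"slot_position": "LM", "slug": "mugdat-celik"}},
  {"position": "SUB", "player": null, "data": null}
]'''


def write_players_csv(json_text, out_file):
    """Write position, slot_position and slug for each filled slot as CSV."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['position', 'slot_position', 'slug'])
    for p in json.loads(json_text):
        if p['player'] is not None:  # skip empty slots, as the answer does
            writer.writerow([p['position'],
                             p['data']['slot_position'],
                             p['data']['slug']])


# Write into an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_players_csv(sample_json, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a real file, the same function could be called as write_players_csv(cleanJSON, f) inside with open('players.csv', 'w', newline='') as f:, where players.csv is just an example file name.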
Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:
# ... All that previous code
script = soup.find_all('script')[17]
allScriptLines = script.text.split('\n')
uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line
# Strip the JavaScript wrapper so only the JSON text remains
cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')
print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # Join with commas so each row matches the CSV header
        print(','.join([player['position'], player['data']['slot_position'], player['data']['slug']]))
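Both versions depend on the data sitting in the script tag at index 17, which may change if the page layout changes. As a hedged alternative, the call can be located with a regular expression over the page text instead. This is only a sketch: sample_html is an invented fragment, extract_players is a hypothetical helper name, and the pattern assumes the JSON string contains no JavaScript backslash escapes and no '));' sequence of its own.

```python
import json
import re

# Hypothetical page fragment for illustration; the real page is far larger.
sample_html = """
<script>
  var squad = new Squad();
  squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}]'));
</script>
"""

# Capture everything between parseJSON(' and the closing '));
PATTERN = re.compile(
    r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);",
    re.DOTALL,
)


def extract_players(html):
    """Return the parsed player list embedded in the page's JavaScript."""
    match = PATTERN.search(html)
    if match is None:
        raise ValueError('register_players call not found in page')
    return json.loads(match.group(1))


players = extract_players(sample_html)
print(players[0]['data']['slug'])
```

With the code from the answer, urlopen(req).read() returns bytes, so the page would need decoding first, for example extract_players(webpage.decode('utf-8')), assuming the page is served as UTF-8.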