
How to scrape data from JSON/JavaScript of a web page?

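I'm new to Python and only started using it today.
My environment is Python 3.5 on Windows 10, with a few libraries installed.

I want to extract football player data from the site below into a CSV file.

Problem: I can't get the data from soup.find_all('script')[17] into the CSV format I expect. How can I extract this data the way I need it?

My code is shown below.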

from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] #My target data is in 17th

My expected output would look something like this:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

As I understand it, BeautifulSoup is best suited to parsing HTML, but you are trying to parse JavaScript embedded in the HTML.

So you have two options:

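  1. Write a function that loops over the result of .find_all('script')[17], searches the string by hand for the data, and pulls it out. You could even use ast.literal_eval(string_thats_really_a_dictionary) to make that easier. This may not be the best approach, but if you are new to Python it can be worth doing just for practice.
  2. Use the json library to parse the embedded data. This is probably the better approach.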
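
The two options can be compared on a small sample. This is only a minimal sketch: the `sample` string is made up for illustration and is not the real page content. It shows one practical difference, namely that `ast.literal_eval` accepts Python literals only, so JSON tokens such as `null` make it fail, while `json.loads` handles them.

```python
import ast
import json

# Hypothetical sample shaped like the data in the question (not real page content)
sample = '[{"position": "ST", "player": null, "data": {"slug": "paulo-henrique"}}]'

# Option 2: json.loads understands JSON tokens such as null, true and false
players = json.loads(sample)
print(players[0]['data']['slug'])

# Option 1: ast.literal_eval parses Python literals only, so JSON's null is rejected
try:
    ast.literal_eval(sample)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print(literal_eval_ok)
```

If the embedded data happened to contain only strings, numbers, lists and dictionaries, both calls would give the same result.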

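As @josiah Swain said, it's not going to be pretty. JavaScript is usually the recommended tool for this kind of thing, because it natively understands the data you have.

That said, Python is awesome, so here is your solution!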

#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

#And one more
import json

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
               headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')

# Store the script 
script = soup.find_all('script')[17]

# Extract the oneline that stores all that JSON
uncleanJson = [line for line in script.text.split('\n') 
         if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
                       .replace('squad.register_players($.parseJSON(\'', '') \
                       .replace('\'));','')

# Extract out that useful info
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
         for p in json.loads(cleanJSON)
         if p['player'] is not None]


print('position,slot_position,slug')
for line in data:
    print(','.join(line))

When I copy and paste this into Python, the result is:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
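
Since the question asks for a CSV file rather than printed text, the rows could also be written with the standard `csv` module. A minimal sketch, assuming the rows are lists of three strings like the `data` list above; the helper name `write_players_csv` and the sample rows are placeholders, not part of the original answer.

```python
import csv
import io

def write_players_csv(rows, fileobj):
    """Write a header line plus one line per player row to an open text file."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(['position', 'slot_position', 'slug'])
    writer.writerows(rows)

# Placeholder rows standing in for the `data` list built by the scraper
sample_rows = [['ST', 'ST', 'paulo-henrique'], ['LM', 'LM', 'mugdat-celik']]

# An in-memory buffer keeps the sketch self-contained
buffer = io.StringIO()
write_players_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text, end='')
```

To produce a real file, pass something like `open('players.csv', 'w', newline='')` instead of the buffer; the filename here is just an example.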

Edit: On reflection, this is not the easiest code for a beginner to read. Here is a more readable version:

# ... All that previous code 
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
     # Remove left whitespace (makes it easier to parse)
     cleaner_line = line.lstrip()
     if cleaner_line.startswith('squad.register_players($.parseJSON'):
          uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
     if player['player'] is not None:
         print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
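
Both versions depend on the data sitting in script number 17 and on one line starting with a fixed prefix, so they may break if the page layout changes. A sketch of a slightly more tolerant alternative, using a regular expression over the script text. The `script_text` sample is invented for illustration, and the pattern assumes the JSON contains no single quotes.

```python
import json
import re

# Invented sample of what the script text might look like (not real page content)
script_text = """
    var squad = new Squad();
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": {"slot_position": "LM", "slug": null}}]'));
"""

# Capture everything between parseJSON(' and the closing '));
pattern = re.compile(r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);", re.S)
match = pattern.search(script_text)
if match is None:
    raise ValueError('player JSON not found in script text')

# Keep only the slots that actually hold a player
rows = [
    [p['position'], p['data']['slot_position'], p['data']['slug']]
    for p in json.loads(match.group(1))
    if p['player'] is not None
]
print(rows)
```

With the real page, the same pattern could be tried against the text of each `script` tag in turn instead of indexing `[17]`.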

