How do I scrape data from a web page's JSON / JavaScript?

Question

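I'm new to Python and only started using it today.
My environment is Python 3.5 with some libraries on Windows 10.

I want to extract football player data from the site below into a CSV file.

Problem: I can't get the data from soup.find_all('script')[17] into the CSV format I expect. How can I extract this data the way I need it?

My code is shown below.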

from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml
soup.find_all('script')[17] # my target data is in the script at index 17

My expected output would look something like this:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik

Answer 1

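My understanding is that BeautifulSoup is better suited to parsing HTML, but you are trying to parse JavaScript embedded in the HTML.

So you have two options:

1. Write a function that takes the result of soup.find_all('script')[17], loops over it, searches the string manually for the data, and extracts it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make this easier. It may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
2. Use the json library to parse the embedded data. This is probably the better approach.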
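To make the difference between the two options concrete, here is a minimal sketch. The `payload` string is a hypothetical stand-in shaped like the data in the question, not the site's actual response. It suggests one reason json may be the safer choice: ast.literal_eval only accepts Python literals, so a JSON `null` is rejected.

```python
import ast
import json

# Hypothetical payload, shaped like the data in the question (not real site data).
payload = '[{"position": "ST", "player": null, "data": {"slug": "paulo-henrique"}}]'

# Option 2: json.loads understands JSON's null/true/false.
players = json.loads(payload)
print(players[0]["player"])  # JSON null becomes Python None

# Option 1: ast.literal_eval parses Python literals only, so `null` is rejected.
try:
    ast.literal_eval(payload)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print("literal_eval succeeded:", literal_eval_ok)
```

If the embedded string happened to contain only strings, numbers, lists, and dicts, both calls would give the same result.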

Answer 2

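As @Josiah Swain said, it's not going to be pretty. For this kind of thing JavaScript is the more natural choice, since it can understand what you have directly.

That said, Python is great, and here is your solution!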

#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

#And one more
import json

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
               headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')

# Store the script 
script = soup.find_all('script')[17]

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n') 
         if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
                       .replace('squad.register_players($.parseJSON(\'', '') \
                       .replace('\'));','')

# Extract out that useful info
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
         for p in json.loads(cleanJSON)
         if p['player'] is not None]


print('position,slot_position,slug')
for line in data:
    print(','.join(line))

When I copy and paste this into Python, the result is:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
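Since the question asks for a CSV file rather than printed text, the rows could also be passed through the standard csv module, which handles quoting if a value ever contains a comma. This is only a sketch: the `data` list here is hard-coded from the first two rows of the output above, and the helper name `write_players_csv` is made up for illustration.

```python
import csv
import io

# Sample rows copied from the output above; in practice these would come
# from the `data` list built by the answer's code.
data = [
    ["ST", "ST", "paulo-henrique"],
    ["LM", "LM", "mugdat-celik"],
]

def write_players_csv(rows, fileobj):
    """Write a header line followed by one CSV line per player row."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    writer.writerows(rows)

# Write to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
write_players_csv(data, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To write to disk instead, the same function could be called with a real file, for example `with open('players.csv', 'w', newline='') as f: write_players_csv(data, f)`, where the file name is an arbitrary choice.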

Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:

# ... All that previous code 
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line
        break  # stop at the first matching line, like [0] in the compact version

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' keeps the rows comma-separated, matching the header
        print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
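As a variation on the string replacement above, the wrapped JSON could be pulled out with a regular expression. This is a sketch under stated assumptions, not the answer's method: `SAMPLE_SCRIPT` is a made-up stand-in for the script text, `extract_players` is a hypothetical helper, and the pattern assumes the JSON sits on one line and contains no escaped single quotes.

```python
import json
import re

# Hypothetical stand-in for the text of the target <script> tag.
SAMPLE_SCRIPT = """
var squad = new Squad();
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": {}}]'));
squad.render();
"""

# Capture everything between $.parseJSON(' and the closing '));
PLAYERS_CALL = re.compile(r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);")

def extract_players(script_text):
    """Return the decoded player list, or raise if the call is not present."""
    match = PLAYERS_CALL.search(script_text)
    if match is None:
        raise ValueError("register_players call not found")
    return json.loads(match.group(1))

# Keep only the slots that actually hold a player, as in the answer's code.
rows = [
    (p["position"], p["data"]["slot_position"], p["data"]["slug"])
    for p in extract_players(SAMPLE_SCRIPT)
    if p["player"] is not None
]
print(rows)
```

Raising an explicit error when nothing matches makes a change in the page's layout easier to diagnose than an IndexError or an AttributeError on None.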

2 answers

Answer 1
0 2017-10-07 16:51:25

Answer 2
0 accepted 2017-10-08 02:50:23
