
How to parse hundreds of websites with the same JSON data type using Python?

I am new to Python and currently working on a project that requires me to extract data from hundreds of websites that serve JSON data. I managed to scrape one website, but I don't know how to scrape all of them at the same time. Below is my code.

import requests
import pandas as pd


url = "https://ws-public.interpol.int/notices/v1/red?ageMin=45&ageMax=60&arrestWarrantCountryId=US&resultPerPage=20&page=1"

response = requests.get(url)
response.raise_for_status()

data = response.json()['_embedded']['notices']
results = []  # don't name this "list": that shadows the built-in type

for item in data:
    # keep only the fields of interest
    result = {
        "forename": item["forename"],
        "date_of_birth": item["date_of_birth"],
        "nationalities": item["nationalities"],
        "name": item["name"],
    }
    results.append(result)

# print(results)

df = pd.DataFrame(results)
df.to_excel("test.xlsx")  # requires openpyxl to be installed

Examples of the other sites: https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=5 , https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=1

I think this will work for you. You have to add the URLs manually or specify some logic to obtain them. I also noticed that the JSON response contains the URL of the next page, so you could collect the first page of each query and follow those links to crawl the remaining pages, unless you can get all results in a single JSON response. I don't have Excel installed either, so I used CSV instead, but it should work the same way:

import requests
import pandas as pd

urls = [
    'https://ws-public.interpol.int/notices/v1/red?ageMin=45&ageMax=60&arrestWarrantCountryId=US&resultPerPage=20&page=1',
    'https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=5',
    'https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=1',
    # add more urls here, you could also use a file to store these
    # you could also write some logic to get the urls but you'd need to specify that logic
]

def get_data(url):
    data = requests.get(url).json()['_embedded']['notices']
    # filter the returned fields
    return [{k: v for k, v in row.items()
             if k in ['forename', 'date_of_birth', 'nationalities', 'name']}
            for row in data]

df = pd.DataFrame()
# the data from each url in a dataframe instead of in dictionary for speed
for url in urls:
    print(f'Processing {url}')
    df = df.append(get_data(url))

# output to csv or whatever (I don't have excel installed so I did csv)
df.to_csv('data.csv')
# df.to_excel('data.xlsx')

Output (data.csv):

,forename,date_of_birth,nationalities,name
0,CARLOS LEOPOLDO,1971/10/31,['US'],ALVAREZ
1,MOHAMED ABDIAZIZ,1974/01/01,"['SO', 'ET']",KEROW
2,SEUXIS PAUCIS,1966/07/30,['CO'],HERNANDEZ-SOLARTE
3,JOHN G.,1966/10/20,"['PH', 'US']",PANALIGAN
4,SOFYAN ISKANDAR,1968/04/04,['ID'],NUGROHO
5,SOLOMON ANTHONY,1965/02/05,['TZ'],BANDIHO
6,ROLAND,1969/07/21,"['US', 'DE']",AGUILAR
7,FERNANDO,1972/07/25,['MX'],RODRIGUEZ
8,RAUL,1966/12/08,['US'],ORTEGA
9,DANIEL,1962/08/30,['US'],LEIJA
10,FRANCISCO,1961/10/23,['EC'],MARTINEZ
11,HORACIO CARLOS,1963/09/10,"['US', 'MX']",TERAN
12,FREDIS RENTERIA,1965/07/07,['CO'],TRUJILLO
13,JUAN EXEQUIEL,1968/08/18,['AR'],HEINZ
14,JIMMY JULIUS,1971/05/03,"['IL', 'US']",KAROW
15,JOHN,1959/10/28,['LY'],LOWRY
16,FIDEL,1959/07/25,['CO'],CASTRO MURILLO
17,EUDES,1968/12/20,['CO'],OJEDA OVANDO
18,BEJARNI,1968/07/12,"['US', 'NI']",RIVAS
19,DAVID,1973/12/02,['GT'],ALDANA
20,SLOBODAN,1952/10/02,['BA'],RIS
21,ALEN,1978/05/27,['BA'],DEMIROVIC
22,DRAGAN,1987/02/09,['ME'],GAJIC
23,JOZO,1968/03/03,"['HR', 'BA']",BRICO
24,ZHIYIN,1962/07/01,['CN'],XU
25,NOVAK,1955/04/10,['BA'],DUKIC
26,NEBOJSA,1973/01/08,['BA'],MILANOVIC
27,MURADIF,1960/04/12,['BA'],HAMZABEGOVIC
28,BOSKO,1940/11/25,"['RS', 'BA']",LUKIC
29,RATKO,1967/05/16,['BA'],SAMAC
30,BOGDAN,1973/04/05,['BA'],BOZIC
31,ZELJKO,1965/10/21,"['BA', 'HR']",RODIN
32,SASA,1973/04/19,['RS'],DUNOVIC
33,OBRAD,1964/03/10,['BA'],OZEGOVIC
34,SENAD,1981/03/01,['BA'],KAJTEZOVIC
35,MLADEN,1973/04/29,"['HR', 'BA']",MARKOVIC
36,PERO,1972/01/29,"['BA', 'HR']",MAJIC
37,MARCO,1968/04/12,"['BA', 'HR']",VIDOVIC
38,MIRSAD,1964/07/27,['HR'],SMAJIC
39,NIJAZ,1961/11/20,,SMAJIC
40,GOJKO,1959/10/08,['BA'],BORJAN
41,DUSAN,1954/06/25,"['RS', 'BA']",SPASOJEVIC
42,MIRSAD,1991/04/20,['BA'],CERIMOVIC
43,GORAN,1962/01/24,['BA'],TESIC
44,IZET,1970/09/18,"['RS', 'BA']",REDZOVIC
45,DRAGAN,1973/09/30,['BA'],STOJIC
46,MILOJKO,1962/05/19,"['BA', 'RS']",KOVACEVIC
47,DRAGAN,1971/11/07,"['RS', 'BA']",MARJANOVIC
48,ALEKSANDAR,1979/09/22,"['AT', 'BA']",RUZIC
49,MIRKO,1992/04/29,['BA'],ATELJEVIC
50,SLAVOJKA,1967/01/13,['BA'],MARINKOVIC
51,SLADAN,1968/03/09,"['BA', 'RS']",TASIC
52,ESED,1963/01/12,['BA'],ABDAGIC
53,DRAGOMIR,1954/01/29,"['RS', 'BA']",KEZUNOVIC
54,NEDZAD,1961/01/01,['BA'],KAHRIMANOVIC
55,NEVEN,1980/10/08,"['BA', 'SI']",STANIC
56,VISNJA,1972/04/12,"['RS', 'BA']",ACIMOVIC
57,MLADEN,1974/08/05,"['HR', 'DE', 'BA']",DZIDIC
58,IVICA,1964/12/23,"['BA', 'HR']",KOLOBARA
59,ZORAN,1963/11/08,"['BA', 'RS']",ADAMOVIC
