Pandas: Convert a JSON column containing multiple records into multiple dataframe rows

Question

I have a dataframe with two columns: countries and year. The countries column is JSON in the following format:

[{'continent': 'europe',
  'country': 'Yugoslavia',
  'income': None,
  'life_exp': None,
  'population': 4687422},
 {'continent': 'asia',
  'country': 'United Korea (former)',
  'income': None,
  'life_exp': None,
  'population': 13740000},
 {'continent': 'asia',
  'country': 'Tokelau',
  'income': None,
  'life_exp': None,
  'population': 1009},
  ...

How can I convert this dataframe into:

continent | country | income | life_exp | population | year
----------+---------+--------+----------+------------+-------
europe    | Yugos   | None   | None     | 4600000    | 1800
asia      | Korea   | None   | None     | 13000000   | 1800
asia      | Tokelau | None   | None     | 1009       | 1800

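That is, split the JSON column into rows with their corresponding columns, and add the year that belongs to each row?

I used json_normalize() on the column, which gives me the columns I need, but I don't know how to add the year at the end.

Edit: this is my original dataframe: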

import pandas as pd

df = pd.read_json('data.json')
print(df.head())

                                           countries  year
0  [{'continent': 'europe', 'country': 'Yugoslavi...  1800
1  [{'continent': 'europe', 'country': 'Svalbard'...  1801
2  [{'continent': 'europe', 'country': 'Svalbard'...  1802
3  [{'continent': 'asia', 'country': 'Wallis et F...  1803
4  [{'continent': 'asia', 'country': 'Wallis et F...  1804

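The countries column is JSON containing multiple records, and the year applies to all of them. How can I convert this into a dataframe in which each row holds one record and its corresponding year?

I know that pd.DataFrame(df.countries[0]) produces a dataframe of all the countries in the first row, but I don't know how to add the year as a new column. I think a loop would do it, but I suspect there must be a more efficient way.

Edit: this loop produces the result I need, but I think it is very inefficient: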

new_df = pd.DataFrame(columns=['continent', 'country', 'income', 'life_exp', 'population', 'year'])

for i in range(len(old_df)):
    temp_df = pd.DataFrame(old_df.countries[i])
    temp_df['year'] = old_df.year[i]
    # DataFrame.append was removed in pandas 2.0; pd.concat([new_df, temp_df]) is the equivalent there
    new_df = new_df.append(temp_df)

There should be a better way, right?

Answer 1

Use .join together with pd.json_normalize

Example:

df = pd.DataFrame(data)
df = df.join(pd.json_normalize(df.pop('countries')))
print(df)

Edit based on the comments

df = pd.DataFrame(data).explode('countries')
df = df.join(pd.json_normalize(df.pop('countries')))
print(df)

Output:

   year continent                country income life_exp  population
0  1800    europe             Yugoslavia   None     None     4687422
1  1801      asia  United Korea (former)   None     None    13740000
2  1802      asia                Tokelau   None     None        1009
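A related route, not part of the answer above, is to let pd.json_normalize do the expansion itself through its record_path and meta parameters. The following is only a minimal sketch: the records list and its values are a hypothetical sample shaped like the question's data, and the name flat is chosen for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data (values are illustrative).
records = [
    {"year": 1800, "countries": [
        {"continent": "europe", "country": "Yugoslavia", "income": None,
         "life_exp": None, "population": 4687422},
        {"continent": "asia", "country": "Tokelau", "income": None,
         "life_exp": None, "population": 1009},
    ]},
    {"year": 1801, "countries": [
        {"continent": "europe", "country": "Yugoslavia", "income": None,
         "life_exp": None, "population": 4687422},
    ]},
]

# record_path names the nested list to expand into rows;
# meta names the top-level fields to repeat on every expanded row.
flat = pd.json_normalize(records, record_path="countries", meta=["year"])
print(flat)
```

If the data is already loaded into a dataframe like the one in the question, df.to_dict('records') should give a list of this shape.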

Answer 2

You can try this with explode:

df=df.explode('countries')
#we add to each dictionary the respective value of year with key 'year'
df['countries']=[{**dc,**{'year':y}} for dc,y in zip(df['countries'],df['year'])]
pd.DataFrame(df['countries'].tolist())

Example:

j = [{'continent': 'europe',
      'country': 'Yugoslavia',
      'income': None,
      'life_exp': None,
      'population': 4687422},
     {'continent': 'asia',
      'country': 'United Korea (former)',
      'income': None,
      'life_exp': None,
      'population': 13740000}]
df=pd.DataFrame({'countries':[j,j],'year':[1800,1900]})
print(df)

df=df.explode('countries')
print(df)

#Here we add the key 'year' with the respective year row value to each dictionary
df['countries']=[{**dc,**{'year':y}} for dc,y in zip(df['countries'],df['year'])]
print(df['countries'])

finaldf=pd.DataFrame(df['countries'].tolist())
print(finaldf)

Output:

original df:
                                           countries  year
0  [{'continent': 'europe', 'country': 'Yugoslavi...  1800
1  [{'continent': 'europe', 'country': 'Yugoslavi...  1900


    

df(after explode):
                                           countries  year
0  {'continent': 'europe', 'country': 'Yugoslavia...  1800
0  {'continent': 'asia', 'country': 'United Korea...  1800
1  {'continent': 'europe', 'country': 'Yugoslavia...  1900
1  {'continent': 'asia', 'country': 'United Korea...  1900


df.countries(with year added):
0    {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1800}
0    {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1800}
1    {'continent': 'europe', 'country': 'Yugoslavia', 'income': None, 'life_exp': None, 'population': 4687422, 'year': 1900}
1    {'continent': 'asia', 'country': 'United Korea (former)', 'income': None, 'life_exp': None, 'population': 13740000, 'year': 1900}
Name: countries, dtype: object

finaldf
  continent                country income life_exp  population  year
0    europe             Yugoslavia   None     None     4687422  1800
1      asia  United Korea (former)   None     None    13740000  1800
2    europe             Yugoslavia   None     None     4687422  1900
3      asia  United Korea (former)   None     None    13740000  1900

Answer 3

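You can use the apply method to pull the respective labels out of the country column. Since you also have a key named country, handle it outside the for loop. It would look something like this:

# assumes df has a column named 'country' holding one dict per row
attribute = ['continent', 'income', 'life_exp', 'population']

for attr in attribute:
    df[attr] = df.country.apply(lambda x: x[attr])
# done last, because this overwrites the source column
df['country'] = df.country.apply(lambda x: x['country'])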

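The benefit here is that the explicit loop runs once per attribute rather than once per item (apply still visits every row internally).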
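The snippet in this answer expects one dict per row, whereas the question's countries column holds a list of dicts per row. A minimal sketch of how the same idea might be applied to that layout follows; the sample data is hypothetical, and exploding first and then dropping the source column are assumptions added here, not steps from the answer.

```python
import pandas as pd

# Hypothetical two-year sample in the question's layout.
df = pd.DataFrame({
    "countries": [
        [{"continent": "europe", "country": "Yugoslavia", "income": None,
          "life_exp": None, "population": 4687422},
         {"continent": "asia", "country": "Tokelau", "income": None,
          "life_exp": None, "population": 1009}],
        [{"continent": "europe", "country": "Yugoslavia", "income": None,
          "life_exp": None, "population": 4687422}],
    ],
    "year": [1800, 1801],
})

# Turn each list element into its own row so every cell holds a single dict.
df = df.explode("countries", ignore_index=True)

attributes = ["continent", "country", "income", "life_exp", "population"]
for attr in attributes:
    # Pull one key out of every dict into a column of its own.
    df[attr] = df["countries"].apply(lambda d: d[attr])

# The dict column is no longer needed once its keys are columns.
df = df.drop(columns="countries")
print(df)
```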

Answer 4

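The ignore_index=True argument should be added to the explode call so that the join that follows does not get messed up. explode repeats the original index labels, and join aligns on the index.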

df = pd.DataFrame(data).explode('countries', ignore_index=True)
df = df.join(pd.json_normalize(df.pop('countries')))
print(df)
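To see why the index matters, a small sketch can compare the index labels with and without ignore_index. The data dict below is a hypothetical stand-in with a single key per record, and the names exploded, normalized, clean and result are chosen for illustration.

```python
import pandas as pd

# Hypothetical minimal data: two years with two records each.
data = {
    "countries": [[{"country": "A"}, {"country": "B"}],
                  [{"country": "C"}, {"country": "D"}]],
    "year": [1800, 1801],
}

# Without ignore_index, explode repeats the original row labels.
exploded = pd.DataFrame(data).explode("countries")
print(exploded.index.tolist())    # [0, 0, 1, 1]

# Normalizing a plain list of dicts yields a fresh 0..n-1 index,
# so an index-based join against `exploded` would not line up row by row.
normalized = pd.json_normalize(exploded["countries"].tolist())
print(normalized.index.tolist())  # [0, 1, 2, 3]

# With ignore_index=True both sides share the same 0..n-1 labels.
clean = pd.DataFrame(data).explode("countries", ignore_index=True)
result = clean.join(pd.json_normalize(clean.pop("countries")))
print(result)
```

The ignore_index parameter of explode is available from pandas 1.1.0; on older versions, calling reset_index(drop=True) after explode gives the same labels.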


Answer 1
0 2020-07-30 05:45:23

Answer 2
0 accepted 2020-07-30 05:55:35

Answer 3
0 2020-07-30 05:56:22

Answer 4
0 2022-01-29 14:49:59
