Converting a nested DataFrame with sorted unique values to a nested Dictionary in Python

Question

I am trying to take a nested DataFrame and convert it into a nested Dictionary.

Here is my original DataFrame, along with its unique values:

Input: df.head(5)

Output:

    reviewerName                                  title    reviewerRatings
0        Charles       Harry Potter Book Seven News:...                3.0
1      Katherine       Harry Potter Boxed Set, Books...                5.0
2           Lora       Harry Potter and the Sorcerer...                5.0
3           Cait       Harry Potter and the Half-Blo...                5.0
4          Diane       Harry Potter and the Order of...                5.0

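Input: len(df['reviewerName'].unique())

Output: 66130

Because many of the 66130 unique values occur more than once (for example, "Charles" occurs 3 times), I took the 66130 unique "reviewerName" values and made them the keys of a new nested DataFrame, then assigned the values using "title" and "reviewerRatings" as another key: value layer within the same nested DataFrame.

Input: df = df.set_index(['reviewerName', 'title']).sort_index()

Output: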

                                                       reviewerRatings
    reviewerName                               title
         Charles    Harry Potter Book Seven News:...               3.0
                    Harry Potter and the Half-Blo...               3.5
                    Harry Potter and the Order of...               4.0
       Katherine    Harry Potter Boxed Set, Books...               5.0
                    Harry Potter and the Half-Blo...               2.5
                    Harry Potter and the Order of...               5.0
...
230898 rows x 1 columns

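As a follow-up to my first question, I tried to convert the nested DataFrame into a nested Dictionary.

In the new nested DataFrame above, the column index shows "reviewerRatings" on row 1 (column 3) and "reviewerName" and "title" on row 2 (columns 1 and 2). When I run the df.to_dict() method below, the output has the shape {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}.

Input: df.to_dict()

Output: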

{'reviewerRatings': 
 {
  ('Charles', 'Harry Potter Book Seven News:...'): 3.0, 
  ('Charles', 'Harry Potter and the Half-Blo...'): 3.5, 
  ('Charles', 'Harry Potter and the Order of...'): 4.0,   
  ('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0, 
  ('Katherine', 'Harry Potter and the Half-Blo...'): 2.5, 
  ('Katherine', 'Harry Potter and the Order of...'): 5.0,
 ...}
}

For my desired output, however, I want {reviewerName: {title: reviewerRating}}, which is exactly how I sorted the nested DataFrame:

{'Charles': 
 {'Harry Potter Book Seven News:...': 3.0, 
  'Harry Potter and the Half-Blo...': 3.5, 
  'Harry Potter and the Order of...': 4.0},   
 'Katherine':
 {'Harry Potter Boxed Set, Books...': 5.0, 
  'Harry Potter and the Half-Blo...': 2.5, 
  'Harry Potter and the Order of...': 5.0},
...}

Is there a way to manipulate the nested DataFrame or the nested Dictionary so that running df.to_dict() gives {reviewerName: {title: reviewerRating}}?

Thanks!

Answer 1

Use groupby with a lambda function to build a dictionary for each reviewerName, then convert the resulting Series with to_dict:

print (df)
  reviewerName                             title  reviewerRatings
0      Charles  Harry Potter Book Seven News:...              3.0
1      Charles  Harry Potter Boxed Set, Books...              5.0
2      Charles  Harry Potter and the Sorcerer...              5.0
3    Katherine  Harry Potter and the Half-Blo...              5.0
4    Katherine   Harry otter and the Order of...              5.0

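# Select both columns with a list (double brackets); the tuple form
# ['title','reviewerRatings'] raises an error in recent pandas versions.
d = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
       .apply(lambda x: dict(x.values))
       .to_dict())
print(d)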

{
    'Charles': {
        'Harry Potter Book Seven News:...': 3.0,
        'Harry Potter Boxed Set, Books...': 5.0,
        'Harry Potter and the Sorcerer...': 5.0
    },
    'Katherine': {
        'Harry Potter and the Half-Blo...': 5.0,
        'Harry otter and the Order of...': 5.0
    }
}
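
If the frame has already been indexed by reviewerName and title, as in the question, the same nesting can probably be built straight from the MultiIndex. The following is only a sketch: the four-row frame is a hypothetical stand-in assembled from values shown in the question, and the names `indexed` and `nested` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({
    'reviewerName': ['Katherine', 'Charles', 'Charles', 'Katherine'],
    'title': ['Harry Potter Boxed Set, Books...',
              'Harry Potter and the Half-Blo...',
              'Harry Potter Book Seven News:...',
              'Harry Potter and the Half-Blo...'],
    'reviewerRatings': [5.0, 3.5, 3.0, 2.5],
})

# Same indexing step as in the question.
indexed = df.set_index(['reviewerName', 'title']).sort_index()

# Group the ratings by the outer index level, then drop that level so
# each group's remaining index is just the title.
nested = {
    name: ratings.droplevel('reviewerName').to_dict()
    for name, ratings in indexed['reviewerRatings'].groupby(level='reviewerName')
}
print(nested)
```

This assumes each (reviewerName, title) pair occurs only once; if a pair repeats, later ratings overwrite earlier ones in the resulting dictionary.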

Answer 2

There are a few ways. You can use groupby together with to_dict, or iterate over the rows with collections.defaultdict. Notably, the latter is not necessarily less efficient.

`groupby` + `to_dict`

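Construct a series from each groupby object and convert it to a dictionary, which gives a series of dictionary values. Finally, convert that into a dictionary of dictionaries with another to_dict call.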

res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()
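
For comparison, the same grouping can be written as a dictionary comprehension over the groups, without apply. This is a sketch of a swapped-in variant, not the answer's own code, and the small frame below is made-up sample data in the shape of the question's.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's DataFrame.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Harry Potter Book Seven News:...',
              'Harry Potter and the Half-Blo...',
              'Harry Potter Boxed Set, Books...'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})

# Iterating a groupby yields (key, sub-frame) pairs; zip pairs each
# title with its rating inside one reviewer's sub-frame.
res = {
    name: dict(zip(group['title'], group['reviewerRatings']))
    for name, group in df.groupby('reviewerName')
}
print(res)
```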

`collections.defaultdict`

Define a defaultdict of dict objects and iterate over the dataframe row by row.

from collections import defaultdict

res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings

The resulting defaultdict does not need to be converted back to a regular dict, because defaultdict is a subclass of dict.
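
A short sketch of what the subclass relationship means in practice, and of one caveat that may matter when the result is passed around: looking up a missing key on a defaultdict inserts it. The keys and the name `plain` below are illustrative only.

```python
from collections import defaultdict

res = defaultdict(dict)
res['Charles']['Harry Potter Book Seven News:...'] = 3.0

# A defaultdict passes isinstance checks for dict.
print(isinstance(res, dict))

# Caveat: merely reading a missing key creates an empty entry for it.
_ = res['Nobody']
print('Nobody' in res)

# If that side effect is unwanted, a shallow copy gives a plain dict
# whose missing-key lookups raise KeyError as usual.
plain = dict(res)
try:
    plain['Unknown']
except KeyError:
    print('KeyError on plain dict')
```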

Performance benchmarking

Benchmarks are setup- and data-dependent. You should test with your own data to see what works best.

# Python 3.6.5, Pandas 0.19.2

from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)

df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # List selection (double brackets); the tuple form fails in recent pandas.
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

%timeit jez(df)   # 33.5 ms per loop
%timeit jpp1(df)  # 17 ms per loop
%timeit jpp2(df)  # 21.1 ms per loop
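
The %timeit lines above need IPython. Outside it, a comparable measurement can be sketched with the standard-library timeit module. The function names, the data size and the reviewer names below are assumptions chosen for illustration, and the sketch first checks that both approaches agree before timing them. It prints whatever timings your machine produces; none are claimed here.

```python
import random
import timeit
from collections import defaultdict

import pandas as pd

# Hypothetical sample data: n rows, unique titles, a few reviewers.
random.seed(0)
n = 1000
names = random.choices(['Charles', 'Lora', 'Katherine'], k=n)
books = [f'Book_{i}' for i in random.sample(range(10 * n), n)]
ratings = [random.randint(0, 5) for _ in range(n)]
df = pd.DataFrame({'reviewerName': names, 'title': books,
                   'reviewerRatings': ratings})

def with_groupby(df):
    # groupby + to_dict, with the columns selected explicitly.
    return (df.groupby('reviewerName')[['title', 'reviewerRatings']]
              .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())
              .to_dict())

def with_defaultdict(df):
    # Row-wise iteration into a defaultdict of dicts.
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# Both approaches should build the same nested mapping.
assert with_groupby(df) == with_defaultdict(df)

for func in (with_groupby, with_defaultdict):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```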


Answer 1
4 2019-01-16 08:33:10

Answer 2
1 (accepted) 2019-01-16 09:53:35
