Convert nested DataFrame with sorted unique values, to a nested Dictionary in Python

I am trying to take a nested DataFrame and convert it into a nested Dictionary.

Here is my original DataFrame with the following unique values:

input: df.head(5)

output:

    reviewerName                                  title    reviewerRatings
0        Charles       Harry Potter Book Seven News:...                3.0
1      Katherine       Harry Potter Boxed Set, Books...                5.0
2           Lora       Harry Potter and the Sorcerer...                5.0
3           Cait       Harry Potter and the Half-Blo...                5.0
4          Diane       Harry Potter and the Order of...                5.0

input: len(df['reviewerName'].unique())

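output: 66130

Since each of the 66130 unique values has multiple rows (e.g. "Charles" occurs 3 times), I took the 66130 unique "reviewerName" values and assigned them as the keys of a new nested DataFrame, then used "title" and "reviewerRatings" as another key: value layer within the same nested DataFrame.

input: df = df.set_index(['reviewerName', 'title']).sort_index()

output: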

                                                       reviewerRatings
    reviewerName                               title
         Charles    Harry Potter Book Seven News:...               3.0
                    Harry Potter and the Half-Blo...               3.5
                    Harry Potter and the Order of...               4.0
       Katherine    Harry Potter Boxed Set, Books...               5.0
                    Harry Potter and the Half-Blo...               2.5
                    Harry Potter and the Order of...               5.0
...
230898 rows x 1 columns

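As a follow-up to the first question, I tried converting the nested DataFrame to a nested Dictionary.

The column header of the new nested DataFrame above shows "reviewerRatings" in the 1st row (3rd column) and "reviewerName" and "title" in the 2nd row (1st and 2nd columns). When I run the df.to_dict() method below, the output has the form {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}

input: df.to_dict()

output: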

{'reviewerRatings': 
 {
  ('Charles', 'Harry Potter Book Seven News:...'): 3.0, 
  ('Charles', 'Harry Potter and the Half-Blo...'): 3.5, 
  ('Charles', 'Harry Potter and the Order of...'): 4.0,   
  ('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0, 
  ('Katherine', 'Harry Potter and the Half-Blo...'): 2.5, 
  ('Katherine', 'Harry Potter and the Order of...'): 5.0,
 ...}
}

But for my desired output below, I am looking for {reviewerName: {title: reviewerRating}}, which is exactly how I have sorted the nested DataFrame.

{'Charles': 
 {'Harry Potter Book Seven News:...': 3.0, 
  'Harry Potter and the Half-Blo...': 3.5, 
  'Harry Potter and the Order of...': 4.0},   
 'Katherine':
 {'Harry Potter Boxed Set, Books...': 5.0, 
  'Harry Potter and the Half-Blo...': 2.5, 
  'Harry Potter and the Order of...': 5.0},
...}

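Is there a way to manipulate the nested DataFrame or the nested Dictionary so that when I run the df.to_dict() method, it displays {reviewerName: {title: reviewerRating}}?

Thank you!

Use groupby with a lambda function to build a dictionary for each reviewerName, then convert the resulting Series with to_dict: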

print (df)
  reviewerName                             title  reviewerRatings
0      Charles  Harry Potter Book Seven News:...              3.0
1      Charles  Harry Potter Boxed Set, Books...              5.0
2      Charles  Harry Potter and the Sorcerer...              5.0
3    Katherine  Harry Potter and the Half-Blo...              5.0
4    Katherine  Harry Potter and the Order of...              5.0

# select both columns with a list (double brackets); a bare tuple of
# column names after groupby is rejected by recent pandas versions
d = (df.groupby('reviewerName')[['title','reviewerRatings']]
       .apply(lambda x: dict(x.values))
       .to_dict())
print (d)

{
    'Charles': {
        'Harry Potter Book Seven News:...': 3.0,
        'Harry Potter Boxed Set, Books...': 5.0,
        'Harry Potter and the Sorcerer...': 5.0
    },
    'Katherine': {
        'Harry Potter and the Half-Blo...': 5.0,
        'Harry Potter and the Order of...': 5.0
    }
}
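
The snippet above starts from the flat DataFrame. If the frame has already been indexed by ['reviewerName', 'title'] as in the question, one possible variation is to group on the outer index level instead. This is only a sketch under that assumption: the names flat, indexed and nested are made up for illustration, and the sample rows are a hand-typed subset of the question's output.

```python
import pandas as pd

# Hypothetical sample data, shaped like the question's DataFrame.
flat = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine', 'Katherine'],
    'title': ['Harry Potter Book Seven News:...',
              'Harry Potter and the Half-Blo...',
              'Harry Potter Boxed Set, Books...',
              'Harry Potter and the Half-Blo...'],
    'reviewerRatings': [3.0, 3.5, 5.0, 2.5],
})

# Same indexing step as in the question.
indexed = flat.set_index(['reviewerName', 'title']).sort_index()

# Group the ratings Series on the outer index level, drop that level so
# only 'title' remains as the index, and turn each piece into a dict.
nested = {
    name: ratings.droplevel('reviewerName').to_dict()
    for name, ratings in indexed['reviewerRatings'].groupby(level='reviewerName')
}
print(nested)
```

Calling indexed.reset_index() first and then applying the groupby shown above should give the same result.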

There are a few approaches. You can use groupby together with to_dict, or iterate over the rows with collections.defaultdict. Notably, the latter is not necessarily less efficient.

groupby + to_dict

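Construct a series from each groupby object and convert it to a dictionary, which gives a series of dictionary values. Finally, convert this to a dictionary of dictionaries via another to_dict call.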

res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()
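
To make the two to_dict steps concrete, here is a small self-contained run on a few hand-typed rows. It is a sketch rather than a reference implementation: the sample data is assumed, and selecting the two value columns explicitly before apply is an addition meant to keep the call quiet on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the question's data.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Harry Potter Book Seven News:...',
              'Harry Potter and the Half-Blo...',
              'Harry Potter Boxed Set, Books...'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})

# Inner to_dict: one {title: rating} dict per reviewer.
per_reviewer = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
                  .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict()))

# Outer to_dict: the Series of dicts becomes {reviewerName: {title: rating}}.
res = per_reviewer.to_dict()
print(res)
```

Because dictionary keys are unique, a reviewer who rated the same title more than once would keep only the last rating.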

collections.defaultdict

Define a defaultdict of dict objects and iterate over the dataframe row by row.

from collections import defaultdict

res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings

The resulting defaultdict does not need to be converted back to a regular dict, since defaultdict is a subclass of dict.
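
A quick way to check the subclass relationship, and to see the one case where converting might still be worth it, is sketched below. The key 'Nobody' is a made-up example.

```python
from collections import defaultdict

res = defaultdict(dict)
res['Charles']['Harry Potter Book Seven News:...'] = 3.0

# defaultdict is a dict subclass, so it passes isinstance checks.
is_dict = isinstance(res, dict)

# Shallow conversion to a plain dict (the inner dicts are already plain).
plain = dict(res)

# Looking up a missing key on the defaultdict silently creates an entry...
_ = res['Nobody']

# ...whereas the plain dict raises KeyError instead.
try:
    plain['Nobody']
    plain_raises = False
except KeyError:
    plain_raises = True

print(is_dict, 'Nobody' in res, plain_raises)
```

If downstream code relies on KeyError for unknown reviewers, converting with dict(res) may be the safer choice.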

Performance benchmarking

Benchmarking is set-up and data dependent. You should test with your own data to see what works best.

# Python 3.6.5, Pandas 0.19.2

from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)

df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    return df.groupby('reviewerName')[['title','reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

%timeit jez(df)   # 33.5 ms per loop
%timeit jpp1(df)  # 17 ms per loop
%timeit jpp2(df)  # 21.1 ms per loop
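
%timeit is an IPython magic, so the three lines above only run in IPython or Jupyter. In a plain Python script, one option is the standard library timeit module. The sketch below also checks that the three approaches agree before timing them. The function names by_values, by_set_index and by_defaultdict are renamed stand-ins for the functions above, and the tiny hand-made frame means the timings are not comparable to the figures quoted.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Small hypothetical frame; swap in your own data for meaningful timings.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book_1', 'Book_2', 'Book_3'],
    'reviewerRatings': [3.0, 5.0, 2.5],
})

def by_values(df):
    return (df.groupby('reviewerName')[['title', 'reviewerRatings']]
              .apply(lambda x: dict(x.values))
              .to_dict())

def by_set_index(df):
    return (df.groupby('reviewerName')[['title', 'reviewerRatings']]
              .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())
              .to_dict())

def by_defaultdict(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

funcs = (by_values, by_set_index, by_defaultdict)

# Sanity check: all three should build the same nested mapping.
expected = by_defaultdict(df)
same = all(f(df) == expected for f in funcs)

# Total seconds for 20 calls of each function.
timings = {f.__name__: timeit.timeit(lambda f=f: f(df), number=20) for f in funcs}
print(same, timings)
```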
