Convert nested DataFrame with sorted unique values, to a nested Dictionary in Python
I am trying to take a nested DataFrame and convert it to a nested Dictionary.
Here is my original DataFrame, with the following unique values:
input: df.head(5)
output:
reviewerName title reviewerRatings
0 Charles Harry Potter Book Seven News:... 3.0
1 Katherine Harry Potter Boxed Set, Books... 5.0
2 Lora Harry Potter and the Sorcerer... 5.0
3 Cait Harry Potter and the Half-Blo... 5.0
4 Diane Harry Potter and the Order of... 5.0
input: len(df['reviewerName'].unique())
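output: 66130
Since each of the 66130 unique values can occur more than once (e.g. "Charles" occurs 3 times), I took the 66130 unique "reviewerName" values and assigned them all as the keys of a new nested DataFrame, then assigned "title" and "reviewerRatings" as a second layer of key: value pairs within the same nested DataFrame.
input: df = df.set_index(['reviewerName', 'title']).sort_index()
output: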
reviewerRatings
reviewerName title
Charles Harry Potter Book Seven News:... 3.0
Harry Potter and the Half-Blo... 3.5
Harry Potter and the Order of... 4.0
Katherine Harry Potter Boxed Set, Books... 5.0
Harry Potter and the Half-Blo... 2.5
Harry Potter and the Order of... 5.0
...
230898 rows x 1 columns
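As a follow-up to the first question, I tried to convert the nested DataFrame into a nested Dictionary.
The column indexing of the new nested DataFrame above shows "reviewerRatings" in the 1st row (3rd column) and "reviewerName" and "title" in the 2nd row (1st and 2nd columns), and when I run the df.to_dict() method below, the output shows {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}
input: df.to_dict()
output: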
{'reviewerRatings':
{
('Charles', 'Harry Potter Book Seven News:...'): 3.0,
('Charles', 'Harry Potter and the Half-Blo...'): 3.5,
('Charles', 'Harry Potter and the Order of...'): 4.0,
('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0,
('Katherine', 'Harry Potter and the Half-Blo...'): 2.5,
('Katherine', 'Harry Potter and the Order of...'): 5.0,
...}
}
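The tuple keys come from the two-level index: to_dict() maps each column name to an {index label: value} dictionary, and the labels of a MultiIndex are tuples. A minimal sketch on a three-row stand-in frame (the frame `sample` and its shortened titles are made up for illustration, not taken from the real data):

```python
import pandas as pd

# Hypothetical stand-in for the real data; titles shortened for illustration.
sample = pd.DataFrame({
    "reviewerName": ["Charles", "Charles", "Katherine"],
    "title": ["Book A", "Book B", "Book A"],
    "reviewerRatings": [3.0, 3.5, 5.0],
})

indexed = sample.set_index(["reviewerName", "title"]).sort_index()

# to_dict() maps column name -> {index label -> value}; with a MultiIndex,
# each index label is a (reviewerName, title) tuple.
flat = indexed.to_dict()
print(flat)
```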
But for my desired output, I would like to get {reviewerName: {title: reviewerRating}},
which is exactly the way I have sorted it in the nested DataFrame.
{'Charles':
{'Harry Potter Book Seven News:...': 3.0,
'Harry Potter and the Half-Blo...': 3.5,
'Harry Potter and the Order of...': 4.0},
'Katherine':
{'Harry Potter Boxed Set, Books...': 5.0,
'Harry Potter and the Half-Blo...': 2.5,
'Harry Potter and the Order of...': 5.0},
...}
Is there a way to manipulate the nested DataFrame or the nested Dictionary so that when I run the df.to_dict() method, it shows {reviewerName: {title: reviewerRating}}?
Thanks!
Use groupby with a lambda function to build a dictionary for each reviewerName, then convert the resulting Series with to_dict:
print (df)
reviewerName title reviewerRatings
0 Charles Harry Potter Book Seven News:... 3.0
1 Charles Harry Potter Boxed Set, Books... 5.0
2 Charles Harry Potter and the Sorcerer... 5.0
3 Katherine Harry Potter and the Half-Blo... 5.0
4 Katherine Harry otter and the Order of... 5.0
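# Select the two columns with a list (double brackets); recent pandas versions reject the bare-tuple form.
d = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
       .apply(lambda x: dict(x.values))
       .to_dict())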
print (d)
{
'Charles': {
'Harry Potter Book Seven News:...': 3.0,
'Harry Potter Boxed Set, Books...': 5.0,
'Harry Potter and the Sorcerer...': 5.0
},
'Katherine': {
'Harry Potter and the Half-Blo...': 5.0,
'Harry otter and the Order of...': 5.0
}
}
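The same grouping can also be written as a plain dict comprehension over the groups, without apply. This is a sketch of an alternative rather than the answer's own method, and the frame `sample` is a made-up stand-in. It also makes one behaviour visible: if a reviewer rated the same title twice, the later row overwrites the earlier one, because dictionary keys are unique.

```python
import pandas as pd

# Hypothetical stand-in data; Charles rates "Book A" twice on purpose.
sample = pd.DataFrame({
    "reviewerName": ["Charles", "Charles", "Katherine", "Charles"],
    "title": ["Book A", "Book B", "Book A", "Book A"],
    "reviewerRatings": [3.0, 5.0, 5.0, 4.0],
})

# Iterating a groupby yields (group key, sub-frame) pairs; zip pairs each
# title with its rating, and dict() keeps the last rating per title.
nested = {
    name: dict(zip(group["title"], group["reviewerRatings"].tolist()))
    for name, group in sample.groupby("reviewerName")
}
print(nested)
```

If duplicates matter, one option is to aggregate them first (for example with a mean per reviewer and title) before building the dictionary.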
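There are a few ways. You can use groupby with to_dict, or iterate over the rows with collections.defaultdict. Notably, the latter is not necessarily less efficient.
groupby + to_dict
Construct a series from each groupby object and convert it to a dictionary, which gives a series of dictionary values. Finally, convert this to a dictionary of dictionaries via another to_dict call.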
res = df.groupby('reviewerName')\
.apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
.to_dict()
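If the frame has already been indexed by (reviewerName, title), as in the question, a similar result can be built by grouping on the first index level. This is a sketch under that assumption; `sample`, `indexed` and `ratings` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the question's MultiIndexed frame.
sample = pd.DataFrame({
    "reviewerName": ["Charles", "Charles", "Katherine"],
    "title": ["Book A", "Book B", "Book A"],
    "reviewerRatings": [3.0, 3.5, 5.0],
})
indexed = sample.set_index(["reviewerName", "title"]).sort_index()

ratings = indexed["reviewerRatings"]  # Series with a two-level index

# Group on the outer level, drop that level inside each group so only
# the title remains as index, then turn each group into {title: rating}.
res = {
    name: group.droplevel("reviewerName").to_dict()
    for name, group in ratings.groupby(level="reviewerName")
}
print(res)
```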
collections.defaultdict
Define a defaultdict of dict objects and iterate over the dataframe row by row.
from collections import defaultdict

res = defaultdict(dict)
# itertuples yields one namedtuple per row; columns are read as attributes
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings
The resulting defaultdict does not need to be converted back to a regular dict, since defaultdict is a subclass of dict.
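One practical difference is still worth knowing: looking up a missing key on a defaultdict silently inserts an empty entry, while a regular dict raises KeyError. A small sketch (the reviewer name "Nobody" is made up) showing both the subclass relationship and an optional conversion with dict():

```python
from collections import defaultdict

res = defaultdict(dict)
res["Charles"]["Book A"] = 3.0

# defaultdict subclasses dict, so it passes anywhere a dict is accepted.
is_dict = isinstance(res, dict)

# Optional: shallow copy into a regular dict to get strict lookups.
plain = dict(res)

# Reading a missing key on the defaultdict creates an empty inner dict...
_ = res["Nobody"]
inserted = "Nobody" in res

# ...while the regular dict raises KeyError for the same lookup.
try:
    plain["Nobody"]
    raised = False
except KeyError:
    raised = True

print(is_dict, inserted, raised)
```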
Benchmarking is setup and data dependent. You should test with your own data to see what works best.
# Python 3.6.5, Pandas 0.19.2
from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)
df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # columns selected with a list (double brackets) so this also runs on newer pandas
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd
%timeit jez(df) # 33.5 ms per loop
%timeit jpp1(df) # 17 ms per loop
%timeit jpp2(df) # 21.1 ms per loop
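%timeit is an IPython magic, so the three lines above only work in IPython or Jupyter. In a plain Python script, the standard-library timeit module can stand in. The following is a rough sketch, not a reproduction of the benchmark: the function names `via_groupby` and `via_defaultdict`, the frame `df_small`, and the smaller row count are all assumptions chosen to keep the run short.

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical, smaller data set than the benchmark above.
rng = np.random.default_rng(0)
n = 1_000
df_small = pd.DataFrame({
    "reviewerName": rng.choice(["Charles", "Lora", "Katherine"], n),
    "title": [f"Book_{i}" for i in range(n)],  # unique titles
    "reviewerRatings": rng.integers(0, 6, n),
})

def via_groupby(df):
    # groupby + apply + to_dict, with the columns selected by a list
    return (df.groupby("reviewerName")[["title", "reviewerRatings"]]
              .apply(lambda x: dict(x.values))
              .to_dict())

def via_defaultdict(df):
    # row-wise iteration into a defaultdict of dicts
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# Sanity check first: both approaches should build the same mapping.
same = via_groupby(df_small) == dict(via_defaultdict(df_small))

# Total seconds for 5 calls of each function; numbers vary by machine.
t_groupby = timeit.timeit(lambda: via_groupby(df_small), number=5)
t_default = timeit.timeit(lambda: via_defaultdict(df_small), number=5)
print(same, t_groupby, t_default)
```

As noted above, the ranking can change with the number of rows and the number of distinct reviewers, so it is worth rerunning this on data shaped like your own.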