Nested/recursive groupby count in Pandas DataFrame
I have a DataFrame with over 2 million rows that looks like this:
+-------------------+--------------+--------+----------------------------+-------------+
| PartitionKey | RowKey | Type | Path | Name |
+-------------------+--------------+--------+----------------------------+-------------+
| / | /People | Folder | /People | People |
| /People | /index1.xlsx | File | /People/index1.xlsx | index1.xlsx |
| /People | /index2.xlsx | File | /People/index2.xlsx | index2.xlsx |
| /People | /index3.xlsx | File | /People/index3.xlsx | index3.xlsx |
| /People | /Employees | Folder | /People/Employees | Employees |
| /People/Employees | /cv1.pdf | File | /People/Employees/cv1.pdf | cv1.pdf |
| /People/Employees | /cv2.pdf | File | /People/Employees/cv2.pdf | cv2.pdf |
| /People/Employees | /cv3.pdf | File | /People/Employees/cv3.pdf | cv3.pdf |
| / | /Buildings | Folder | /Buildings | Buildings |
| /Buildings | /index1.xlsx | File | /Buildings/index1.xlsx | index1.xlsx |
| /Buildings | /index2.xlsx | File | /Buildings/index2.xlsx | index2.xlsx |
| /Buildings | /index3.xlsx | File | /Buildings/index3.xlsx | index3.xlsx |
| /Buildings | /Rooms | Folder | /Buildings/Rooms | Rooms |
| /Buildings/Rooms | /room1.pdf | File | /Buildings/Rooms/room1.pdf | room1.pdf |
| /Buildings/Rooms | /room2.pdf | File | /Buildings/Rooms/room2.pdf | room2.pdf |
| /Buildings/Rooms | /room3.pdf | File | /Buildings/Rooms/room3.pdf | room3.pdf |
+-------------------+--------------+--------+----------------------------+-------------+
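I want to add two new columns: DirectFileCount and RecursiveFileCount.
Based on the folder-to-file relationship (a folder's Path is the PartitionKey of the files inside it), these should give the number of files directly inside the folder, and the number of files inside the folder and all of its subfolders, recursively.
The DataFrame should end up looking like this: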
+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+
| PartitionKey | RowKey | Type | Path | Name | DirectFileCount | RecursiveFileCount |
+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+
| / | /People | Folder | /People | People | 3 | 6 |
| /People | /index1.xlsx | File | /People/index1.xlsx | index1.xlsx | 0 | 0 |
| /People | /index2.xlsx | File | /People/index2.xlsx | index2.xlsx | 0 | 0 |
| /People | /index3.xlsx | File | /People/index3.xlsx | index3.xlsx | 0 | 0 |
| /People | /Employees | Folder | /People/Employees | Employees | 3 | 3 |
| /People/Employees | /cv1.pdf | File | /People/Employees/cv1.pdf | cv1.pdf | 0 | 0 |
| /People/Employees | /cv2.pdf | File | /People/Employees/cv2.pdf | cv2.pdf | 0 | 0 |
| /People/Employees | /cv3.pdf | File | /People/Employees/cv3.pdf | cv3.pdf | 0 | 0 |
+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+
I have something that works for the direct count:
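# Note: '.tag' and 'path_lower' appear to be the raw names of the Type and Path
# columns shown in the tables above (with lowercase values 'file' / 'folder').
df_count = df.groupby(['.tag', 'PartitionKey']).size().reset_index(name='counts')
df_file_count = df_count[df_count['.tag'] == 'file'].set_index('PartitionKey')

def direct_count(row):
    # Only folders get a count: look up how many files name this folder as their parent.
    if row['.tag'] == 'folder':
        try:
            return df_file_count.loc[row['path_lower']].counts
        except KeyError:
            # No file has this folder as its PartitionKey.
            pass
    return 0

df['DirectFileCount'] = df.apply(direct_count, axis=1)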
The code above takes care of DirectFileCount and finishes in under 2 minutes.
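The per-row lookup can probably be replaced by a single vectorized lookup. The following is a minimal sketch, not a drop-in replacement: it assumes the column names from the tables (`Type`, `Path`, `PartitionKey`) rather than `.tag` / `path_lower`, and it builds a small hypothetical sample frame so that it runs on its own.

```python
import pandas as pd

# Hypothetical sample: the first eight rows of the table in the question.
df = pd.DataFrame({
    "PartitionKey": ["/", "/People", "/People", "/People", "/People",
                     "/People/Employees", "/People/Employees", "/People/Employees"],
    "Type": ["Folder", "File", "File", "File", "Folder", "File", "File", "File"],
    "Path": ["/People", "/People/index1.xlsx", "/People/index2.xlsx",
             "/People/index3.xlsx", "/People/Employees",
             "/People/Employees/cv1.pdf", "/People/Employees/cv2.pdf",
             "/People/Employees/cv3.pdf"],
})

is_folder = df["Type"].eq("Folder")
is_file = df["Type"].eq("File")

# Number of files per parent folder, indexed by the parent's path.
files_per_folder = df.loc[is_file, "PartitionKey"].value_counts()

# Look up every row's own Path in that table; paths with no files become 0.
direct = df["Path"].map(files_per_folder).fillna(0).astype(int)

# File rows always get 0, as in the desired output table.
df["DirectFileCount"] = direct.where(is_folder, 0)

print(df[["Path", "Type", "DirectFileCount"]])
```

The counting happens once in `value_counts`, and `Series.map` does the lookup for all rows at once, so no Python function is called per row.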
Update, 16 October 2019
I got RecursiveFileCount working, but it took 1 hour 52 minutes. The code is below:
# Sum of DirectFileCount over folder rows, keyed by each folder's parent (PartitionKey).
dfc = df[df['Type'] == 'Folder'][['PartitionKey', 'DirectFileCount']].set_index('PartitionKey').groupby('PartitionKey').sum()

def recursive_count(row):
    count = 0
    if row['Type'] == 'Folder':
        # Prefix match against the whole index of dfc, repeated for every folder row.
        count = dfc[dfc.index.str.startswith(row['Path'])]['DirectFileCount'].sum()
    return count

df['RecursiveFileCount'] = df.apply(recursive_count, axis=1)
This now works and produces the result I need. However, it is quite slow on 2.7m rows, so I am hoping someone has ideas for improving performance.
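One way to avoid a prefix scan per folder row is to count files once per parent folder and then credit each count to that folder and all of its ancestors. The sketch below is one possible approach under stated assumptions, not a tested solution for the real data: it assumes paths are `/`-separated with a leading slash, uses the column names from the tables, and rebuilds the 16-row sample (the `rows` list and `path_counts` dict are illustrative names).

```python
import pandas as pd

# Hypothetical sample: the 16 rows from the first table as (PartitionKey, RowKey, Type).
rows = [("/", "/People", "Folder")]
rows += [("/People", f"/index{i}.xlsx", "File") for i in (1, 2, 3)]
rows += [("/People", "/Employees", "Folder")]
rows += [("/People/Employees", f"/cv{i}.pdf", "File") for i in (1, 2, 3)]
rows += [("/", "/Buildings", "Folder")]
rows += [("/Buildings", f"/index{i}.xlsx", "File") for i in (1, 2, 3)]
rows += [("/Buildings", "/Rooms", "Folder")]
rows += [("/Buildings/Rooms", f"/room{i}.pdf", "File") for i in (1, 2, 3)]

df = pd.DataFrame(rows, columns=["PartitionKey", "RowKey", "Type"])
df["Path"] = df["PartitionKey"].str.rstrip("/") + df["RowKey"]

is_file = df["Type"].eq("File")

# Files directly inside each folder, keyed by the folder's path.
direct = df.loc[is_file, "PartitionKey"].value_counts()

# Credit each folder's direct count to the folder itself and to every ancestor,
# e.g. "/People/Employees" contributes to "/People/Employees" and "/People".
path_counts = {}
for folder, n_files in direct.items():
    parts = folder.strip("/").split("/")
    for depth in range(1, len(parts) + 1):
        ancestor = "/" + "/".join(parts[:depth])
        path_counts[ancestor] = path_counts.get(ancestor, 0) + int(n_files)

# Map the totals back onto the rows; files and empty folders get 0.
recursive = df["Path"].map(path_counts).fillna(0).astype(int)
df["RecursiveFileCount"] = recursive.where(~is_file, 0)

print(df.loc[~is_file, ["Path", "RecursiveFileCount"]])
```

The Python loop runs once per distinct folder that contains files, times its depth, rather than once per row, and it does not depend on the order of the rows.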
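You can use groupby to group each folder together with its associated files. You can use transform to count the files. Then use Series.where to assign the counts only to the rows where Type == Folder. Finally, use Series.cumsum to compute the cumulative sum: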
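# Each Folder row starts a new group; group size minus the folder row itself = number of files.
count_direct_file = df.groupby(df['Type'].eq('Folder').cumsum())['Type'].transform('size') - 1
# Keep the count on Folder rows only (File rows become NaN).
df['DirectFileCount'] = count_direct_file.where(df['Type'].eq('Folder'))
# Cumulative sum from the bottom up, so each folder also includes the folders listed after it.
df['RecursiveFileCount'] = df['DirectFileCount'].iloc[::-1].cumsum().iloc[::-1]
print(df)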
PartitionKey RowKey Type Path \
0 / /People Folder /People
1 /People /index1.xlsx File /People/index1.xlsx
2 /People /index2.xlsx File /People/index2.xlsx
3 /People /index3.xlsx File /People/index3.xlsx
4 /People /Employees Folder /People/Employees
5 /People/Employees /cv1.pdf File /People/Employees/cv1.pdf
6 /People/Employees /cv2.pdf File /People/Employees/cv2.pdf
7 /People/Employees /cv3.pdf File /People/Employees/cv3.pdf
Name DirectFileCount RecursiveFileCount
0 People 3.0 6.0
1 index1.xlsx NaN NaN
2 index2.xlsx NaN NaN
3 index3.xlsx NaN NaN
4 Employees 3.0 3.0
5 cv1.pdf NaN NaN
6 cv2.pdf NaN NaN
7 cv3.pdf NaN NaN
Details:
df['Type'].eq('Folder').cumsum()
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
Name: Type, dtype: int64
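To see how the grouping key and the reversed cumulative sum behave beyond the eight rows printed above, the sketch below applies the same three steps to the full 16-row sample from the question. It assumes the rows stay in the depth-first order shown there; the `rows` list is an illustrative reconstruction of that table.

```python
import pandas as pd

# Hypothetical reconstruction of the 16-row table, in the original row order.
rows = [("/", "/People", "Folder")]
rows += [("/People", f"/index{i}.xlsx", "File") for i in (1, 2, 3)]
rows += [("/People", "/Employees", "Folder")]
rows += [("/People/Employees", f"/cv{i}.pdf", "File") for i in (1, 2, 3)]
rows += [("/", "/Buildings", "Folder")]
rows += [("/Buildings", f"/index{i}.xlsx", "File") for i in (1, 2, 3)]
rows += [("/Buildings", "/Rooms", "Folder")]
rows += [("/Buildings/Rooms", f"/room{i}.pdf", "File") for i in (1, 2, 3)]
df = pd.DataFrame(rows, columns=["PartitionKey", "RowKey", "Type"])

is_folder = df["Type"].eq("Folder")

# Same steps as in the answer: group key, size minus one, keep on folders only.
group_key = is_folder.cumsum()
direct = (df.groupby(group_key)["Type"].transform("size") - 1).where(is_folder)

# Reversed running total over the whole frame.
recursive = direct.iloc[::-1].cumsum().iloc[::-1]

print(direct[is_folder].tolist())     # direct counts on the four folder rows
print(recursive[is_folder].tolist())  # running totals on the four folder rows
```

On this sample the direct counts are 3 for every folder, but the running total continues across top-level folders: it comes out as 12 for /People, 9 for /People/Employees, 6 for /Buildings and 3 for /Buildings/Rooms. The reversed cumsum therefore matches the desired recursive count only within a single chain of nested folders, and both steps rely on each folder's files immediately following its row.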