
Nested/recursive groupby count in Pandas DataFrame

I have a DataFrame with more than 2 million rows that looks like this:

+-------------------+--------------+--------+----------------------------+-------------+
|   PartitionKey    |    RowKey    |  Type  |            Path            |    Name     |
+-------------------+--------------+--------+----------------------------+-------------+
| /                 | /People      | Folder | /People                    | People      |
| /People           | /index1.xlsx | File   | /People/index1.xlsx        | index1.xlsx |
| /People           | /index2.xlsx | File   | /People/index2.xlsx        | index2.xlsx |
| /People           | /index3.xlsx | File   | /People/index3.xlsx        | index3.xlsx |
| /People           | /Employees   | Folder | /People/Employees          | Employees   |
| /People/Employees | /cv1.pdf     | File   | /People/Employees/cv1.pdf  | cv1.pdf     |
| /People/Employees | /cv2.pdf     | File   | /People/Employees/cv2.pdf  | cv2.pdf     |
| /People/Employees | /cv3.pdf     | File   | /People/Employees/cv3.pdf  | cv3.pdf     |
| /                 | /Buildings   | Folder | /Buildings                 | Buildings   |
| /Buildings        | /index1.xlsx | File   | /Buildings/index1.xlsx     | index1.xlsx |
| /Buildings        | /index2.xlsx | File   | /Buildings/index2.xlsx     | index2.xlsx |
| /Buildings        | /index3.xlsx | File   | /Buildings/index3.xlsx     | index3.xlsx |
| /Buildings        | /Rooms       | Folder | /Buildings/Rooms           | Rooms       |
| /Buildings/Rooms  | /room1.pdf   | File   | /Buildings/Rooms/room1.pdf | room1.pdf   |
| /Buildings/Rooms  | /room2.pdf   | File   | /Buildings/Rooms/room2.pdf | room2.pdf   |
| /Buildings/Rooms  | /room3.pdf   | File   | /Buildings/Rooms/room3.pdf | room3.pdf   |
+-------------------+--------------+--------+----------------------------+-------------+

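I want to add two new columns: DirectFileCount and RecursiveFileCount.

Based on the folder-to-file relationship Path --> PartitionKey, these should give the number of files directly inside the folder itself, and the number of files inside the folder plus all of its subfolders, counted recursively.

The DataFrame should end up looking like this: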

+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+
|   PartitionKey    |    RowKey    |  Type  |           Path            |    Name     | DirectFileCount | RecursiveFileCount |
+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+
| /                 | /People      | Folder | /People                   | People      |               3 |                  6 |
| /People           | /index1.xlsx | File   | /People/index1.xlsx       | index1.xlsx |               0 |                  0 |
| /People           | /index2.xlsx | File   | /People/index2.xlsx       | index2.xlsx |               0 |                  0 |
| /People           | /index3.xlsx | File   | /People/index3.xlsx       | index3.xlsx |               0 |                  0 |
| /People           | /Employees   | Folder | /People/Employees         | Employees   |               3 |                  3 |
| /People/Employees | /cv1.pdf     | File   | /People/Employees/cv1.pdf | cv1.pdf     |               0 |                  0 |
| /People/Employees | /cv2.pdf     | File   | /People/Employees/cv2.pdf | cv2.pdf     |               0 |                  0 |
| /People/Employees | /cv3.pdf     | File   | /People/Employees/cv3.pdf | cv3.pdf     |               0 |                  0 |
+-------------------+--------------+--------+---------------------------+-------------+-----------------+--------------------+

I have something that does the direct count:

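# Count rows per (Type, PartitionKey), then keep only the file counts, indexed by parent folder
df_count = df.groupby(['Type', 'PartitionKey']).size().reset_index(name='counts')
df_file_count = df_count[df_count['Type'] == 'File'].set_index('PartitionKey')

def direct_count(row):
    # A folder's Path is the PartitionKey of the files directly inside it
    if row['Type'] == 'Folder':
        try:
            return df_file_count.loc[row['Path'], 'counts']
        except KeyError:
            # The folder has no files directly inside it
            pass

    return 0

df['DirectFileCount'] = df.apply(direct_count, axis=1)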

The code above takes care of DirectFileCount and finishes in under 2 minutes.
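
The per-row `df.apply` is likely where most of those 2 minutes go. As a rough sketch (not the asker's code), the same direct count can be expressed with `value_counts` and `Series.map`, assuming the `Type`, `PartitionKey` and `Path` columns from the sample table and that a folder's `Path` equals the `PartitionKey` of its files:

```python
import pandas as pd

# Small sample mirroring the first eight rows of the table above
df = pd.DataFrame({
    'PartitionKey': ['/', '/People', '/People', '/People', '/People',
                     '/People/Employees', '/People/Employees', '/People/Employees'],
    'Type': ['Folder', 'File', 'File', 'File', 'Folder', 'File', 'File', 'File'],
    'Path': ['/People', '/People/index1.xlsx', '/People/index2.xlsx',
             '/People/index3.xlsx', '/People/Employees',
             '/People/Employees/cv1.pdf', '/People/Employees/cv2.pdf',
             '/People/Employees/cv3.pdf'],
})

# Number of File rows per parent folder (index = PartitionKey)
files_per_folder = df.loc[df['Type'].eq('File'), 'PartitionKey'].value_counts()

# Look up each row's Path in that table; paths with no files below them become 0
direct = df['Path'].map(files_per_folder).fillna(0).astype(int)

# Only folder rows keep a count; file rows are set to 0
df['DirectFileCount'] = direct.where(df['Type'].eq('Folder'), 0)
```

This avoids calling a Python function for every row, which usually matters at a few million rows, though the actual speedup would need to be measured on the real data.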

Update, October 16, 2019

I got RecursiveFileCount working, but it took 1 hour 52 minutes. The code is below:

# Sum the DirectFileCount of folder rows, grouped by their parent (PartitionKey)
dfc = df[df['Type'] == 'Folder'][['PartitionKey', 'DirectFileCount']].set_index('PartitionKey').groupby('PartitionKey').sum()

def recursive_count(row):
    count = 0

    if row['Type'] == 'Folder':
        # Full string scan of the dfc index for every folder row
        count = dfc[dfc.index.str.startswith(row['Path'])]['DirectFileCount'].sum()

    return count

df['RecursiveFileCount'] = df.apply(recursive_count, axis=1)

This now works and produces the results I need. However, it is quite slow on 2.7m rows, so I am hoping someone has ideas for improving performance.
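
One thing worth checking with a plain `str.startswith` prefix test is that it also matches sibling folders whose names merely begin with the same characters. The sketch below illustrates this with a hypothetical folder `/PeopleArchive` (not part of the sample data) and a separator-aware alternative:

```python
import pandas as pd

# '/PeopleArchive' is a made-up sibling folder used only for illustration
folders = pd.Series(['/People', '/People/Employees', '/PeopleArchive'])
prefix = '/People'

# Plain prefix test: also matches the unrelated sibling folder
naive = folders.str.startswith(prefix)

# Match the folder itself, or anything below it, using '/' as a boundary
strict = folders.eq(prefix) | folders.str.startswith(prefix + '/')

print(naive.tolist())   # [True, True, True]
print(strict.tolist())  # [True, True, False]
```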

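You can use groupby to group each folder with its associated files, and transform to count the number of files in each group. Then use Series.where to assign the counts only to the rows where Type == Folder. Finally, use Series.cumsum on the reversed column to compute the cumulative sum from the bottom up: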

count_direct_file = df.groupby(df['Type'].eq('Folder').cumsum())['Type'].transform('size') - 1
df['DirectFileCount'] = count_direct_file.where(df['Type'].eq('Folder'))
df['RecursiveFileCount'] = df['DirectFileCount'].iloc[::-1].cumsum().iloc[::-1]
print(df)

        PartitionKey        RowKey    Type                       Path  \
0                  /       /People  Folder                    /People   
1            /People  /index1.xlsx    File        /People/index1.xlsx   
2            /People  /index2.xlsx    File        /People/index2.xlsx   
3            /People  /index3.xlsx    File        /People/index3.xlsx   
4            /People    /Employees  Folder          /People/Employees   
5  /People/Employees      /cv1.pdf    File  /People/Employees/cv1.pdf   
6  /People/Employees      /cv2.pdf    File  /People/Employees/cv2.pdf   
7  /People/Employees      /cv3.pdf    File  /People/Employees/cv3.pdf   

          Name  DirectFileCount  RecursiveFileCount  
0       People              3.0                 6.0  
1  index1.xlsx              NaN                 NaN  
2  index2.xlsx              NaN                 NaN  
3  index3.xlsx              NaN                 NaN  
4    Employees              3.0                 3.0  
5      cv1.pdf              NaN                 NaN  
6      cv2.pdf              NaN                 NaN  
7      cv3.pdf              NaN                 NaN  
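
The target table in the question shows integer 0 rather than NaN on file rows. A minimal sketch of that last step, using a stand-in frame named `counts` (a hypothetical name) that holds just the two count columns from the output above:

```python
import numpy as np
import pandas as pd

# Stand-in for the two count columns printed above
counts = pd.DataFrame({
    'DirectFileCount': [3.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan],
    'RecursiveFileCount': [6.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan],
})

# Replace NaN with 0 on file rows, then cast the float columns back to integers
counts = (
    counts
    .fillna({'DirectFileCount': 0, 'RecursiveFileCount': 0})
    .astype({'DirectFileCount': 'int64', 'RecursiveFileCount': 'int64'})
)
print(counts)
```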

Details

df['Type'].eq('Folder').cumsum()
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: Type, dtype: int64
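
The cumulative-sum approach above relies on row order: each folder row has to be followed directly by its own files, and the folders have to form a single nested chain. On the full 16-row sample from the question, as far as I can tell, the reversed cumsum would carry the /Buildings counts into /People (12 instead of 6). The sketch below is an order-independent alternative, not the answerer's method. It assumes `PartitionKey` is always the parent of `Path` with `/` as the separator, and it loops only over distinct folders rather than over all rows:

```python
from collections import Counter
import posixpath

import pandas as pd

# Full sample from the question; PartitionKey is derived as the parent path
paths = [
    '/People', '/People/index1.xlsx', '/People/index2.xlsx', '/People/index3.xlsx',
    '/People/Employees', '/People/Employees/cv1.pdf', '/People/Employees/cv2.pdf',
    '/People/Employees/cv3.pdf',
    '/Buildings', '/Buildings/index1.xlsx', '/Buildings/index2.xlsx',
    '/Buildings/index3.xlsx', '/Buildings/Rooms', '/Buildings/Rooms/room1.pdf',
    '/Buildings/Rooms/room2.pdf', '/Buildings/Rooms/room3.pdf',
]
types = (['Folder'] + ['File'] * 3) * 4
df = pd.DataFrame({'Path': paths, 'Type': types})
df['PartitionKey'] = df['Path'].map(posixpath.dirname)

# Direct counts: number of File rows per parent folder
direct = df.loc[df['Type'].eq('File'), 'PartitionKey'].value_counts()

# Add each folder's direct count to the folder itself and to every ancestor
recursive = Counter()
for folder, n in direct.items():
    while folder not in ('/', ''):
        recursive[folder] += n
        folder = posixpath.dirname(folder)
    recursive['/'] += n  # everything also counts toward the root

# Map the lookups back onto folder rows; file rows get 0
is_folder = df['Type'].eq('Folder')
df['DirectFileCount'] = df['Path'].map(direct).where(is_folder).fillna(0).astype(int)
df['RecursiveFileCount'] = (
    df['Path'].map(dict(recursive)).where(is_folder).fillna(0).astype(int)
)
print(df[['Path', 'DirectFileCount', 'RecursiveFileCount']])
```

The Python loop runs once per distinct folder that directly contains files and walks up at most the folder's depth, so its cost should scale with the number of folders rather than the 2.7m rows, though this has not been timed on data of that size.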
