
Hash table mapping in Pandas

I have a large dataset with millions of rows of data. One of the data columns is ID.

I also have another (hash) table that maps ranges of indices to a specific group that meets certain criteria.

What is an efficient way to map these index ranges to their groups and add the result as an additional column on my dataset in pandas?

As an example, let's say that the dataset looks like this:

In [18]:
print(df_test)

Out [19]:
    ID
0   13
1   14
2   15
3   16
4   17
5   18
6   19
7   20
8   21
9   22
10  23
11  24
12  25
13  26
14  27
15  28
16  29
17  30
18  31
19  32

Now the hash table with the range of indices looks like this:

In [20]:
print(df_hash)

Out [21]:
   ID_first
0         0
1         2
2        10

where the index specifies the group number that I need.
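
For anyone who wants to reproduce the example, the two frames printed above can be rebuilt roughly as follows. This is a minimal sketch that assumes both frames use the default integer index, as the printed output suggests; the construction itself is not from the original post.

```python
import pandas as pd

# 20 rows with IDs 13..32 and the default index 0..19.
df_test = pd.DataFrame({"ID": range(13, 33)})

# Index = group number, ID_first = first row label of df_test in that group.
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

print(df_test.head())
print(df_hash)
```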

I tried doing something like this:

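for index in range(len(df_hash)):
    try:
        # .loc slices include the end label; the next pass overwrites that boundary row.
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except KeyError:
        # Last group: there is no next boundary, so fill to the end of the frame.
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index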

This works well, except that it is really slow, because it loops over the length of the hash table dataframe (hundreds of thousands of rows). It produces the following answer (which is what I want):

In [23]:
print(df_test)

Out [24]:
    ID  Group
0   13    0
1   14    0
2   15    1
3   16    1
4   17    1
5   18    1
6   19    1
7   20    1
8   21    1
9   22    1
10  23    2
11  24    2
12  25    2
13  26    2
14  27    2
15  28    2
16  29    2
17  30    2
18  31    2
19  32    2

Is there a way to do this more efficiently?

You can map the index of df_test through ID_first to the index of df_hash, and then ffill. You need to construct a Series because the pd.Index class doesn't have an ffill method.

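# Rows whose label starts a group get that group number (NaN elsewhere), then forward-fill.
# ffill's downcast argument is deprecated in recent pandas; astype(int) assumes the first row starts a group.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                      .ffill()
                      .astype(int))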

#    ID  group
#0   13      0
#1   14      0
#2   15      1
#...
#9   22      1
#10  23      2
#...
#17  30      2
#18  31      2
#19  32      2
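
A related vectorized option, which is not part of the answer above, is a binary search with NumPy's `searchsorted`. The sketch below assumes that ID_first is sorted in ascending order and that its first value is no larger than the smallest row label of df_test; if a label fell before the first start, the position would be -1 and would wrap around to the last group.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# Assumption: ID_first is sorted ascending and starts at or before the first row label.
starts = df_hash["ID_first"].to_numpy()

# side="right" counts how many group starts are <= each row label;
# subtracting 1 turns that count into a position within df_hash.
pos = np.searchsorted(starts, df_test.index.to_numpy(), side="right") - 1

# Translate positions back into df_hash's index labels (the group numbers).
df_test["group"] = df_hash.index.to_numpy()[pos]

print(df_test)
```

This avoids building a dictionary and works on plain arrays, which may help when df_hash has hundreds of thousands of rows.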

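You can combine Series.isin with Series.cumsum. Every row whose ID appears in ID_first marks the start of a new group, so the cumulative sum of that boolean mask numbers the groups starting from 1. Note that the output below uses a df_test whose ID column runs from 0 to 19, so that ID matches the values in ID_first.

df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum()  # append .sub(1) for zero-based group numbers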

print(df_test)

    ID  group
0    0      1
1    1      1
2    2      2
3    3      2
4    4      2
5    5      2
6    6      2
7    7      2
8    8      2
9    9      2
10  10      3
11  11      3
12  12      3
13  13      3
14  14      3
15  15      3
16  16      3
17  17      3
18  18      3
19  19      3
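
In the question's data the ID column holds 13 to 32, while ID_first refers to row labels, so the same idea can be applied to the index instead of the ID column. The following is a minimal sketch under two assumptions: every value of ID_first appears in df_test's index, and df_hash's index is 0, 1, 2, ... in the same order as the sorted starts.

```python
import pandas as pd

df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# True on each row whose index label starts a new group.
is_start = df_test.index.isin(df_hash["ID_first"])

# Running count of starts seen so far; subtract 1 for zero-based group numbers.
df_test["group"] = is_start.cumsum() - 1

print(df_test)
```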
