Hash table mapping in Pandas

Question

I have a large dataset with millions of rows. One of the columns is ID.

I also have another (hash) table that maps ranges of row indices to a specific group that meets certain criteria.

What is an efficient way to map these index ranges onto my dataset as an additional column in pandas?

As an example, let's say the dataset looks like this:

In [18]:
print(df_test)

Out [19]:
    ID
0   13
1   14
2   15
3   16
4   17
5   18
6   19
7   20
8   21
9   22
10  23
11  24
12  25
13  26
14  27
15  28
16  29
17  30
18  31
19  32

The hash table with the index ranges looks like this:

In [20]:
print(df_hash)

Out [21]:
   ID_first
0         0
1         2
2        10

where the index is the group number that I need and ID_first is the row index of df_test at which that group starts.
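For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two frames printed above. The constructor calls are my assumption about how such data could be created; the real data presumably comes from elsewhere.

```python
import pandas as pd

# Toy data matching the printed frames: 20 rows, ID running from 13 to 32.
df_test = pd.DataFrame({"ID": range(13, 33)})

# Each row gives the first row label of a group in df_test;
# the row's own index (0, 1, 2) is the group number.
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

print(df_test.head())
print(df_hash)
```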

I tried doing something like this:

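for index in range(len(df_hash)):
    try:
        # .loc slicing includes both endpoints; the next iteration overwrites the shared row
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except KeyError:
        # last group: there is no next start, so fill to the end
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index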

This works, but it is really slow because it loops over every row of the hash table dataframe (hundreds of thousands of rows). It produces the following result, which is what I want:

In [23]:
print(df_test)

Out [24]:
    ID  Group
0   13    0
1   14    0
2   15    1
3   16    1
4   17    1
5   18    1
6   19    1
7   20    1
8   21    1
9   22    1
10  23    2
11  24    2
12  25    2
13  26    2
14  27    2
15  28    2
16  29    2
17  30    2
18  31    2
19  32    2

Is there a way to do this more efficiently?

Answer 1

You can map the index of df_test to the index of df_hash using ID_first, and then ffill. Index labels that are not a range start map to NaN, and the forward fill carries the last group number down to them. You need to construct a Series because the pd.Index class doesn't have an ffill method.

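# Map each range start to its group number, forward-fill, then restore the integer dtype.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                      .ffill()
                      .astype(int))  # assumes row 0 starts a group, so no NaN is left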

#    ID  group
#0   13      0
#1   14      0
#2   15      1
#...
#9   22      1
#10  23      2
#...
#17  30      2
#18  31      2
#19  32      2
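As a self-contained sketch of the same map-then-ffill idea, the snippet below rebuilds the toy frames from the question and splits the steps apart. The names start_to_group and mapped are mine, and the final astype("int64") assumes the first range starts at the first row label so that no NaN survives the fill.

```python
import pandas as pd

# Toy frames from the question (construction is an assumption for illustration).
df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# Map each range start (a row label of df_test) to its group number.
start_to_group = dict(zip(df_hash["ID_first"], df_hash.index))

# Labels that are not a range start become NaN; wrap in a Series so ffill is available.
mapped = pd.Series(df_test.index.map(start_to_group), index=df_test.index)

# Carry the last seen group number forward and convert back to integers.
df_test["group"] = mapped.ffill().astype("int64")

print(df_test)
```

The work is one dictionary lookup per row plus a single forward fill, so nothing loops over df_hash in Python.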

Answer 2

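You can combine Series.isin with Series.cumsum. The output below comes from a df_test whose ID column runs from 0 to 19, so ID equals the row index; groups are numbered from 1 unless you uncomment .sub(1).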

df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum() #.sub(1)

print(df_test)

    ID  group
0    0      1
1    1      1
2    2      2
3    3      2
4    4      2
5    5      2
6    6      2
7    7      2
8    8      2
9    9      2
10  10      3
11  11      3
12  12      3
13  13      3
14  14      3
15  15      3
16  16      3
17  17      3
18  18      3
19  19      3
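With the question's data the ID column (13 to 32) does not match the row labels, so a variant that tests the index instead of the ID column may be closer to what was asked. This is only a sketch: is_start is a name I made up, and it assumes every ID_first value is present in the index and that the first range starts at the first row.

```python
import pandas as pd

# Toy frames from the question (construction is an assumption for illustration).
df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# Flag the row labels that open a new range.
is_start = pd.Series(df_test.index.isin(df_hash["ID_first"]), index=df_test.index)

# A running count of the flags gives 1-based group numbers; subtract 1 for zero-based.
df_test["group"] = is_start.cumsum() - 1

print(df_test)
```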

2 answers

Answer 1
Score 3, accepted, 2021-03-31 16:49:20

Answer 2
Score 2, 2021-03-31 16:43:53