
Improving performance of pandas data frame

I am trying to encode the person_id values. First I create a dict that stores the person_id values, then I add the encoded values in a new column. It takes around 25 minutes to process 70K rows of data.

Dataset: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop

interactions_df = pd.read_csv('./users_interactions.csv')

personId_map = {}
personId_len = range(0,len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]


%%time

def transform_person_id(row):
    # Scans the whole dict for every row: O(rows * unique ids)
    if row['personId'] in personId_map.values():
        return int([k for k, v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(transform_person_id, axis=1)
interactions_df.head(3)

Time consumed:

CPU times: user 25min 46s, sys: 1.58 s, total: 25min 48s
Wall time: 25min 50s

How can I optimize the above code?

If there is no special rule for ordering, use factorize:

interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
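To see what factorize does, here is a minimal standalone example (toy data, not from the dataset above): it returns integer codes in order of first appearance plus the array of unique values.

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
codes, uniques = pd.factorize(s)
print(codes)          # [0 1 0 2] - code assigned at first appearance
print(list(uniques))  # ['b', 'a', 'c']
```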

If the dictionary is also needed:

i, v = pd.factorize(interactions_df.personId)
personId_map = dict(zip(i, v[i]))
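Since `zip(i, v[i])` just pairs each code with its own unique value (dict insertion dedupes the repeats), an equivalent and slightly simpler way to build the same mapping is `dict(enumerate(uniques))` — a sketch with toy data:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
codes, uniques = pd.factorize(s)
# code -> value; equivalent to dict(zip(codes, uniques[codes]))
personId_map = dict(enumerate(uniques))
print(personId_map)  # {0: 'b', 1: 'a', 2: 'c'}
```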

Data - first 20 rows for test:

interactions_df = pd.read_csv('./users_interactions.csv', nrows=20, usecols=['personId'])

#print (interactions_df)

personId_map = {}
personId_len = range(0,len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]

#print (personId_map)

def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k,v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df['new_personId1'] = pd.factorize(interactions_df.personId)[0]

print (interactions_df)
               personId  new_personId  new_personId1
0  -8845298781299428018             3              0
1  -1032019229384696495             5              1
2  -1130272294246983140             9              2
3    344280948527967603             6              3
4   -445337111692715325             0              4
5  -8763398617720485024            10              5
6   3609194402293569455             4              6
7   4254153380739593270             8              7
8    344280948527967603             6              3
9   3609194402293569455             4              6
10  3609194402293569455             4              6
11  1908339160857512799            11              8
12  1908339160857512799            11              8
13  1908339160857512799            11              8
14  7781822014935525018             1              9
15  8239286975497580612             2             10
16  8239286975497580612             2             10
17  -445337111692715325             0              4
18  2766187446275090740             7             11
19  1908339160857512799            11              8

i, v = pd.factorize(interactions_df.personId)
d = dict(zip(i, v[i]))
print (d)
{0: -8845298781299428018, 1: -1032019229384696495, 2: -1130272294246983140, 
 3: 344280948527967603, 4: -445337111692715325, 5: -8763398617720485024, 
 6: 3609194402293569455, 7: 4254153380739593270, 8: 1908339160857512799,
 9: 7781822014935525018, 10: 8239286975497580612, 11: 2766187446275090740}

Performance:

interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])

#print (interactions_df)

In [243]: %timeit interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
2.03 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
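If the numbering from the original `personId_map` must be preserved exactly, another fast option (a sketch, with hypothetical toy data) is to invert the dict once and use the vectorized `Series.map` instead of a per-row `apply` scan:

```python
import pandas as pd

s = pd.Series([10, 20, 10, 30])
personId_map = {0: 10, 1: 20, 2: 30}                   # code -> personId, as in the question
inverse_map = {v: k for k, v in personId_map.items()}  # invert once, O(unique ids)
new_ids = s.map(inverse_map)                           # one hash lookup per row
print(new_ids.tolist())  # [0, 1, 0, 2]
```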
