Python: most efficient way to apply a dictionary mapping to a pandas dataframe
I have a dictionary of dictionaries, each containing a mapping for one column of my dataframe. My goal is to find the most efficient way to apply these mappings to a dataframe with 1 row and 300 columns. The dataframe is randomly sampled from range(mapping_size), and each dictionary maps values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).

I can see from the answer provided by jpp that map is possibly the most efficient approach, but I am looking for something even faster than map. Can you think of any? I am happy for the input to use a data structure other than a pandas dataframe.
Here is the code for setting up the question, together with the results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300  # number of columns
nrow = 1  # number of rows
mapping_size = 10  # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
    # create an empty dictionary for this column
    mapping_dict['col_' + str(idx1)] = {}
    for inx2 in range(mapping_size):
        # map each key to random.randint(mapping_size+1, mapping_size*2)
        mapping_dict['col_' + str(idx1)][inx2 + 1] = random.randint(mapping_size + 1, mapping_size * 2)

# create a dataframe with values sampled from range(mapping_size)
d = {}
np.random.seed(123)  # np.random.choice is seeded through numpy, not the random module
for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
df = pd.DataFrame(data=d)
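A quick, self-contained sanity check of this setup (it rebuilds the objects above with the same parameters; note that the key range and the sampled value range only partially overlap, which is what makes the fillna step below necessary):

```python
import random

import numpy as np
import pandas as pd

# re-run the setup above (same parameters and seeds)
ncol, nrow, mapping_size = 300, 1, 10
random.seed(123)
mapping_dict = {}
for idx1 in range(ncol):
    mapping_dict['col_' + str(idx1)] = {
        inx2 + 1: random.randint(mapping_size + 1, mapping_size * 2)
        for inx2 in range(mapping_size)
    }
np.random.seed(123)
df = pd.DataFrame({'col_' + str(i): np.random.choice(range(mapping_size), nrow)
                   for i in range(ncol)})

# The dictionaries are keyed 1..mapping_size, but the sampled values are
# 0..mapping_size-1: key 0 is never mapped, so a plain .map() produces NaN
# for it and needs .fillna() to keep the original value.
assert set(mapping_dict['col_0']) == set(range(1, mapping_size + 1))
assert 0 <= df.values.min() and df.values.max() < mapping_size
```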
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict)  # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])  # 181 ms
Just use pandas without a Python for loop.
# runtime ~ 1 s (1000 rows)
# create a mapping Series with a MultiIndex: (column, key) -> value
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()
# obj_dict
# col_0  1    10
#        2    14
#        3    11
# Length: 3000, dtype: int64

# convert df to the mapping Series' index; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict.reindex(idx)  # reindex avoids a KeyError for unmapped (column, value) pairs

# handle null values: fall back to the original value
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# transform to the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns
df_result
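To see what the stacking technique is doing, here is a tiny self-contained version with a hypothetical 2-column mapping (my own example data; it uses reindex so that unmapped (column, value) pairs become NaN rather than raising a KeyError on recent pandas):

```python
import numpy as np
import pandas as pd

# hypothetical 2-column example of the MultiIndex-lookup technique above
mapping_dict = {'col_0': {1: 14, 2: 11}, 'col_1': {1: 17, 2: 12}}
df = pd.DataFrame({'col_0': [0, 2], 'col_1': [1, 2]})

obj_dict = pd.DataFrame(mapping_dict).T.stack()  # (column, key) -> mapped value
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index

result = obj_dict.reindex(idx)  # NaN where a (column, value) pair has no mapping
cond = result.isnull()
# each index entry is a (column, value) tuple; .str[1] recovers the original value
result[cond] = pd.Series(result[cond].index.values).str[1].values

df_result = pd.DataFrame(result.values.reshape(df.shape), columns=df.columns)
# col_0's value 0 has no mapping and is kept as-is; everything else is mapped
```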