Python: most efficient way to apply a dictionary mapping to a pandas dataframe
I have a dictionary of dictionaries, each containing a mapping for one column of my dataframe. My goal is to find the most efficient way to apply these mappings to a dataframe with 1 row and 300 columns. The dataframe is randomly sampled from range(mapping_size), and each dictionary maps values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).

I can see from the answer provided by jpp that map is possibly the most efficient approach, but I am looking for something even faster than map. Can you think of any? I am happy for the input to use a data structure other than a pandas dataframe.
Here is the code for setting up the question, together with the results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300  # number of columns
nrow = 1  # number of rows
mapping_size = 10  # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
    # create an empty dictionary for this column
    mapping_dict['col_' + str(idx1)] = {}
    for inx2 in range(mapping_size):
        # map each key to random.randint(mapping_size+1, mapping_size*2)
        mapping_dict['col_' + str(idx1)][inx2 + 1] = random.randint(mapping_size + 1, mapping_size * 2)

# create a dataframe with values sampled from range(mapping_size)
d = {}
np.random.seed(123)  # np.random.choice is seeded through numpy, not the random module
for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
df = pd.DataFrame(data=d)
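A quick, self-contained sanity check of this setup (it rebuilds the objects above with the same parameters; note that the key range and the sampled value range only partially overlap, which is what makes the fillna step below necessary):

```python
import random

import numpy as np
import pandas as pd

# re-run the setup above (same parameters and seeds)
ncol, nrow, mapping_size = 300, 1, 10
random.seed(123)
mapping_dict = {}
for idx1 in range(ncol):
    mapping_dict['col_' + str(idx1)] = {
        inx2 + 1: random.randint(mapping_size + 1, mapping_size * 2)
        for inx2 in range(mapping_size)
    }
np.random.seed(123)
df = pd.DataFrame({'col_' + str(i): np.random.choice(range(mapping_size), nrow)
                   for i in range(ncol)})

# The dictionaries are keyed 1..mapping_size, but the sampled values are
# 0..mapping_size-1: key 0 is never mapped, so a plain .map() produces NaN
# for it and needs .fillna() to keep the original value.
assert set(mapping_dict['col_0']) == set(range(1, mapping_size + 1))
assert 0 <= df.values.min() and df.values.max() < mapping_size
```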
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict)  # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])  # 181 ms
Just use pandas without a Python for loop.
# runtime ~ 1 s (1000 rows)
# create a mapping Series with a MultiIndex: (column, key) -> value
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()
# obj_dict
# col_0  1    10
#        2    14
#        3    11
# Length: 3000, dtype: int64

# convert df to the mapping Series' index; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict.reindex(idx)  # reindex avoids a KeyError for unmapped (column, value) pairs

# handle null values: fall back to the original value
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# transform to the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns
df_result
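To see what the stacking technique is doing, here is a tiny self-contained version with a hypothetical 2-column mapping (my own example data; it uses reindex so that unmapped (column, value) pairs become NaN rather than raising a KeyError on recent pandas):

```python
import numpy as np
import pandas as pd

# hypothetical 2-column example of the MultiIndex-lookup technique above
mapping_dict = {'col_0': {1: 14, 2: 11}, 'col_1': {1: 17, 2: 12}}
df = pd.DataFrame({'col_0': [0, 2], 'col_1': [1, 2]})

obj_dict = pd.DataFrame(mapping_dict).T.stack()  # (column, key) -> mapped value
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index

result = obj_dict.reindex(idx)  # NaN where a (column, value) pair has no mapping
cond = result.isnull()
# each index entry is a (column, value) tuple; .str[1] recovers the original value
result[cond] = pd.Series(result[cond].index.values).str[1].values

df_result = pd.DataFrame(result.values.reshape(df.shape), columns=df.columns)
# col_0's value 0 has no mapping and is kept as-is; everything else is mapped
```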