
Fastest way to apply function/dict to every element in a pandas DataFrame on selected columns

I would like to:

  • Read hundreds of tab-delimited files into pandas DataFrames
  • Decide whether to apply the function based on FileNo
  • Apply the function to every element in selected columns
  • Append and concatenate all DataFrames into a single frame

Sample file:

ID     FileNo    Name     A1     A2     A3
1      0         John     a-b    b-a    a-a
2      0         Carol    b-b    a-b    a-b
[...]
500    0         Steve    a-a    b-b    a-b
501    0         Jack     b-a    b-a    a-b

True dimension for each file: 2000x15000

Function: reverse the string.

flip_over = lambda x: x[::-1]
or
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
df[col] = df[col].map(my_dict)
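For illustration, both approaches give the same result when applied to a single column (the Series below is a made-up sample; the real columns are A1–A3):

```python
import pandas as pd

flip_over = lambda x: x[::-1]
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}

# A toy column standing in for A1/A2/A3
s = pd.Series(['a-b', 'b-a', 'a-a'])
flipped_fn = s.map(flip_over)    # applies the lambda element-wise
flipped_dict = s.map(my_dict)    # looks each value up in the dict
print(flipped_fn.tolist())       # ['b-a', 'a-b', 'a-a']
```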

What I currently have:

whether_to_flip = [7,15,23,36,48,85]
frames = []
base_path = "/home/user/file_"

for i in range(0, 100):
    path = base_path + str(i) + ".tsv"
    df = pd.read_csv(path, sep="\t", header=None)
    df['FileNo'] = str(i)
    if i in whether_to_flip:
        for j in range(3, 6):
            df[j] = df[j].map(my_dict)
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True)

This is currently taking hours to finish reading and processing, and I hit the memory limit when I need to increase the number of files to read.

I would appreciate any help to improve this code. In particular,

  • Is this the best/fastest way to apply the function?
  • Is this the best/fastest way to append and concatenate many DataFrames?

Thank you.

First, I guess you should understand how much time you lose reading the csv files versus the time spent inverting the strings.
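A minimal way to get that split is to time the two steps separately; here an in-memory sample stands in for one of the real tab-delimited files:

```python
import time
from io import StringIO

import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
# In-memory stand-in for one tab-delimited file
sample = "1\t0\tJohn\ta-b\tb-a\ta-a\n2\t0\tCarol\tb-b\ta-b\ta-b\n"

t0 = time.perf_counter()
df = pd.read_csv(StringIO(sample), sep="\t", header=None)
t1 = time.perf_counter()
df[[3, 4, 5]] = df[[3, 4, 5]].replace(my_dict)  # flip only the string columns
t2 = time.perf_counter()

print(f"read: {t1 - t0:.6f}s, flip: {t2 - t1:.6f}s")
```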

I can see a couple of things that can speed up the program:

Avoid the loop over the columns

You can use replace and my_dict: (ref)

if i in whether_to_flip:
    df = df.replace(my_dict)
#   df = df.replace({'A1': my_dict, 'A2': my_dict, 'A3': my_dict})

I think this should give considerable improvement in performance.
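For example, the nested-dict form of replace restricts the substitution to named columns, so untargeted columns stay untouched (the small frame below is invented for illustration):

```python
import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
df = pd.DataFrame({'Name': ['John', 'Carol'],
                   'A1': ['a-b', 'b-b'],
                   'A2': ['b-a', 'a-b'],
                   'A3': ['a-a', 'a-b']})

# Replace values in every column at once...
all_cols = df.replace(my_dict)
# ...or only in selected columns via a column -> dict mapping
some_cols = df.replace({'A1': my_dict, 'A2': my_dict})
print(all_cols['A1'].tolist())   # ['b-a', 'b-b']
```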

List comprehension to avoid .append

This can make the syntax a bit more cumbersome, but could give a small efficiency gain.

def do_path(i):
    return base_path + str(i) + ".tsv"



frames = [pd.read_csv(do_path(i), sep="\t", header=None).assign(FileNo=str(i))
          if i not in whether_to_flip
          else pd.read_csv(do_path(i), sep="\t", header=None).replace(my_dict).assign(FileNo=str(i))
          for i in range(0, 100)]
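Put together, here is a self-contained sketch of the whole pipeline; two tiny synthetic .tsv files in a temporary directory stand in for the real inputs, and the 100-file range is shrunk to match:

```python
import os
import tempfile

import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
whether_to_flip = [1]           # toy stand-in for the real file-number list

# Write two tiny stand-in .tsv files into a temp directory
tmp = tempfile.mkdtemp()
base_path = os.path.join(tmp, "file_")
for i in range(2):
    with open(base_path + str(i) + ".tsv", "w") as f:
        f.write("1\t0\tJohn\ta-b\tb-a\ta-a\n")

def do_path(i):
    return base_path + str(i) + ".tsv"

frames = [pd.read_csv(do_path(i), sep="\t", header=None).assign(FileNo=str(i))
          if i not in whether_to_flip
          else pd.read_csv(do_path(i), sep="\t", header=None).replace(my_dict).assign(FileNo=str(i))
          for i in range(2)]
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined.iloc[1, 3])      # file 1 was flipped: 'b-a'
```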
