Fastest way to apply function/dict to every element in a pandas DataFrame on selected columns

I would like to:

  • Read hundreds of tab-delimited files into pandas DataFrames
  • Decide whether to apply function based on FileNo
  • Apply function to every element on selected columns
  • Append and concatenate all DataFrames into a single frame

Sample file:

ID    FileNo    Name    A1    A2    A3
1    0     John    a-b    b-a    a-a
2    0    Carol    b-b    a-b    a-b
[...]
500    0   Steve    a-a    b-b     a-b
501    0    Jack     b-a    b-a     a-b

True dimension for each file: 2000x15000

Function: reverse the string.

flip_over = lambda x: x[::-1]
or
my_dict = {'a-b':'b-a', 'a-a':'a-a', 'b-b':'b-b', 'b-a':'a-b'}
map(my_dict)
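As a quick sanity check, the two variants can be compared on the four values that appear in the sample. This is only a sketch: the Series `s` is a made-up stand-in for one of the A1..A3 columns.

```python
import pandas as pd

flip_over = lambda x: x[::-1]
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}

# Hypothetical column holding every value that occurs in the sample file.
s = pd.Series(["a-b", "a-a", "b-b", "b-a"])

via_lambda = s.map(flip_over)   # element-wise Python call
via_dict = s.map(my_dict)       # dictionary lookup per element

# Both approaches should produce the same flipped values.
same = via_lambda.tolist() == via_dict.tolist()
print(same)
```

Note that the dictionary only covers values listed as keys; with `Series.map`, any other value becomes NaN.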

What I currently have:

whether_to_flip = [7,15,23,36,48,85]
frames = []
base_path = "/home/user/file_"

for i in range(0, 100):
    path = base_path + str(i) + ".tsv"
    df = pd.read_csv(path, sep="\t", header=None)
    df['FileNo'] = str(i)
    if i in whether_to_flip:
        for j in range(3, 6):
            df[j] = df[j].map(my_dict)
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True)

This is currently taking hours to finish reading and processing, and I hit the memory limit when I need to increase the number of files to read.

I would appreciate any help to improve this code. In particular,

  • Is this the best/fastest way to apply a function?
  • Is this the best/fastest way to append and concatenate many DataFrames?

Thank you.

First, I would measure how much time is spent reading the CSV files versus inverting the strings.
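A rough way to take that measurement is to time the two steps separately with `time.perf_counter`. The sketch below uses a small in-memory table (`sample_tsv`, a made-up stand-in) instead of a real file, so the absolute numbers mean nothing; on real data, point `read_csv` at one of the actual paths.

```python
import io
import time

import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}

# Hypothetical stand-in for one input file: 2000 rows, 6 tab-separated columns.
sample_tsv = "1\t0\tJohn\ta-b\tb-a\ta-a\n2\t0\tCarol\tb-b\ta-b\ta-b\n" * 1000

t0 = time.perf_counter()
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None)
t1 = time.perf_counter()

# Same transformation as in the question: map the dict over columns 3..5.
for j in range(3, 6):
    df[j] = df[j].map(my_dict)
t2 = time.perf_counter()

read_seconds = t1 - t0
flip_seconds = t2 - t1
print(f"read: {read_seconds:.4f}s, flip: {flip_seconds:.4f}s")
```

If reading dominates, optimizing the string flip will not help much, and the effort is better spent on the `read_csv` call itself.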

I can see a couple of things that can speed up the program:

Avoid the loop over the columns

You can use replace with my_dict:

if i in whether_to_flip:
    df = df.replace(my_dict)
#   df = df.replace({'A1': my_dict, 'A2': my_dict, 'A3': my_dict})

I think this should give a considerable improvement in performance.
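Calling `replace` on the whole frame also touches columns such as Name, so it may be safer to restrict it to the columns that should be flipped. A minimal sketch, assuming the columns are named A1..A3 as in the sample header (with `header=None` they would be the integers 3, 4, 5 instead):

```python
import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}

# Two rows taken from the sample file in the question.
df = pd.DataFrame({
    "ID": [1, 2],
    "FileNo": [0, 0],
    "Name": ["John", "Carol"],
    "A1": ["a-b", "b-b"],
    "A2": ["b-a", "a-b"],
    "A3": ["a-a", "a-b"],
})

cols = ["A1", "A2", "A3"]  # only these columns are flipped

# One replace call on the column subset, no explicit loop over columns.
df[cols] = df[cols].replace(my_dict)
print(df)
```

Whether this beats the `Series.map` loop depends on the data and the pandas version, so it is worth timing both on one real file before committing to either.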

List comprehension to avoid .append

This makes the syntax a bit more cumbersome, but could give a small efficiency gain.

def do_path(x):
    # Build the path for file number x (the question's files end in .tsv)
    return base_path + str(x) + ".tsv"

frames = [
    pd.read_csv(do_path(i), sep="\t", header=None).assign(FileNo=str(i)) if i not in whether_to_flip
    else pd.read_csv(do_path(i), sep="\t", header=None).replace(my_dict).assign(FileNo=str(i))
    for i in range(0, 100)
]
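Putting the pieces together, the comprehension is easier to read if the per-file work is moved into a helper and `pd.concat` is called once at the end. This is a sketch under assumptions: `load_one` is a hypothetical helper, and the three tiny files are written to a temporary directory only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
whether_to_flip = {1}        # hypothetical: only file 1 is flipped here
flip_cols = [3, 4, 5]        # positional labels, because header=None


def load_one(path, file_no, flip):
    """Read one file, optionally flip the selected columns, tag it with FileNo."""
    df = pd.read_csv(path, sep="\t", header=None)
    if flip:
        df[flip_cols] = df[flip_cols].replace(my_dict)
    return df.assign(FileNo=str(file_no))


with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "file_")
    # Write three small stand-in files so the sketch is self-contained.
    for i in range(3):
        with open(base_path + str(i) + ".tsv", "w") as fh:
            fh.write("1\t0\tJohn\ta-b\tb-a\ta-a\n2\t0\tCarol\tb-b\ta-b\ta-b\n")

    frames = [
        load_one(base_path + str(i) + ".tsv", i, i in whether_to_flip)
        for i in range(3)
    ]

# A single concat at the end avoids repeatedly copying a growing frame.
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```

Using a set for `whether_to_flip` keeps the membership test cheap, although with only a handful of entries the difference is negligible.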
