
Python: Efficient split column in pandas DF

Suppose I have a DataFrame that contains a column of the form

0     A.1
1     A.2
2     B.3
3     4.C

And suppose that I want to split this column on '.' and keep only the element after the '.'. A naive way to do that would be:

for i in range(len(tbl)):
  tbl['column_name'].iloc[i] = tbl['column_name'].iloc[i].split('.',1)[1] 

This works, but it is very slow for large tables. Does anyone have an idea how to speed up the process? I can add new columns to the DataFrame, so I am not restricted to overwriting the source column (as I do in the example). Thanks!

pandas has string methods that do this kind of thing efficiently, without explicit Python loops (which kill performance). In this case, you can use .str.split:

>> import pandas as pd
>> df = pd.DataFrame({'a': ['A.1', 'A.2', 'B.3', 'C.4']})
>> df
    a
0   A.1
1   A.2
2   B.3
3   C.4
>> df.a.str.split('.').apply(pd.Series)
    0   1
0   A   1
1   A   2
2   B   3
3   C   4
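If only the part after the '.' is needed, as in the question, a minimal sketch along the same lines could look like the following. It assumes the sample frame from this answer; the column name after_dot is made up for illustration. Passing n=1 splits on the first '.' only, which mirrors the split('.', 1) call in the question.

```python
import pandas as pd

# Sample data from the answer above.
df = pd.DataFrame({'a': ['A.1', 'A.2', 'B.3', 'C.4']})

# Split on the first '.' only, then take element 1 of each resulting list.
# 'after_dot' is a hypothetical column name used for this sketch.
df['after_dot'] = df['a'].str.split('.', n=1).str[1]

# Alternatively, expand=True returns the pieces as DataFrame columns,
# so no .apply(pd.Series) step is needed.
parts = df['a'].str.split('.', n=1, expand=True)
```

Here parts[1] holds the same values as df['after_dot'].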

For a large DataFrame, it is much faster to use map than a for loop:

%timeit df['newcol']  = df.column_name.map(lambda x: x.split('.')[1])
100 loops, best of 3: 10.7 ms per loop

%timeit for i in range(len(df)): df['newcol'].iloc[i] = df['column_name'].iloc[i].split('.',1)[1]
1 loops, best of 3: 7.63 s per loop
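For reference, here is a self-contained sketch of the map approach on a small frame. The sample values and the helper name after_first_dot are assumptions for illustration, not part of the timing above. It uses split('.', 1) like the question does, so a value with more than one '.' keeps everything after the first one.

```python
import pandas as pd

# Hypothetical sample data; 'D.5.6' is included to show the effect of maxsplit=1.
df = pd.DataFrame({'column_name': ['A.1', 'A.2', 'B.3', '4.C', 'D.5.6']})

def after_first_dot(value):
    # Split once on the first '.' and return the remainder.
    # Raises IndexError if the value contains no '.' at all.
    return value.split('.', 1)[1]

# map calls the function once per element and builds the new column in one pass,
# instead of assigning element by element through .iloc.
df['newcol'] = df['column_name'].map(after_first_dot)
```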

