I have a large dataframe where each row contains a string. I want to split each string into several columns, and also replace two character types.
The code below does the job, but it is slow on a large dataframe. Is there a faster way than using a for loop?
import re
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = pd.DataFrame({'col1': [0,0], 'col2': [0,0], 'col3': [0,0]})
for i in range(df.shape[0]):
    df_new.iloc[i, :] = re.split(',', df.iloc[i, 0].replace('[', '').replace(']', ''))
Your solution can be simplified with Series.str.strip and Series.str.split:
df1 = df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
print(df1)
col0 col1 col2
0 3.4 3.4 2.5
1 3.4 3.4 2.5
If performance is important, use a list comprehension instead of the pandas string functions:
df1 = pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
Timings:
#20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [208]: %timeit df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
61.5 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
29.8 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
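Note that both approaches produce columns of strings, not numbers. If you need numeric values downstream, a small follow-up sketch (building on the example data above) is to cast the result with DataFrame.astype:

```python
import pandas as pd

df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])

# Split into string columns, then cast every column to float
df1 = (df[0].str.strip('[]')
            .str.split(', ', expand=True)
            .add_prefix('col')
            .astype(float))

print(df1.dtypes)  # every column is float64 after the cast
```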
You can do it with:
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = df[0].str[1:-1].str.split(",", expand=True)
df_new.columns = ["col1", "col2", "col3"]
The idea is to first get rid of the [ and ], then split by , and expand into a dataframe. The last step is to rename the columns.
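If the strings are always valid Python list literals like the ones above, another option (a sketch not taken from the answers here, assuming well-formed input) is to parse each string with ast.literal_eval, which yields floats directly and skips the string-splitting step:

```python
import ast
import pandas as pd

df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])

# Parse each string into an actual list of floats, then build the frame
parsed = pd.DataFrame([ast.literal_eval(s) for s in df[0]])
parsed.columns = ["col1", "col2", "col3"]
```

This avoids a separate astype conversion, at the cost of literal_eval being slower than plain string methods on very large frames.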