How to speed up transferring column values from a pandas dataframe to another dataframe
I have a pandas dataframe such as:
And after a complex process I want a dataframe such as:
So, I do this:
import pandas as pd

def complex_process(value):
    # Placeholder for the real processing: one result per comma-separated item
    values = value.split(',')
    return ['results for ' + x for x in values]

df = pd.DataFrame([['id1', 'a,b,c'], ['id2', 'd'], ['id3', 'e,f']], columns=['id', 'value'])

result_list = []
id_list = []
value_list = []
for row in df.itertuples():
    results = complex_process(row.value)
    # Repeat the id and value once for every result produced by this row
    for result in results:
        result_list.append(result)
        id_list.append(row.id)
        value_list.append(row.value)

df_new = pd.DataFrame()
df_new['id'] = id_list
df_new['value'] = value_list
df_new['result'] = result_list
This takes a long time with a large dataset. I tested the complex process and it doesn't take very long. Is there a faster way to transfer the columns?
Doing this operation with lists and loops is cumbersome, and looping through DataFrames is computationally expensive. pandas has lots of built-in operations, so most of the time you shouldn't need to iterate through a DataFrame.
Since your complex_process function is intended as a placeholder, let's apply your function to each row using .apply, and save the results in a new column called result:
df['result'] = df.value.apply(complex_process)
Your DataFrame will look like this:
>>> df
    id  value                                         result
0  id1  a,b,c  [results for a, results for b, results for c]
1  id2      d                                [results for d]
2  id3    e,f                 [results for e, results for f]
Now you can use the convenient .explode method to expand a list-like column into rows. This will duplicate the other columns and the index, so we can reset the index as well, and drop the old index.
df_new = df.explode('result').reset_index(drop=True)
Final result:
>>> df_new
    id  value         result
0  id1  a,b,c  results for a
1  id1  a,b,c  results for b
2  id1  a,b,c  results for c
3  id2      d  results for d
4  id3    e,f  results for e
5  id3    e,f  results for f
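To check that the .apply and .explode approach gives the same rows as the original loop, here is a minimal self-contained sketch. It reuses complex_process and the sample data from the question. The names rows, df_loop and exploded are only illustrative, and the use of .assign (so that the original df is left unmodified) is a choice made for this sketch rather than something the steps above require.

```python
import pandas as pd


def complex_process(value):
    # Placeholder from the question: one result per comma-separated item.
    return ['results for ' + x for x in value.split(',')]


df = pd.DataFrame(
    [['id1', 'a,b,c'], ['id2', 'd'], ['id3', 'e,f']],
    columns=['id', 'value'],
)

# Loop-based version from the question, kept only as a reference.
rows = []
for row in df.itertuples():
    for result in complex_process(row.value):
        rows.append((row.id, row.value, result))
df_loop = pd.DataFrame(rows, columns=['id', 'value', 'result'])

# apply + explode version; assign() returns a new frame, so df is untouched.
exploded = df.assign(result=df['value'].apply(complex_process)).explode('result')

# Before reset_index, rows that came from the same source row share a label.
print(exploded.index.tolist())

df_new = exploded.reset_index(drop=True)

# Compare the cell values of both approaches, row by row.
print(df_new.values.tolist() == df_loop.values.tolist())
```

The first print shows the duplicated index labels that .explode leaves behind, which is why reset_index(drop=True) is applied afterwards. The comparison is done on plain Python lists rather than with DataFrame.equals, since the two frames may end up with different column dtypes depending on the pandas version.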