Subsetting strings present in a column of a dataframe, depending on value of another column - Pandas
I have a dataframe with two columns, A and B, containing strings and integers respectively. For example, consider the following data.
import pandas as pd
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})
Now I have to create another column C where, for each index i, df['C'][i] contains the string s, where s is the substring of df['A'][i] starting from its df['B'][i]-th character. For the above example the output will be:
A B C
xxxdddrrrfvhdddfff 3 xdddrrrfvhdddfff
trdyuuweewy 2 rdyuuweewy
oooeereghtyuj 6 reghtyuj
This can be done very easily using lambdas or for loops.
My attempt:
df['C']=df.apply(lambda x: x.A[x['B']-1:], axis=1)
But my dataset is huge (around 5 million rows), so using loops or lambdas is not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.
You can avoid pandas apply and make this more efficient with native Python. Kindly try the following:
df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
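As a quick check, the comprehension above can be run against the sample data from the question. This is only a minimal sketch: the helper name `suffix_from` is hypothetical (it is not part of pandas), and it assumes every value in B is at least 1, since a value of 0 would turn the slice into `s[-1:]`.

```python
import pandas as pd

def suffix_from(strings, starts):
    """Slice each string from its 1-based start position (hypothetical helper)."""
    # k - 1 converts the 1-based position in column B to a 0-based slice start
    return [s[k - 1:] for s, k in zip(strings, starts)]

df = pd.DataFrame({
    'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"],
    'B': [3, 2, 6],
})
df['C'] = suffix_from(df['A'], df['B'])
print(df)
```

Column C then holds the same three strings as the expected output in the question.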
I tested using 3000 rows and 1000 iterations:
import time
import pandas as pd

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []
# Time the list comprehension over zip
for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)
# Time the row-wise apply (same y-1 slice, so both compute the same column)
for i in range(1000):
    start = time.time()
    df['C']=df.apply(lambda x: x.A[x['B']-1:], axis=1)
    end = time.time()
    times_apply.append(end-start)
The average time per execution using apply, in seconds, is:
0.035329506397247315
Whereas the average time using zip, in seconds, was:
0.0006626224517822265
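A comparison like this can also be sketched with the standard-library `timeit` module instead of manual `time.time()` calls. The function names `with_zip` and `with_apply` and the run count of 20 are assumptions made for this illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"] * 1000,
    'B': [3, 2, 6] * 1000,
})

def with_zip():
    # Plain Python loop over the two columns in lockstep
    return [x[y - 1:] for x, y in zip(df['A'], df['B'])]

def with_apply():
    # Row-wise apply builds a Series object per row, which is the slow part
    return df.apply(lambda row: row['A'][row['B'] - 1:], axis=1)

runs = 20  # assumption: far fewer runs than the 1000 used above, to keep it quick
avg_zip = timeit.timeit(with_zip, number=runs) / runs
avg_apply = timeit.timeit(with_apply, number=runs) / runs
print(f"zip:   {avg_zip:.6f} s per run")
print(f"apply: {avg_apply:.6f} s per run")
```

Both functions return the same values, so the timings compare equal work.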