Subset a string in a DataFrame column depending on the value of another column - Pandas

Question

I have a DataFrame with two columns, A and B, containing strings and integers respectively. For example, consider the following data.

import pandas as pd
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B': [3, 2, 6]})

Now I need to create another column C where, for each index i, df['C'][i] contains the substring of df['A'][i] starting from its df['B'][i]-th character (counting from 1). For the above example, the output would be:

A                     B    C
xxxdddrrrfvhdddfff    3    xdddrrrfvhdddfff
trdyuuweewy           2    rdyuuweewy
oooeereghtyuj         6    reghtyuj

This can be done very easily using lambdas or for loops.

My attempt:

df['C'] = df.apply(lambda x: x['A'][x['B'] - 1:], axis=1)

But my dataset is huge (around 5 million rows), so loops and lambdas are not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestions are highly appreciated. Thank you.

Answer 1

You can avoid pandas apply and make this more efficient with native Python. Try the following:

df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
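
As a self-contained check, the sketch below runs the same comprehension on the sample data from the question. The `y - 1` offset assumes B is a 1-based character position, which is what the expected output in the question implies; the variable names `a` and `b` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"],
    'B': [3, 2, 6],
})

# B is treated as a 1-based position, so each string is sliced from index B - 1.
df['C'] = [a[b - 1:] for a, b in zip(df['A'], df['B'])]

print(df['C'].tolist())
# ['xdddrrrfvhdddfff', 'rdyuuweewy', 'reghtyuj']
```

If B were meant as a 0-based index instead, the slice would be `a[b:]`, which drops one more character from each string.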

I tested using 3,000 rows and 1,000 iterations:

import time
import pandas as pd

# The 3 sample rows repeated 1000 times give 3000 rows
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []

# Time the list comprehension over the zipped columns
for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)

# Time the row-wise apply doing the same slice
for i in range(1000):
    start = time.time()
    df['C'] = df.apply(lambda x: x['A'][x['B']-1:], axis=1)
    end = time.time()
    times_apply.append(end-start)

The average time per execution using apply was (in seconds):

0.035329506397247315

Whereas the average time using zip was (in seconds):

0.0006626224517822265
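
The timing code above does not show how the averages were derived. A minimal sketch, assuming they are plain arithmetic means of the per-run timings: `average_seconds`, `with_zip` and `with_apply` are hypothetical helper names, `time.perf_counter` is swapped in for `time.time`, and the repeat count is lowered to 5 so the comparison finishes quickly. Absolute numbers will differ by machine.

```python
import time
from statistics import mean

import pandas as pd

df = pd.DataFrame({
    'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"] * 1000,
    'B': [3, 2, 6] * 1000,
})


def average_seconds(func, repeats=5):
    """Hypothetical helper: mean wall-clock time of func over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def with_zip():
    # List comprehension over the two columns, 1-based offset
    return [a[b - 1:] for a, b in zip(df['A'], df['B'])]


def with_apply():
    # Row-wise apply performing the same slice
    return df.apply(lambda row: row['A'][row['B'] - 1:], axis=1)


avg_zip = average_seconds(with_zip)
avg_apply = average_seconds(with_apply)
print(f"zip:   {avg_zip:.6f} s")
print(f"apply: {avg_apply:.6f} s")
print(f"ratio: {avg_apply / avg_zip:.1f}x")
```

For the two averages reported above, the ratio works out to roughly 53x in favour of zip.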

Solution 1
1 accepted 2022-08-02 15:52:53
