Subsetting strings present in a column of a dataframe, depending on value of another column - Pandas

I have a dataframe with two columns, A and B, containing strings and integers respectively. For example, consider the following data.

import pandas as pd
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})

Now I have to create another column C where, for each index i, df['C'][i] contains the string s, where s is the substring of df['A'][i] starting from its df['B'][i]-th character (counting from 1). For the above example the output will be:

A                     B    C
xxxdddrrrfvhdddfff    3    xdddrrrfvhdddfff
trdyuuweewy           2    rdyuuweewy
oooeereghtyuj         6    reghtyuj


This can be done very easily using lambdas or for loops.

My attempt:

df['C'] = df.apply(lambda x: x['A'][x['B']-1:], axis=1)

But my dataset is huge (around 5 million rows), so using loops or lambdas is not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.

You can avoid pandas apply and make this more efficient with a plain Python list comprehension. Try the following:

df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
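As a quick sanity check, the list comprehension can be run on the three sample rows from the question and compared with the expected output. This is only a minimal sketch: the variable names s and start and the expected list are introduced here for illustration, and it assumes column A has no missing values (slicing a NaN would raise a TypeError).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"],
    'B': [3, 2, 6],
})

# B is treated as a 1-based position, so the slice starts at index B - 1.
df['C'] = [s[start - 1:] for s, start in zip(df['A'], df['B'])]

# Expected column C, copied from the table in the question.
expected = ["xdddrrrfvhdddfff", "rdyuuweewy", "reghtyuj"]
assert df['C'].tolist() == expected
print(df)
```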

I tested using 3,000 rows (the three sample rows repeated 1,000 times) and 1,000 iterations:

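import time
import pandas as pd

# 3 sample rows repeated 1000 times = 3000 rows
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []

# Time the list comprehension over the zipped columns
for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)

# Time the row-wise apply, using the same B-1 slice so both build the same column
for i in range(1000):
    start = time.time()
    df['C'] = df.apply(lambda x: x.A[x['B']-1:], axis=1)
    end = time.time()
    times_apply.append(end-start)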

The average time per execution (in seconds) using apply was:

0.035329506397247315

Whereas the average time per execution (in seconds) using zip was:

0.0006626224517822265
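Taken at face value, these two averages put apply at roughly 53 times slower than the zip version on this data. The comparison could also be reproduced with the standard-library timeit module, as in the sketch below. The helper names build_frame, with_zip and with_apply are hypothetical, the run count is kept small so the script finishes quickly, and absolute timings will differ from machine to machine.

```python
import timeit

import pandas as pd


def build_frame(repeats=1000):
    """Repeat the three sample rows from the question `repeats` times."""
    return pd.DataFrame({
        'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"] * repeats,
        'B': [3, 2, 6] * repeats,
    })


def with_zip(df):
    # List comprehension over the two columns (B is a 1-based position).
    return [s[b - 1:] for s, b in zip(df['A'], df['B'])]


def with_apply(df):
    # Row-wise apply computing the same slice.
    return df.apply(lambda row: row['A'][row['B'] - 1:], axis=1)


df = build_frame()
n_runs = 5  # small on purpose; raise it for steadier averages

zip_avg = timeit.timeit(lambda: with_zip(df), number=n_runs) / n_runs
apply_avg = timeit.timeit(lambda: with_apply(df), number=n_runs) / n_runs

# Ratio of the two averages reported in the answer.
reported_ratio = 0.035329506397247315 / 0.0006626224517822265

print(f"zip:   {zip_avg:.6f} s per run")
print(f"apply: {apply_avg:.6f} s per run")
print(f"reported apply/zip ratio: {reported_ratio:.1f}")
```

Both helpers return the same values, so either result can be assigned to df['C'].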
