简体   繁体   中英

Subsetting strings present in a column of a dataframe, depending on value of another column - Pandas

I have a dataframe having 2 columns A and B say, containing strings and integers respectively. For example, consider the following data.

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})

Now, I have to create another column C, where for each index i, df['C'][i] will contain the string s, where s is the string staring from the df['B'][i] -th character of the string df['A'][i] . For the above example the output will be:

            A         B                C
xxxdddrrrfvhdddfff    3    xdddrrrfvhdddfff
trdyuuweewy           2    rdyuuweewy 
oooeereghtyuj         6    reghtyuj
  

This can be done using lambdas or for loops very easily.

My attempt:

df['C']=df.apply(lambda x: xA[x['B']:], axis=1)

But my dataset is huge in size (contains around 5 million rows) - so using loops or lambdas are not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.

You can avoid using pandas apply and make it more efficient using native python. Kindly try the following:

df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]

I tested using 30000 rows and 1000 iterations:

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []

for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)
    
for i in range(1000):
    start = time.time()
    df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)
    end = time.time()
    times_apply.append(end-start)

The average time per execution using apply is:

0.035329506397247315

Whereas the average time using zip was:

0.0006626224517822265

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM