Subset pandas DataFrame using list comprehension

Question

I have a data frame A with a column called text that holds long strings. I want to keep only the rows of A whose text contains any of the values in a list called author_id.

A data frame:
Dialogue Index  author_id   text
10190       0    573660    How is that even possible?
10190       1    23442     @573660 I do apologize. 
10190       2    573661    @AAA do you still have the program for free checked bags? 

author_id list:
[573660, 573678, 5736987]

Since 573660 is in the author_id list and appears in the text column of A, my expected outcome is to keep only the second row of A:

 Dialogue   Index   author_id   text
 10190        1       23442     @573660 I do apologize.

The most naive solution I can think of is:

 new_A = pd.DataFrame()
 for id in author_id:
     # collect the rows whose text mentions this id
     new_A = pd.concat([new_A, A[A['text'].str.contains(str(id), na=False)]])

but this will take a long time.

So I came up with this instead:

[id in text for id in author_id for text in df['text'] ]

But this doesn't work for subsetting the data frame, because it produces a True/False value for every string in df['text'] for each author id.

So I created a new column that combines Dialogue and Index, so that I could return it from the list comprehension. That gave an error I don't know how to interpret.

A["DialogueIndex"]= df["Dialogue"].map(str) + df["Index"]

newA = [did for did in df["DialogueIndex"]  for id in author_id if df['text'].str.contains(id)  ]

error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Please help.

Answer 1

Use str.contains to check whether text contains any of the ids in your list. Joining the ids with | builds a regular expression that matches any one of them:

import pandas as pd
df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0,1,2],
    'author_id': [573660,23442,573661],
    'text': ['How is that even possible?', 
             '@573660 I do apologize.',
            '@AAA do you still have the program for free checked bags?']
})
author_id_list = [573660, 573678, 5736987]

df.text.str.contains('|'.join(list(map(str, author_id_list))))
#0    False
#1     True
#2    False
#Name: text, dtype: bool

Then you can use that boolean mask to filter the original DataFrame:

df[df.text.str.contains('|'.join(list(map(str, author_id_list))))]
#   Dialogue  Index  author_id                     text
#1     10190      1      23442  @573660 I do apologize.

If author_id_list already contains strings, you can drop the list(map(...)) and join the original list directly.
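A slightly more defensive variant of the same idea, as a sketch: it escapes each id with re.escape (only matters if the list ever holds strings with regex metacharacters) and passes na=False so that missing text counts as no match. The names pattern, mask and result are illustrative, not from the question.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})
author_id_list = [573660, 573678, 5736987]

# Escape each id so any regex metacharacters are matched literally,
# then join with | to build an "any of these" pattern.
pattern = '|'.join(re.escape(str(a)) for a in author_id_list)

# na=False makes rows with missing text evaluate to False instead of NaN.
mask = df['text'].str.contains(pattern, na=False)
result = df[mask]
```

Note that this is still a substring match, so an id such as 573660 would also match inside a longer number like 5736601.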

Answer 2

You could use apply to check, for each text, whether any item in author_id_list appears in it:

df[df.text.apply(lambda x: any(str(e) in x for e in author_id_list))]


   Dialogue  Index  author_id                     text
1     10190      1      23442  @573660 I do apologize.

There may be a faster way to do this, but I believe this gives you the result you are looking for.
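On the list comprehension from the question: df['text'].str.contains(id) returns a whole boolean Series, and putting a Series in an if condition is what raises the "truth value of a Series is ambiguous" error. A minimal sketch of a comprehension that works, assuming text has no missing values, builds one boolean per row instead (ids, mask and subset are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})
author_id_list = [573660, 573678, 5736987]

# Convert the ids once so the substring test compares str with str.
ids = [str(a) for a in author_id_list]

# One True/False per row: does this row's text mention any id?
mask = [any(i in text for i in ids) for text in df['text']]

# A plain list of booleans with one entry per row works as a row filter.
subset = df[mask]
```

The outer loop runs over the rows and the inner any(...) runs over the ids, so the mask has exactly one entry per row, which is what boolean indexing needs.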

solution1
0 ACCEPTED 2018-08-23 20:13:46

solution2
0 2018-08-23 20:39:22
