
Subset a pandas DataFrame using a list comprehension

I have a data frame A with a column called text that contains long strings. I want to keep the rows of A whose text contains any of the strings in a list called author_id.

A data frame:
Dialogue   Index   author_id   text
10190      0       573660      How is that even possible?
10190      1       23442       @573660 I do apologize.
10190      2       573661      @AAA do you still have the program for free checked bags?

author_id list:
[573660, 573678, 5736987]

So since 573660 is in the author_id list and is in the text column of A, my expected outcome would be to keep only the second row of the data frame A:

 Dialogue   Index   author_id   text
 10190        1       23442     @573660 I do apologize. 

The most naive solution I can think of is:

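 new_A = pd.DataFrame()
 for id in author_id:
     # append never modified new_A in place (and was removed in pandas 2.0), so reassign via pd.concat
     new_A = pd.concat([new_A, A[A['text'].str.contains(str(id), na=False)]])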

but this will take a long time.

So I came up with this solution:

[id in text for id in author_id for text in df['text'] ]

But this doesn't work for subsetting the data frame, because for each author id I get a True/False value for every string in df['text'].

So I created a new column in the data frame that combines Dialogue and Index, so that I could return it from the list comprehension, but it gave an error I don't know how to interpret.

df["DialogueIndex"] = df["Dialogue"].map(str) + df["Index"].map(str)

newA = [did for did in df["DialogueIndex"]  for id in author_id if df['text'].str.contains(id)  ]

error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Please help.

Simply use str.contains to check whether text contains any of the author IDs in your list. Joining the IDs with | builds a single regular expression that matches any one of them.

import pandas as pd
df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0,1,2],
    'author_id': [573660,23442,573661],
    'text': ['How is that even possible?', 
             '@573660 I do apologize.',
            '@AAA do you still have the program for free checked bags?']
})
author_id_list = [573660, 573678, 5736987]

df.text.str.contains('|'.join(list(map(str, author_id_list))))
#0    False
#1     True
#2    False
#Name: text, dtype: bool

Then you can just mask the original DataFrame:

df[df.text.str.contains('|'.join(list(map(str, author_id_list))))]
#   Dialogue  Index  author_id                     text
#1     10190      1      23442  @573660 I do apologize.

If your author_id_list is already strings, then you can get rid of the list(map(...)) and just join the original list.
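
A minimal sketch of that string case, assuming the IDs arrive as strings. The names author_ids_str and pattern are illustrative and not from the question. Because str.contains treats the pattern as a regular expression by default, escaping each ID with re.escape is a cautious extra step; it changes nothing for purely numeric IDs.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})

# Assumption: the IDs are already strings, so no map(str, ...) is needed.
author_ids_str = ['573660', '573678', '5736987']

# Escape each ID so any regex metacharacters are matched literally,
# then join with | (regex alternation).
pattern = '|'.join(re.escape(a) for a in author_ids_str)

# na=False treats missing text as "no match" instead of propagating NaN.
mask = df['text'].str.contains(pattern, na=False)
result = df[mask]
```

Keep in mind that this is a substring match, so an ID such as 573660 would also match inside a longer number like 5736601.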

You could use apply to check, for each row, whether any item in author_id_list appears in the text:

df[df.text.apply(lambda x: any(str(e) in x for e in author_id_list))]


#   Dialogue  Index  author_id                     text
#1     10190      1      23442  @573660 I do apologize.

There may be a faster way to do this, but I believe this will get you the answer you are looking for.
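
One possible alternative, closer to the list comprehension attempted in the question, is to build the Boolean mask in plain Python with exactly one entry per row. This is only a sketch: ids and mask are illustrative names, and whether it is actually faster than apply depends on the data, so it is worth timing on your own frame.

```python
import pandas as pd

df = pd.DataFrame({
    'Dialogue': [10190, 10190, 10190],
    'Index': [0, 1, 2],
    'author_id': [573660, 23442, 573661],
    'text': ['How is that even possible?',
             '@573660 I do apologize.',
             '@AAA do you still have the program for free checked bags?'],
})
author_id_list = [573660, 573678, 5736987]

# Convert the IDs to strings once, outside the per-row loop.
ids = [str(e) for e in author_id_list]

# One Boolean per row: True if any ID occurs as a substring of the text.
# Non-string values (such as NaN) are treated as "no match".
mask = [isinstance(t, str) and any(i in t for i in ids) for t in df['text']]

# A list of Booleans with the same length as df works as a row mask.
result = df[mask]
```

The key difference from the comprehension in the question is that any(...) collapses the inner loop over IDs, so the mask has one value per row rather than one per (ID, row) pair.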
