
Filtering rows by lists in pandas dataframe

I would like to filter my dataset as follows based on two lists:

list_1=['important', 'important words', 'terms to have','limone','harry']
list_2=['additional','extra','terms','to check','estate']

list_1 holds the terms that I really need to have in my rows; list_2 holds some desirable extra terms that I might be interested in. I think the solution should be a mix of & and | conditions, but I have not been able to filter the rows.

If I have

Date        Head                                   Text         
03/01/2020  Estate in vacanza              marea: cosa fare in caso di ...
03/01/2020  Cosa mangiare in estate        il limone è una spezia molto usata durante il periodo estivo
03/01/2020  NaN                            tutti pazzi per l'estate: “pronto, ma se apro le finestre per arieggiare...
03/01/2020  Harry torna in UK              il principe harry torna a buckingham palace in estate...
03/01/2020  Consigli per l'estate          Estate come proteggersi -

As you can see, the word estate occurs in almost all the rows. I need this word, but I also need to consider rows containing 'limone' or 'harry'. So I would like to filter as follows:

estate + limone # to avoid confusion I mean select estate AND limone

or

estate + harry # to avoid confusion I mean select estate AND harry

within Head and/or Text. I do not care if estate is in Head and limone is in Text, but both words (or estate + harry) must appear in the same row, whether in two columns or in one. I know from one of my previous questions that I should use apply, something like

df[['Head','Text']].apply(lambda x : x.str.contains(something)).any(axis=1)

but I am having difficulty adding the condition estate + limone or estate + harry, given two separate lists (as at the top of the question). I am currently filtering twice:

df=df[df[['Head', 'Text']].apply(lambda x : x.str.contains('|'.join(list_1))).any(axis=1)]
df=df[df[['Head', 'Text']].apply(lambda x : x.str.contains('|'.join(list_2))).any(axis=1)]

Is there any way to combine these two statements into one?

Output:

 Date       Head                                   Text         
 03/01/2020 Cosa mangiare in estate        il limone è una spezia molto usata durante il periodo estivo
 03/01/2020 Harry torna in UK              il principe harry torna a buckingham palace in estate...

I would appreciate it if you could explain to me how to set this condition in the above line of code.

I hope I understand the case correctly: we have a list of 'mandatory' words (if they are not present, the entire row is not relevant), and a list of 'desirable' words. Maybe you could do an inner join to find rows that contain both mandatory and desirable terms:

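# Join Head and Text per row; fillna('') keeps rows whose Head or Text is NaN
combined = df.Head.fillna('') + ' ' + df.Text.fillna('')
mandatory = df[combined.str.contains('|'.join(mandatory_words))]
desirable = df[combined.str.contains('|'.join(desirable_words))]
mandatory_and_desirable = pd.merge(mandatory, desirable, how='inner')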

All together:

combined = df.Head.fillna('') + ' ' + df.Text.fillna('')
mandatory_and_desirable = pd.merge(
    df[combined.str.contains('|'.join(mandatory_words))],
    df[combined.str.contains('|'.join(desirable_words))],
    how='inner'
)

Be aware that this is case sensitive.
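If case should be ignored, str.contains accepts case=False, and na=False treats missing values as non-matches. A minimal sketch on a made-up Series (the values are illustrative, loosely modelled on the Head column, and not the real data):

```python
import pandas as pd

# Illustrative values only; None stands in for a missing Head
s = pd.Series(["Estate in vacanza", "Cosa mangiare in estate", None])

# Default behaviour: "Estate" with a capital E does not match "estate"
case_sensitive = s.str.contains("estate", na=False)

# case=False ignores case; na=False maps missing values to False
case_insensitive = s.str.contains("estate", case=False, na=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

The same two keyword arguments can be added to the str.contains calls in the filters above.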

The first (step-by-step) version is more useful if you also need to analyze the rows that contain mandatory words on their own. The second (all-in-one) version is maybe less useful, because there mandatory and 'desirable' words end up equivalent (both need to be present).
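As an alternative to the merge, the two filters from the question could probably be combined into a single boolean mask joined with &. The sketch below is only one way to do it: the helper name any_term is hypothetical, the DataFrame is rebuilt by hand from the question's sample (with the truncated texts abbreviated), and matching is made case-insensitive as an assumption.

```python
import re
import pandas as pd

list_1 = ['important', 'important words', 'terms to have', 'limone', 'harry']
list_2 = ['additional', 'extra', 'terms', 'to check', 'estate']

# Sample rows retyped from the question; long texts are abbreviated
df = pd.DataFrame({
    'Date': ['03/01/2020'] * 5,
    'Head': ['Estate in vacanza', 'Cosa mangiare in estate', None,
             'Harry torna in UK', "Consigli per l'estate"],
    'Text': ['marea: cosa fare in caso di ...',
             'il limone è una spezia molto usata durante il periodo estivo',
             "tutti pazzi per l'estate: ...",
             'il principe harry torna a buckingham palace in estate...',
             'Estate come proteggersi -'],
})

def any_term(frame, terms):
    """Hypothetical helper: True for rows where any column of `frame`
    contains at least one of `terms` (case-insensitive, NaN counts as no match)."""
    # re.escape keeps regex metacharacters in the terms from being interpreted
    pattern = '|'.join(re.escape(t) for t in terms)
    return frame.apply(
        lambda col: col.str.contains(pattern, case=False, na=False)
    ).any(axis=1)

cols = df[['Head', 'Text']]

# A row is kept only if it matches a list_1 term AND a list_2 term,
# regardless of which of the two columns each term appears in
result = df[any_term(cols, list_1) & any_term(cols, list_2)]
print(result[['Head', 'Text']])
```

On this sample the mask keeps the 'limone' row and the 'harry' row, which is the output requested in the question. Each mask is computed over both columns before they are combined, so estate may sit in Head while limone sits in Text.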
