
Grepping rows of a Pandas data frame based on a list

I have the following data frame:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
1   1448678_at Fam118a     25.3      150        2
2  1452580_a_at Mrpl21      3.1      173       12

It's created using this code:

import pandas as pd
df = pd.DataFrame({'ProbeGenes' : ['1431492_at Lipn', '1448678_at Fam118a','1452580_a_at Mrpl21'],
                   'sample1' : [20.3, 25.3,3.1],
                   'sample2' : [130, 150,173],        
                   'sample3' : [1.0, 2.0,12.0],         
                   })

Then, given a list:

list_to_grep = ["Mrpl21","lipn","XXX"]

I would like to extract (grep) the subset of df whose ProbeGenes values contain any member of list_to_grep, yielding:

            ProbeGenes  sample1  sample2  sample3
      1431492_at Lipn     20.3      130        1
  1452580_a_at Mrpl21      3.1      173       12

Ideally the grepping is in case-insensitive mode. How can I achieve that?

Your example doesn't really need the use of regular expressions.

Define a function that returns whether a given string contains any element of the list.

list_to_grep = ['Mrpl21', 'lipn', 'XXX']
def _grep(x, list_to_grep):
    """takes a string (x) and checks whether any string from a given 
       list of strings (list_to_grep) exists in `x`"""
    for text in list_to_grep:
        if text.lower() in x.lower():
            return True
    return False

Create a mask:

mask = df.ProbeGenes.apply(_grep, list_to_grep=list_to_grep)

Filter the data frame using this mask:

df[mask]

This outputs:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
2  1452580_a_at Mrpl21      3.1      173       12
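If a regular-expression route is acceptable after all, the same filter can probably be written without a custom function. The following is a minimal sketch rather than part of the original answer: it assumes every entry of list_to_grep should be matched literally (hence re.escape), and the names pattern, str_mask and subset are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'ProbeGenes': ['1431492_at Lipn', '1448678_at Fam118a', '1452580_a_at Mrpl21'],
    'sample1': [20.3, 25.3, 3.1],
    'sample2': [130, 150, 173],
    'sample3': [1.0, 2.0, 12.0],
})
list_to_grep = ['Mrpl21', 'lipn', 'XXX']

# Escape each term so it is matched literally, then join the terms
# into a single alternation: 'Mrpl21|lipn|XXX'.
pattern = '|'.join(re.escape(term) for term in list_to_grep)

# case=False makes the match case-insensitive;
# na=False treats missing values as non-matches.
str_mask = df['ProbeGenes'].str.contains(pattern, case=False, na=False)

# Hypothetical name for the filtered result.
subset = df[str_mask]
print(subset)
```

On the sample data this should keep the same two rows (index 0 and 2) as the `_grep` mask.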

Note: this works well for small datasets, but I have seen unreasonably long run times when applying functions to text columns in big data frames (~10 GB). Applying the same function to a plain list takes much less time, and I don't know why.

For reasons that are beyond me, something like this filters much faster:

>>> from functools import partial
>>> mylist = df.ProbeGenes.tolist()
>>> _greppy = partial(_grep, list_to_grep=list_to_grep)
>>> mymask = list(map(_greppy, mylist))
>>> df[mymask]
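For reference, here is a self-contained variant of the list-based approach above, written as a script instead of a REPL session. It is only a sketch: contains_any, needles, list_mask and filtered are hypothetical names, and lowercasing the search terms once up front is my own small change, not something the snippet above does. No speed-up is claimed or measured here.

```python
import pandas as pd

df = pd.DataFrame({
    'ProbeGenes': ['1431492_at Lipn', '1448678_at Fam118a', '1452580_a_at Mrpl21'],
    'sample1': [20.3, 25.3, 3.1],
    'sample2': [130, 150, 173],
    'sample3': [1.0, 2.0, 12.0],
})
list_to_grep = ['Mrpl21', 'lipn', 'XXX']

# Lowercase the search terms once instead of on every row.
needles = [term.lower() for term in list_to_grep]

def contains_any(text, needles):
    """Return True if any (already lowercased) needle occurs in text, ignoring case."""
    haystack = text.lower()
    return any(needle in haystack for needle in needles)

# Build the boolean mask from a plain Python list rather than via Series.apply.
list_mask = [contains_any(text, needles) for text in df['ProbeGenes'].tolist()]

# A list of booleans with one entry per row can index the data frame directly.
filtered = df[list_mask]
print(filtered)
```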
