Grepping rows of a Pandas data frame based on a list

Question

I have the following data frame:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
1   1448678_at Fam118a     25.3      150        2
2  1452580_a_at Mrpl21      3.1      173       12

It's created using this code:

import pandas as pd
df = pd.DataFrame({'ProbeGenes': ['1431492_at Lipn', '1448678_at Fam118a', '1452580_a_at Mrpl21'],
                   'sample1': [20.3, 25.3, 3.1],
                   'sample2': [130, 150, 173],
                   'sample3': [1.0, 2.0, 12.0],
                   })

Then, given a list:

list_to_grep = ["Mrpl21","lipn","XXX"]

I would like to extract (grep) the subset of df whose ProbeGenes value contains any member of list_to_grep, yielding:

            ProbeGenes  sample1  sample2  sample3
      1431492_at Lipn     20.3      130        1
  1452580_a_at Mrpl21      3.1      173       12

Ideally, the matching should be case-insensitive. How can I achieve that?

Answer 1

Your example doesn't really need the use of regular expressions.

Define a function that returns whether a given string contains any element of the list.

list_to_grep = ['Mrpl21', 'lipn', 'XXX']
def _grep(x, list_to_grep):
    """takes a string (x) and checks whether any string from a given 
       list of strings (list_to_grep) exists in `x`"""
    for text in list_to_grep:
        if text.lower() in x.lower():
            return True
    return False

Create a mask:

mask = df.ProbeGenes.apply(_grep, list_to_grep=list_to_grep)

Filter the data frame using this mask:

df[mask]

This outputs:

            ProbeGenes  sample1  sample2  sample3
0      1431492_at Lipn     20.3      130        1
2  1452580_a_at Mrpl21      3.1      173       12
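
If you would rather avoid a Python-level function, here is a minimal sketch of a different technique: pandas' built-in `Series.str.contains` with one case-insensitive pattern. Note that this one does use a regular expression, unlike the function above. The names `pattern`, `mask_vectorized` and `subset` are mine, not from the question, and I have only tried this on the example frame.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'ProbeGenes': ['1431492_at Lipn', '1448678_at Fam118a', '1452580_a_at Mrpl21'],
    'sample1': [20.3, 25.3, 3.1],
    'sample2': [130, 150, 173],
    'sample3': [1.0, 2.0, 12.0],
})
list_to_grep = ['Mrpl21', 'lipn', 'XXX']

# Escape each term so it is matched literally, then join the terms
# into a single alternation: 'Mrpl21|lipn|XXX'.
pattern = '|'.join(re.escape(term) for term in list_to_grep)

# case=False makes the match case-insensitive;
# na=False treats missing values as non-matches.
mask_vectorized = df['ProbeGenes'].str.contains(pattern, case=False, na=False)

subset = df[mask_vectorized]
print(subset)
```

One caveat: an empty `list_to_grep` produces an empty pattern, which matches every row, so you may want to guard against that case.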

Note: this works well for small datasets, but I've seen unreasonably long run times when applying functions to text columns in big data frames (~10 GB), whereas applying the same function to a plain list takes much less time. I don't know why.

For reasons that are beyond me, something like this lets me filter much faster:

>>> from functools import partial
>>> mylist = df.ProbeGenes.tolist()
>>> _greppy = partial(_grep, list_to_grep=list_to_grep)
>>> mymask = list(map(_greppy, mylist))
>>> df[mymask]
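
If you want to check the speed difference on your own data, a rough sketch like the one below compares the two approaches. The frame `big` is a hypothetical stand-in built by repeating the example rows, and the repeat count and number of timing runs are arbitrary choices of mine. Timings depend heavily on the machine and pandas version, so treat the printed numbers as indicative only.

```python
import timeit
import pandas as pd

list_to_grep = ['Mrpl21', 'lipn', 'XXX']

def _grep(x, list_to_grep):
    """Return True if any string in list_to_grep occurs in x (case-insensitive)."""
    for text in list_to_grep:
        if text.lower() in x.lower():
            return True
    return False

# Hypothetical larger frame: the three example rows repeated 10,000 times.
rows = ['1431492_at Lipn', '1448678_at Fam118a', '1452580_a_at Mrpl21']
big = pd.DataFrame({'ProbeGenes': rows * 10000})

# Approach 1: Series.apply, which forwards the keyword argument to _grep.
mask_apply = big['ProbeGenes'].apply(_grep, list_to_grep=list_to_grep)

# Approach 2: convert the column to a plain list and build the mask in Python.
mask_list = [_grep(x, list_to_grep) for x in big['ProbeGenes'].tolist()]

# Both approaches should select exactly the same rows.
assert mask_apply.tolist() == mask_list

# Time each approach over a few runs.
t_apply = timeit.timeit(
    lambda: big['ProbeGenes'].apply(_grep, list_to_grep=list_to_grep), number=3)
t_list = timeit.timeit(
    lambda: [_grep(x, list_to_grep) for x in big['ProbeGenes'].tolist()], number=3)
print(f"apply: {t_apply:.3f}s  list: {t_list:.3f}s")

filtered = big[mask_list]
```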

solution1
0 2015-05-11 03:15:37