How to use str methods inside pandas query()

There appears to be a right way and a wrong way to use str methods inside pandas query(). Why does the first query below work as expected while the second one fails?

>>> import pandas
>>> data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
...         'year': [2012, 2012, 2013, 2014, 2014],
...         'coverage': [25, 94, 57, 62, 70]}
>>> df = pandas.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
>>> print(df.query('name.str.slice(0,1)=="J"'))
              coverage   name  year
    Cochice         25  Jason  2012
    Maricopa        62   Jake  2014
>>> 
>>> print(df.query('name.str.startswith("J")'))
<lines omitted>
    TypeError: 'Series' objects are mutable, thus they cannot be hashed

One workaround is to append .values to the expression:

In [62]: df.query("name.str.startswith('J').values")
Out[62]:
          coverage   name  year
Cochice         25  Jason  2012
Maricopa        62   Jake  2014

Alternatively, you can specify engine='python':

In [63]: df.query("name.str.startswith('J')", engine='python')
Out[63]:
          coverage   name  year
Cochice         25  Jason  2012
Maricopa        62   Jake  2014
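
For reference, here is a minimal self-contained sketch of the engine='python' workaround as a plain script rather than an IPython session. It rebuilds the DataFrame from the question and checks the result against ordinary boolean-mask indexing. The variable names via_query and via_mask are only illustrative, and behavior of the default engine may differ between pandas versions, so the engine is passed explicitly here.

```python
import pandas as pd

# Same data as in the question.
data = {
    "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "year": [2012, 2012, 2013, 2014, 2014],
    "coverage": [25, 94, 57, 62, 70],
}
df = pd.DataFrame(
    data, index=["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"]
)

# Workaround from the answer: evaluate the expression with the Python engine.
via_query = df.query("name.str.startswith('J')", engine="python")

# The same filter written as a plain boolean mask, without query().
via_mask = df[df["name"].str.startswith("J")]

print(via_query)
print(via_mask.equals(via_query))
```

Both approaches should select the same two rows (Cochice and Maricopa), so the choice between them is mostly a matter of readability.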

Timing for a 500K-row DataFrame:

In [68]: df = pd.concat([df] * 10**5, ignore_index=True)

In [69]: df.shape
Out[69]: (500000, 3)

In [70]: %timeit df.query("name.str.startswith('J')", engine='python')
1 loop, best of 3: 583 ms per loop

In [71]: %timeit df.query("name.str.startswith('J').values")
1 loop, best of 3: 587 ms per loop

In [72]: %timeit df[df.name.str.startswith('J')]
1 loop, best of 3: 571 ms per loop

In [74]: %timeit df.query('name.str.slice(0,1)=="J"')
1 loop, best of 3: 482 ms per loop
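
Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is only a rough approximation of the %timeit runs above: the candidates dict and its labels are illustrative, the slice variant is written as a boolean mask rather than a query() string, and the numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

data = {
    "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "year": [2012, 2012, 2013, 2014, 2014],
    "coverage": [25, 94, 57, 62, 70],
}
small = pd.DataFrame(data)

# Repeat the 5-row frame 10**5 times to get 500,000 rows, as in the answer.
big = pd.concat([small] * 10**5, ignore_index=True)

# Illustrative labels mapped to the filtering call being timed.
candidates = {
    "query, engine='python'": lambda: big.query(
        "name.str.startswith('J')", engine="python"
    ),
    "boolean mask, startswith": lambda: big[big["name"].str.startswith("J")],
    "boolean mask, slice": lambda: big[big["name"].str.slice(0, 1) == "J"],
}

# Best of 3 single runs per candidate, in seconds.
timings = {
    label: min(timeit.repeat(func, number=1, repeat=3))
    for label, func in candidates.items()
}

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1000:.0f} ms")
```

As in the timings above, the differences between these variants are expected to be small, since each one spends most of its time in the same element-wise string operation.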
