Bug in pandas query() method?

I was experimenting with several use cases for the pandas query() method and tried one argument that threw an exception but also caused an unwanted modification to the data in my DataFrame.

In [549]: syn_fmax_sort
Out[549]: 
     build_number      name    fmax
0             390     adpcm  143.45
1             390       aes  309.60
2             390     dfadd  241.02
3             390     dfdiv   10.80
....
211           413     dfmul  215.98
212           413     dfsin   11.94
213           413       gsm  194.70
214           413      jpeg  197.75
215           413      mips  202.39
216           413     mpeg2  291.29
217           413       sha  243.19

[218 rows x 3 columns]

I wanted to use query() to take out the subset of this DataFrame containing all rows with a build_number of 392, so I tried:

In [550]: syn_fmax_sort.query('build_number = 392')

That threw a ValueError: cannot label index with a null key exception. Not only that, it also returned the full DataFrame to me and caused every build_number to be set to 392:

In [551]: syn_fmax_sort
Out[551]: 
     build_number      name    fmax
0             392     adpcm  143.45
1             392       aes  309.60
2             392     dfadd  241.02
3             392     dfdiv   10.80
....
211           392     dfmul  215.98
212           392     dfsin   11.94
213           392       gsm  194.70
214           392      jpeg  197.75
215           392      mips  202.39
216           392     mpeg2  291.29
217           392       sha  243.19

[218 rows x 3 columns]

However, I have since figured out how to get only the value 392: if I use syn_fmax_sort.query('391 < build_number < 393'), it works.
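A minimal sketch of why that workaround selects the expected rows, assuming build_number is an integer column. The small frame below is made up for illustration and is not the original data.

```python
import pandas as pd

# Hypothetical stand-in for syn_fmax_sort; the values are illustrative only.
df = pd.DataFrame({
    "build_number": [390, 391, 392, 392, 413],
    "name": ["adpcm", "aes", "dfadd", "dfdiv", "sha"],
})

# Chained comparison, as used in the workaround above.
chained = df.query("391 < build_number < 393")

# The same selection written as a plain boolean mask, without query().
masked = df[(df["build_number"] > 391) & (df["build_number"] < 393)]

print(chained)
print(chained.equals(masked))
```

For integer values, the open interval (391, 393) contains only 392, which is why the chained form behaves like an equality test here.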

So my question is: is the behavior I observed above, when I queried the DataFrame incorrectly, due to a bug in the query() method?

It looks like you had a typo: you probably wanted to use == rather than =. A simple example shows the same problem:

In [286]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5)})
df
Out[286]:
   a
0  0
1  1
2  2
3  3
4  4
In [287]:

df.query('a = 3')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-287-41cfa0572737> in <module>()
----> 1 df.query('a = 3')

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in query(self, expr, **kwargs)
   1923             # when res is multi-dimensional loc raises, but this is sometimes a
   1924             # valid query
-> 1925             return self[res]
   1926 
   1927     def eval(self, expr, **kwargs):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1778             return self._getitem_multilevel(key)
   1779         else:
-> 1780             return self._getitem_column(key)
   1781 
   1782     def _getitem_column(self, key):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   1785         # get column
   1786         if self.columns.is_unique:
-> 1787             return self._get_item_cache(key)
   1788 
   1789         # duplicate columns & possible reduce dimensionaility

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1066         res = cache.get(item)
   1067         if res is None:
-> 1068             values = self._data.get(item)
   1069             res = self._box_item_values(item, values)
   1070             cache[item] = res

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   2856                         loc = indexer.item()
   2857                     else:
-> 2858                         raise ValueError("cannot label index with a null key")
   2859 
   2860             return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

It looks like internally it's trying to build an index using your query; it then checks the length and, as it's 0, raises a ValueError (it probably should be a KeyError). I don't know how it evaluated your query, but perhaps assigning values to columns this way is unsupported at the moment.
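A minimal sketch that keeps the two operations apart, assuming a recent pandas release: == inside query() filters rows, while an assignment expression belongs in DataFrame.eval(), which returns a new frame unless inplace=True is passed. How query('a = 3') itself fails differs between pandas versions, so that call is left out here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# Filtering: == is a comparison, so query() returns only the matching rows.
matches = df.query("a == 3")

# Assignment: eval() accepts "a = 3" and, with the default inplace=False,
# returns a modified copy instead of touching df.
assigned = df.eval("a = 3")

print(matches)
print(assigned["a"].tolist())
print(df["a"].tolist())  # the original column is left as it was
```

If the goal is only to select rows, query() with == is enough; eval() is only needed when a column should actually be overwritten.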
