Pandas indexing, searching in dataframes

Problem solved

Using loc instead of iloc solves the problem, but I'm not sure why.


The dataframe is medium-sized, shape (80766, 19), and contains ints, floats and dates. While working with it, I noticed that my results looked strange. I started transforming and simplifying expressions to find the problem, and ran into a contradiction.

Using these two lines I got the same result (as expected):

import pandas
...
data_table[data_table[col_name] == 69][col_name]
data_table.iloc[data_table.index[data_table[col_name] == 69]][col_name]

Result:

23270    69
23271    69
         ..
25059    69
Name: BBCH, Length: 1790, dtype: int64

But when I changed the searched value to a higher one, the second line gave a completely incorrect result.

data_table[data_table[col_name] == 71][col_name]

gives the correct result:

39556    71
39557    71
         ..
41353    71
Name: BBCH, Length: 1798, dtype: int64

And for

data_table.iloc[data_table.index[data_table[col_name] == 71]][col_name]

the result is:

7336    30
7337    30
        ..
9133    30
Name: BBCH, Length: 1798, dtype: int64

My question is: why does this happen? Is it a problem with the size of the data?

As long as your index is a default RangeIndex, i.e., it runs 0, 1, 2, ... with no gaps, you can use loc and iloc interchangeably for lists of integers, e.g.,

>>> import pandas as pd
>>> s = pd.Series('foo', index=range(10))
>>> s
0    foo
1    foo
2    foo
3    foo
4    foo
5    foo
6    foo
7    foo
8    foo
9    foo
dtype: object
>>> s.loc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object

But s.loc[[1, 2, 7]] selects the rows that are labelled 1, 2, and 7, no matter their position, while s.iloc[[1, 2, 7]] extracts the rows at positional indices 1, 2, and 7. If you changed the order of the rows in s, loc would still give the same rows, but iloc would give whatever ends up in the second, third, and eighth rows.
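To make the reordering case concrete, here is a minimal sketch. The series of letters is made-up data (not from the question), chosen so that each row can be told apart by its value:

```python
import pandas as pd

# Hypothetical data: labels 0..9 with distinct values.
s = pd.Series(list("abcdefghij"), index=range(10))

# Reverse the row order; every value keeps its original label.
r = s.iloc[::-1]

by_label = r.loc[[1, 2, 7]]      # rows labelled 1, 2 and 7
by_position = r.iloc[[1, 2, 7]]  # 2nd, 3rd and 8th rows of the reversed series

print(by_label.tolist())           # ['b', 'c', 'h']
print(by_position.index.tolist())  # [8, 7, 2]
print(by_position.tolist())        # ['i', 'h', 'c']
```

After the reversal, loc still returns the rows labelled 1, 2 and 7, while iloc returns whichever rows now sit at those positions.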

If you modify your data so that the index is no longer a gap-free RangeIndex (i.e., some rows are missing), loc and iloc will give different results as soon as they select something that follows a "missing row". In the example below, with the modified s, the rows at positions 1 and 2 are still labelled 1 and 2, so both loc and iloc select them, but the eighth row is no longer labelled 7, but 9 (as we removed two rows in the middle).

>>> s = s.drop([3, 4])
>>> s
0    foo # position = 0
1    foo # 1
2    foo # 2
5    foo # 3 but label == 5!!
6    foo # 4 but label == 6
7    foo # etc.
8    foo
9    foo
dtype: object
>>> s.loc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1    foo
2    foo
9    foo # != 7 !!
dtype: object

That explains why your result was correct in the first case, while in the second case something had caused the index labels to be "out of sync" with the positional values (probably some dropped rows). Since you select by subsetting the labels in .index, you need loc, not iloc. (If you did a reset_index before subsetting, iloc would work again, because then the index would again be identical to the positions of the rows.)
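The pattern from the question can be reproduced on a small scale. This is only a sketch: the eight-row frame and the dropped rows are assumptions standing in for data_table, whose actual history is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for data_table: a BBCH column whose index labels
# stop matching the row positions once two rows are dropped.
df = pd.DataFrame({"BBCH": [30, 30, 69, 69, 71, 71, 30, 30]})
df = df.drop([0, 1])  # labels are now 2..7, positions are 0..5

# Quick check: is the index still the default 0..n-1 range?
aligned = df.index.equals(pd.RangeIndex(len(df)))  # False here

labels = df.index[df["BBCH"] == 71]  # labels 4 and 5

wrong = df.iloc[labels]["BBCH"]  # labels misread as positions
right = df.loc[labels, "BBCH"]   # label-based lookup

# Alternative: rebuild the index so labels equal positions again.
fixed = df.reset_index(drop=True)
also_right = fixed.iloc[fixed.index[fixed["BBCH"] == 71]]["BBCH"]

print(wrong.tolist())       # [30, 30]
print(right.tolist())       # [71, 71]
print(also_right.tolist())  # [71, 71]
```

Checking the index against pd.RangeIndex(len(df)) is one way to tell in advance whether labels and positions still agree. Where they do not, either loc or a prior reset_index(drop=True) avoids the mismatch.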
