Pandas indexing, searching in dataframes
Problem solved: using loc instead of iloc solves the problem, but I'm not sure why.
I have a medium-sized dataframe of shape (80766, 19), composed of ints, floats and dates. While working with it, I noticed my results were strange. I started transforming and simplifying expressions to see where the problem was, and ran into a contradiction.
Using these two lines I got the same result (as expected):
import pandas
...
data_table[data_table[col_name] == 69][col_name]
data_table.iloc[data_table.index[data_table[col_name] == 69]][col_name]
Result:
23270 69
23271 69
..
25059 69
Name: BBCH, Length: 1790, dtype: int64
But when I changed the searched value to a higher one, the second line gave a completely incorrect result.
data_table[data_table[col_name] == 71][col_name]
Gives the correct result:
39556 71
39557 71
..
41353 71
Name: BBCH, Length: 1798, dtype: int64
And for
data_table.iloc[data_table.index[data_table[col_name] == 71]][col_name]
the result is:
7336 30
7337 30
..
9133 30
Name: BBCH, Length: 1798, dtype: int64
My question is: why is that? Is it a problem with the size of the data?
As long as your index is a default RangeIndex, i.e., it starts at 0 and has no gaps, every label equals its position, so you can use loc and iloc interchangeably for a selection like this, e.g.,
>>> import pandas as pd
>>> s = pd.Series('foo', index=range(10))
>>> s
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
6 foo
7 foo
8 foo
9 foo
dtype: object
>>> s.loc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
But s.loc[[1, 2, 7]] selects the rows that are labelled 1, 2, and 7, no matter their position, while iloc extracts the rows that are at the positional indices 1, 2, and 7. If you changed the order of the rows in s, loc would still give the same rows, but iloc would give whatever ends up in the second, third, and eighth rows.
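The effect of reordering can be seen in a small sketch. The series below is hypothetical (it uses distinct letters instead of 'foo' so that the returned rows can be told apart), and reversing is just one convenient way to change the row order:

```python
import pandas as pd

# Hypothetical data: distinct values make it visible which rows come back.
s = pd.Series(list("abcdefghij"), index=range(10))

# Reverse the row order; each label travels with its value.
r = s.iloc[::-1]

by_label = r.loc[[1, 2, 7]]      # rows labelled 1, 2 and 7
by_position = r.iloc[[1, 2, 7]]  # second, third and eighth rows of r

print(by_label.tolist())           # ['b', 'c', 'h']
print(by_position.index.tolist())  # [8, 7, 2]
print(by_position.tolist())        # ['i', 'h', 'c']
```

loc returns the same values as it would on the original s, while iloc returns whichever rows now sit at those positions.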
If you modify your data so that its index is no longer a RangeIndex (i.e., there are rows missing, if you will), loc and iloc will give different results once they select something that follows a "missing row". Hence in the example below, with the modified s, the rows at positions 1 and 2 are still labelled 1 and 2, so they are selected by both loc and iloc, but the 8th row is no longer labelled 7, but 9 (as we removed two rows in the middle).
>>> s = s.drop([3, 4])
>>> s
0 foo # position = 0
1 foo # 1
2 foo # 2
5 foo # 3 but label == 5!!
6 foo # 4 but label == 6
7 foo # etc.
8 foo
9 foo
dtype: object
>>> s.loc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1 foo
2 foo
9 foo # != 7 !!
dtype: object
That explains why in the first case your result was correct, but in the second case something caused the labels of the index to be "out of sync" with the positional values (probably some dropped rows). As you selected by subsetting the labels of .index, you need loc, not iloc. (If you did a reset_index before subsetting, iloc would work again, because then the index would again be identical to the positions of the rows.)
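To make this concrete, here is a minimal sketch of the pattern from the question. The frame df and its values are made up for illustration (only the column name BBCH comes from the question), and dropping the first two rows is just one assumed way the labels could have drifted away from the positions:

```python
import pandas as pd

# Hypothetical stand-in for data_table: ten rows, the first two dropped,
# so the remaining labels (2..9) no longer match the positions (0..7).
df = pd.DataFrame({"BBCH": [30, 30, 69, 69, 71, 71, 80, 80, 80, 80]}).drop([0, 1])

labels = df.index[df["BBCH"] == 71]    # index labels of the matches: 4 and 5

by_label = df.loc[labels, "BBCH"]      # looks the labels up: the 71 rows
by_position = df.iloc[labels]["BBCH"]  # treats labels as positions: wrong rows

print(by_label.tolist())     # [71, 71]
print(by_position.tolist())  # [80, 80]

# After reset_index(drop=True), labels equal positions again, so iloc agrees.
clean = df.reset_index(drop=True)
again = clean.iloc[clean.index[clean["BBCH"] == 71]]["BBCH"]
print(again.tolist())        # [71, 71]
```

Note that drop=True discards the old labels instead of keeping them as a new column; whether that is acceptable depends on whether the original labels carry meaning in your data.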