When to reset index? loc vs iloc for gaps in index? Best practices?

I discovered a very subtle bug in my code. I frequently delete rows from a dataframe in my analysis. Because this leaves gaps in the index, I try to end every function by resetting the index with

df0 = df0.reset_index(drop=True)

Then I continue in the next function with

for row in range(df0.shape[0]):
    print(df0.loc[row])   # label-based lookup
    print(df0.iloc[row])  # position-based lookup

However, if I don't reset the index correctly, the first row might have an index label of 192, which is not the same as row number 0. The problem is that df0.loc[row] then accesses the row with index label 0, while df0.iloc[row] accesses the row with index label 192. This caused a very strange bug: I try to update row 0, but the row with index 192 gets updated instead. Or vice versa.

But in reality, I don't use df0.loc[] or df0.iloc[] at all because they are too slow. My code is riddled with df0.get_value(...) and df0.set_value(...) calls because they are the fastest way to access individual values.

And it seems that some of these functions access rows by index, and others by row number? I am confused. Can someone explain this to me? What are the best practices? Do some functions use the index to access values while others use row numbers? Have I misunderstood something? Should I call reset_index() as often as I can? Or never do that?

EDIT: To recap: I manually merge some rows in functions, so there will be gaps in the indices. In other functions I iterate over each row and do calculations. However, if I have reset the index, I get different calculation results than if I don't reset it. Why? That is my problem.

.loc[] looks at index labels, which may or may not be integer-valued.

  • If your index is [0, 1, 3] (a non-sequential integer index), .loc[2] won't find anything (pandas raises a KeyError), because there is no index label 2.
  • Similarly, if your index is ['a', 'b', 'c'] (a non-integer index), .loc[2] will come up empty for the same reason.

.iloc[] looks at index positions, which are always integer-valued.

  • If your index is [0, 1, 3], .iloc[2] will return the row corresponding to 3.
  • If your index is ['a', 'b', 'c'], .iloc[2] will return the row corresponding to 'c'.
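
A small sketch of the four cases above, assuming two hypothetical three-row frames (the names gappy and lettered and the column value are made up for illustration):

```python
import pandas as pd

# Hypothetical frames mirroring the two index examples above.
gappy = pd.DataFrame({"value": [10, 20, 30]}, index=[0, 1, 3])
lettered = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# .iloc is positional: position 2 is the third row in both frames.
third_gappy = gappy.iloc[2]        # the row labelled 3
third_lettered = lettered.iloc[2]  # the row labelled 'c'


def loc_fails(frame, label):
    """Return True if a label-based lookup cannot find `label`."""
    try:
        frame.loc[label]
    except (KeyError, TypeError):  # older pandas may raise TypeError here
        return True
    return False


print(third_gappy.name, third_lettered.name)        # 3 c
print(loc_fails(gappy, 2), loc_fails(lettered, 2))  # True True
```

The .name of each returned row is its index label, which makes it easy to see which row a positional lookup actually hit.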

That's not a bug; that's just how those indexers are designed. Whether one fits your purpose depends on the structure of your data and what you're trying to accomplish. It's hard to make a recommendation without knowing more.
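
The same label-versus-position split applies to scalar access. get_value and set_value were deprecated in pandas 0.21 and removed in 1.0; the scalar accessors in current pandas are .at (label-based) and .iat (position-based). A minimal sketch, assuming a hypothetical frame whose first row kept the label 192 after earlier deletions (the column name x is made up):

```python
import pandas as pd

# Hypothetical frame whose rows kept their old labels after deletions.
df0 = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[192, 0, 193])

# Label-based scalar access: the row whose index label is 0 (the second row).
by_label = df0.at[0, "x"]

# Position-based scalar access: the first row, whose label is 192.
by_position = df0.iat[0, df0.columns.get_loc("x")]

print(by_label, by_position)  # 2.0 1.0

# After reset_index(drop=True) the labels are 0..n-1 again,
# so label 0 and position 0 refer to the same row.
clean = df0.reset_index(drop=True)
same_after_reset = bool(clean.at[0, "x"] == clean.iat[0, 0])
print(same_after_reset)  # True
```

Mixing the two styles is only safe while labels and positions coincide, which is what reset_index(drop=True) restores. Picking one style and using it everywhere avoids depending on that.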

That said, it does sound like your code is getting kind of thorny. Having to call reset_index() in a bunch of different places and keep constant track of which row you're trying to update suggests that you may not be taking advantage of Pandas' ability to perform vector-based calculations across many rows and columns at once. Maybe the task you want to accomplish makes this inevitable. But it's worth taking some time to consider whether you can vectorize some of what you're doing, so that you can apply it to the whole dataframe or a subset of the dataframe, rather than operating on individual cells one at a time.
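
As a rough illustration of what that could look like, here is a sketch on made-up data (the columns price, qty and total are hypothetical, not taken from the question). Rows are dropped with a boolean mask, and the later steps work on whole columns, so the gap in the index never has to be repaired:

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df0 = pd.DataFrame({"price": [10.0, 12.0, 9.0, 15.0], "qty": [1, 0, 3, 2]})

# Drop rows with a boolean mask; the index now has a gap (label 1 is gone).
df0 = df0[df0["qty"] > 0].copy()

# Column arithmetic aligns on labels, so the gap does not matter.
df0["total"] = df0["price"] * df0["qty"]

# Update many rows at once by combining .loc with a mask.
df0.loc[df0["total"] > 20, "price"] = 0.0

# If a loop is unavoidable, iterate over the labels themselves
# instead of range(len(df0)), and stay with label-based access.
totals_by_label = {label: df0.at[label, "total"] for label in df0.index}

print(list(df0.index))  # [0, 2, 3]
```

Iterating over df0.index rather than over row numbers keeps every lookup label-based, so the results do not depend on whether the index was reset.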

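Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.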

 