pandas dataframe: len(df) is not equal to number of iterations in df.iterrows()
I have a dataframe where I want to print each row to a different file. When the dataframe consists of, e.g., only 50 rows, len(df) prints 50, and iterating over the rows of the dataframe like

    for index, row in df.iterrows():
        print(index)

prints the index from 0 to 49.
However, if my dataframe contains more than 50'000 rows, len(df) and the number of iterations when iterating over df.iterrows() differ significantly. For example, len(df) reports e.g. 50'554, while the printed index goes up to over 400'000.
How can this be? What am I missing here?
First, as @EdChum noted in the comment, your question's title refers to iterrows, but the example you give refers to iteritems, which loops in the direction orthogonal to the one len measures. I assume you meant iterrows (as in the title).
Note that a DataFrame's index need not be a running index, irrespective of the size of the DataFrame. For example:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])
>>> for index, row in df.iterrows():
...     print(index)
...
2
4
5
1000
Presumably, then, your long DataFrame was created differently or underwent some manipulation that affected the index.
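As a rough illustration of how such a manipulation might produce the mismatch, the sketch below filters a small, made-up frame (the column name 'a' and the filter condition are assumptions for the example, not taken from the question). The number of iterations still equals len; only the index labels stay large, and reset_index(drop=True) is one way to renumber them.

```python
import pandas as pd

# Hypothetical data: 10 rows with the default running index 0..9.
df = pd.DataFrame({'a': range(10)})

# Boolean filtering keeps the original index labels of the surviving rows.
filtered = df[df['a'] % 4 == 0]
labels = [index for index, row in filtered.iterrows()]
print(len(filtered), labels)

# reset_index(drop=True) discards the old labels and assigns a fresh running index.
renumbered = filtered.reset_index(drop=True)
new_labels = list(renumbered.index)
print(new_labels)
```

If the labels printed for your own DataFrame exceed len(df), comparing df.index with range(len(df)) is a quick way to check whether something similar happened.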
If you really must iterate with a running index, you can use Python's enumerate:
>>> for index, row in enumerate(df.iterrows()):
...     print(index)
...
0
1
2
3
(Note that, in this case, row is itself a tuple of the original index label and the row's Series.)
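Since each item yielded by iterrows is a (label, Series) pair, the tuple can be unpacked directly in the loop header. A minimal sketch, reusing the example frame from above (the names position and label are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])

# Collect (running position, original index label, value) for each row.
positions = []
for position, (label, row) in enumerate(df.iterrows()):
    positions.append((position, label, int(row['a'])))
    print(position, label, row['a'])
```

This gives access to both the running position and the original label without indexing into the tuple by hand.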