How to iterate over rows in a DataFrame in Pandas
I have a pandas dataframe, df:
c1 c2
0 10 100
1 11 110
2 12 120
How do I iterate over the rows of this dataframe? For every row, I want to be able to access its elements (values in cells) by the name of the columns. For example:
for row in df.rows:
    print(row['c1'], row['c2'])
I found a similar question which suggests using either of these:
for date, row in df.T.iteritems():
for row in df.iterrows():
But I do not understand what the row object is and how I can work with it.
DataFrame.iterrows is a generator which yields both the index and the row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows
for index, row in df.iterrows():
    print(row['c1'], row['c2'])
10 100
11 110
12 120
How to iterate over rows in a DataFrame in Pandas?
Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows, or you will have to get used to a lot of waiting.
Do you want to print a DataFrame? Use DataFrame.to_string().
Do you want to compute something? In that case, search for methods in this order (list modified from here):
1. Vectorization
2. Cython routines
3. List comprehensions (vanilla for loop)
4. DataFrame.apply(): i) reductions that can be performed in Cython, ii) iteration in Python space
5. DataFrame.itertuples() and iteritems()
6. DataFrame.iterrows()
iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.
Appeal to Authority
The documentation page on iteration has a huge red warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].
* It's actually a little more complicated than "don't". df.iterrows() is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you're not sure whether you need an iterative solution, you probably don't. PS: To know more about my rationale for writing this answer, skip to the very bottom.
A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
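For example, all of the following run without an explicit row loop (a minimal sketch using the question's data):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

df['c3'] = df['c1'] + df['c2']                 # arithmetic on whole columns
mask = df['c2'] > 100                          # vectorized comparison
total = df['c2'].sum()                         # reduction
means = df.groupby(df['c1'] % 2)['c2'].mean()  # groupby aggregation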
If none exists, feel free to write your own using custom Cython extensions.
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.
The formula is simple:
# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.
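For instance, a sketch with a hypothetical clip_ratio function, following the two-column recipe above:

def clip_ratio(x, y):
    # hypothetical business logic: ratio of x to y, capped at 1.0
    return min(x / y, 1.0)

df['ratio'] = [clip_ratio(x, y) for x, y in zip(df['c1'], df['c2'])]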
Caveats
List comprehensions assume that your data is easy to work with - what that means is that your data types are consistent and you don't have NaNs, but this cannot always be guaranteed. If your function needs values from multiple columns, use zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. As an example, if A is numeric and B is string, to_numpy() will cast the entire array to string, which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this. *Your mileage may vary for the reasons outlined in the Caveats section above.
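A quick demonstration of that upcasting (a minimal sketch; df2 is a hypothetical mixed-type frame):

import pandas as pd

df2 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

arr = df2[['A', 'B']].to_numpy()
print(arr.dtype)               # object - both columns collapsed to one common dtype

for a, b in zip(df2['A'], df2['B']):
    print(type(a), type(b))    # each column keeps its own type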
Let's demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.
Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).
I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
Most of the analyses performed on the various alternatives to the iter family have been through the lens of performance. However, in most situations you will typically be working on a reasonably sized dataset (nothing beyond a few thousand or 100K rows), and performance will come second to simplicity/readability of the solution.
Here is my personal preference when selecting a method to use for a problem.
For the novice:
1. Vectorization (when possible)
2. apply()
3. List comprehensions
4. itertuples()/iteritems()
5. iterrows()
6. Cython
For the more experienced:
1. Vectorization (when possible)
2. apply()
3. List comprehensions
4. Cython
5. itertuples()/iteritems()
6. iterrows()
Vectorization prevails as the most idiomatic method for any problem that can be vectorized. Always seek to vectorize! When in doubt, consult the docs, or look on Stack Overflow for an existing question on your particular task.
I do tend to go on about how bad apply is in a lot of my posts, but I do concede it is easier for a beginner to wrap their head around what it's doing. Additionally, there are quite a few use cases for apply, as explained in this post of mine.
Cython ranks lower down on the list because it takes more time and effort to pull off correctly. You will usually never need to write code with pandas that demands this level of performance that even a list comprehension cannot satisfy.
* As with any personal opinion, please take with heaps of salt!
10 Minutes to pandas, and Essential Basic Functionality - useful links that introduce you to Pandas and its library of vectorized*/cythonized functions.
Enhancing Performance - a primer from the documentation on enhancing standard Pandas operations
Are for-loops in pandas really bad? When should I care? - a detailed writeup by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data)
When should I (not) want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family). There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations.
* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it, and they end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.
The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.
First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives. If you still need to iterate over rows, you can use the methods below. Note some important caveats which are not mentioned in any of the other answers.
DataFrame.iterrows()
for index, row in df.iterrows():
    print(row["c1"], row["c2"])
DataFrame.itertuples()
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
itertuples() is supposed to be faster than iterrows().
But be aware, according to the docs (pandas 0.24.2 at the moment):
iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and which is generally much faster than iterrows().
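A minimal sketch of that dtype loss (dft is a hypothetical frame with mixed column dtypes):

import pandas as pd

dft = pd.DataFrame({'int_col': [1, 2], 'float_col': [0.5, 1.5]})

_, row = next(dft.iterrows())
print(row['int_col'])   # 1.0 - the whole row Series was upcast to float64

tup = next(dft.itertuples(index=False))
print(tup.int_col)      # 1 - the original int survives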
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
new_df = df.apply(lambda x: x * 2, axis=1)
itertuples: The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
See pandas docs on iteration for more details.
You should use df.iterrows(), though iterating row-by-row is not especially efficient, since Series objects have to be created.
While iterrows() is a good option, sometimes itertuples() can be much faster:
from numpy.random import randint, randn

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000), 'N': randint(100, 1000, (1000)), 'x': 'x'})
%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop
%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
You can also use df.apply() to iterate over rows and access multiple columns for a function.
docs: DataFrame.apply()
def valuation_formula(x, y):
    return x * y * 0.5
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
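The snippet assumes df already has x and y columns; a hypothetical setup to make it runnable:

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})  # hypothetical input
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
print(df['price'].tolist())  # [5.0, 20.0, 45.0]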
If you really have to iterate a Pandas dataframe, you will probably want to avoid using iterrows(). There are different methods, and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.
In short:
- Generally use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3)
- Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or '-'. See point (2)
- It is possible to use itertuples() even if your dataframe has strange columns, by using the last example. See point (4)
- Use iterrows() only if you cannot use the previous solutions. See point (1)
Generate a random dataframe with a million rows and 4 columns:
import numpy as np
import pandas as pd
import time

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)
1) The usual iterrows() is convenient, but damn slow:
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it doesn't work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot simply be converted to a Python variable name):
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
3) The default itertuples() using name=None is even faster, but not really convenient, as you have to define a variable per column.
start_time = time.perf_counter()
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples() is slower than the previous point, but you do not have to define a variable per column, and it works with column names such as My Col-Name is very Strange.
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
This article is a very interesting comparison between iterrows and itertuples
I was looking for how to iterate over rows and columns and ended up here, so:
for i, row in df.iterrows():
    for j, column in row.items():  # Series.iteritems() was removed in pandas 2.0
        print(column)
You can write your own iterator that implements namedtuple:
from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)
This is directly comparable to pd.DataFrame.itertuples. I'm aiming at performing the same task with more efficiency.
For the given dataframe with my function:
list(myiter(df))
[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]
Or with pd.DataFrame.itertuples:
list(df.itertuples(index=False))
[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]
A comprehensive test
We test making all columns available and subsetting the columns.
from timeit import timeit

import numpy as np
import pandas as pd

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
To loop over all rows in a dataframe you can use:
for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])
for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])
We have multiple options to do the same; lots of folks have shared their answers. I found the below two methods easy and efficient:
Example:
import pandas as pd

inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

# With the iterrows method
for index, row in df.iterrows():
    print(row["c1"], row["c2"])

# With the itertuples method
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
Note: itertuples() is supposed to be faster than iterrows()
Sometimes a useful pattern is:
# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)
Which results in:
{'col1':1.0, 'col2':0.1}
{'col1':2.0, 'col2':0.2}
Update: cs95 has updated his answer to include plain numpy vectorization. You can simply refer to his answer.
cs95 shows that Pandas vectorization far outperforms other Pandas methods for computing stuff with dataframes.
I wanted to add that if you first convert the dataframe to a NumPy array and then use vectorization, it's even faster than Pandas dataframe vectorization (and that includes the time to turn it back into a dataframe series).
If you add the following functions to cs95's benchmark code, this becomes pretty evident:
def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:, 0] + np_arr[:, 1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:, 0] + np_arr[:, 1]
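Dropped into the same kind of %timeit harness, the comparison looks like this (a sketch; the frame size and the A/B column names are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(100_000), 'B': np.random.rand(100_000)})

%timeit np_vectorization(df)       # NumPy add, plus the Series round-trip
%timeit just_np_vectorization(df)  # NumPy add only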
In short
To loop over all rows in a dataframe and use the values of each row conveniently, namedtuples can be converted to ndarrays. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
Iterating over the rows:
for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))
results in:
[ 1. 0.1]
[ 2. 0.2]
Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.
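You can see this directly with the question's df:

for row in df.itertuples(index=True):
    print(row[0], row[1:])   # row[0] is the index; the cell values follow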
There is a way to iterate over rows while getting a DataFrame in return, and not a Series. I don't see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:
for i in range(len(df)):
    row = df.iloc[[i]]
Note the usage of double brackets. This returns a DataFrame with a single row.
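You can verify the difference in return type directly:

print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'>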
For both viewing and modifying values, I would use iterrows(). In a for loop, using tuple unpacking (see the example: i, row), I use row for only viewing the value, and I use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.
for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        df.loc[i, 'A'] = 'New_value'  # modify through the DataFrame, not the row copy
Here, the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value'; it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
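A quick way to see this (a minimal sketch, assuming a column A as above):

for i, row in df.iterrows():
    row['A'] = 'New_Value'   # writes to the row copy, not (reliably) to df

print(df['A'].head())        # typically unchanged - hence the advice to use loc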
There are so many ways to iterate over the rows in a Pandas dataframe. One very simple and intuitive way is:
df = pd.DataFrame({'A':[1, 2, 3], 'B':[4, 5, 6], 'C':[7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])

    # For printing more than one column
    print(df.iloc[i, [0, 2]])
The easiest way: use the apply function
def print_row(row):
    print(row['c1'], row['c2'])

df.apply(lambda row: print_row(row), axis=1)
As many answers here correctly and clearly point out, you should not generally attempt to loop in Pandas, but rather should write vectorized code. But the question remains whether you should ever write loops in Pandas, and if so, what the best way to loop is in those situations.
I believe there is at least one general situation where loops are appropriate: when you need to calculate some function that depends on values in other rows in a somewhat complex manner. In this case, the looping code is often simpler, more readable, and less error prone than vectorized code. The looping code might even be faster, too.
I will attempt to show this with an example. Suppose you want to take a cumulative sum of a column, but reset it whenever some other column equals zero:
import pandas as pd
import numpy as np
df = pd.DataFrame( { 'x':[1,2,3,4,5,6], 'y':[1,1,1,0,1,1] } )
# x y desired_result
#0 1 1 1
#1 2 1 3
#2 3 1 6
#3 4 0 4
#4 5 1 9
#5 6 1 15
This is a good example where you could certainly write one line of Pandas to achieve this, although it's not especially readable, especially if you aren't fairly experienced with Pandas already:
df.groupby( (df.y==0).cumsum() )['x'].cumsum()
That's going to be fast enough for most situations, although you could also write faster code by avoiding the groupby, but it will likely be even less readable.
Alternatively, what if we write this as a loop? You could do something like the following with NumPy:
import numba as nb

@nb.jit(nopython=True)  # Optional
def custom_sum(x, y):
    x_sum = x.copy()
    for i in range(1, len(x)):  # bug fix: was len(df), which breaks nopython mode
        if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
    return x_sum

df['desired_result'] = custom_sum(df.x.to_numpy(), df.y.to_numpy())
Admittedly, there's a bit of overhead there required to convert DataFrame columns to NumPy arrays, but the core piece of code is just one line of code that you could read even if you didn't know anything about Pandas or NumPy:
if y[i] > 0: x_sum[i] = x_sum[i-1] + x[i]
And this code is actually faster than the vectorized code. In some quick tests with 100,000 rows, the above is about 10x faster than the groupby approach. Note that one key to the speed there is numba, which is optional. Without the "@nb.jit" line, the looping code is actually about 10x slower than the groupby approach.
Clearly this example is simple enough that you would likely prefer the one line of pandas to writing a loop with its associated overhead. However, there are more complex versions of this problem for which the readability or speed of the NumPy/numba loop approach likely makes sense.
df.iterrows() returns a tuple (a, b), where a is the index and b is the row.
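For example, with the question's df:

for a, b in df.iterrows():
    print(a)   # the index label
    print(b)   # the row, as a Series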
You can also do NumPy indexing for even greater speed-ups. It's not really iterating, but it works much better than iteration for certain applications.
subset = row['c1'][0:5]
all = row['c1'][:]
You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast.
np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) # Resize every image in an hdf5 file
Disclaimer: Although there are so many answers here which recommend not using an iterative (loop) approach (and I mostly agree), I would still see it as a reasonable approach for the following situation:
Let's say you have a large dataframe which contains incomplete user data. Now you have to extend this data with additional columns, for example, the user's age and gender.
Both values have to be fetched from a backend API. I'm assuming the API doesn't provide a "batch" endpoint (which would accept multiple user IDs at once). Otherwise, you should rather call the API only once.
The costs (waiting time) for the network request surpass the iteration of the dataframe by far. We're talking about network roundtrip times of hundreds of milliseconds, compared to the negligibly small gains in using alternative approaches to iteration.
So in this case, I would absolutely prefer using an iterative approach. Although the network request is expensive, it is guaranteed to be triggered only once for each row in the dataframe. Here is an example using DataFrame.iterrows:
for index, row in users_df.iterrows():
    user_id = row['user_id']

    # trigger expensive network request once for each row
    response_dict = backend_api.get(f'/api/user-data/{user_id}')

    # extend dataframe with multiple data from response
    users_df.at[index, 'age'] = response_dict.get('age')
    users_df.at[index, 'gender'] = response_dict.get('gender')
This example uses iloc to isolate each value in the data frame.
import pandas as pd
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
mjr = pd.DataFrame({'a':a, 'b':b})
size = mjr.shape
for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])
Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, when streaming data. To replicate the streaming nature, I 'stream' my dataframe values one by one; I wrote the below, which comes in handy from time to time.
from typing import Dict, List

class DataFrameReader:
    def __init__(self, df):
        self._df = df
        self._row = None
        self._columns = df.columns.tolist()
        self.reset()
        self.row_index = 0

    def __getattr__(self, key):
        return self.__getitem__(key)

    def read(self) -> bool:
        self._row = next(self._iterator, None)
        self.row_index += 1
        return self._row is not None

    def columns(self):
        return self._columns

    def reset(self) -> None:
        self._iterator = self._df.itertuples()

    def get_index(self):
        return self._row[0]

    def index(self):
        return self._row[0]

    def to_dict(self, columns: List[str] = None):
        return self.row(columns=columns)

    def tolist(self, cols) -> List[object]:
        return [self.__getitem__(c) for c in cols]

    def row(self, columns: List[str] = None) -> Dict[str, object]:
        cols = set(self._columns if columns is None else columns)
        return {c: self.__getitem__(c) for c in self._columns if c in cols}

    def __getitem__(self, key) -> object:
        # the df index of the row is at index 0
        try:
            if type(key) is list:
                ix = [self._columns.index(k) + 1 for k in key]  # bug fix: was index(key)
            else:
                ix = self._columns.index(key) + 1
            return self._row[ix]
        except BaseException:
            return None

    def __next__(self) -> 'DataFrameReader':
        if self.read():
            return self
        else:
            raise StopIteration

    def __iter__(self) -> 'DataFrameReader':
        return self
Which can be used:
for row in DataFrameReader(df):
    print(row.my_column_name)
    print(row.to_dict())
    print(row['my_column_name'])
    print(row.tolist(['my_column_name']))  # tolist() requires the column list
It preserves the values/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.
Along with the great answers in this post, I am going to propose a Divide and Conquer approach. I am not writing this answer to abolish the other great answers, but to fulfill them with another approach which was working efficiently for me. It has two steps: splitting and merging the pandas dataframe:
PROS of Divide and Conquer:
- iterrows() and itertuples(), in my case, had the same performance over the entire dataframe
- depending on your choice of slicing index, you will be able to exponentially quicken the iteration; the higher the index, the quicker your iteration process
CONS of Divide and Conquer:
=================== Divide and Conquer Approach =================
Step 1: Splitting/Slicing
In this step, we are going to divide the iteration over the entire dataframe. Think that you are going to read a CSV file into a pandas df and then iterate over it. In my case, I have 5,000,000 records, and I am going to split them into chunks of 100,000 records.
NOTE: I need to reiterate, as other runtime analyses in the other solutions on this page explain, that "number of records" has exponential proportion to "runtime" when searching on the df. Based on the benchmark on my data, here are the results:
Number of records | Iteration per second
========================================
100,000 | 500 it/s
500,000 | 200 it/s
1,000,000 | 50 it/s
5,000,000 | 20 it/s
Step 2: Merging
This is going to be an easy step: just merge all the written CSV files into one dataframe and write it into a bigger CSV file.
Here is the sample code:
# Step 1 (Splitting/Slicing)
import pandas as pd

df_all = pd.read_csv('C:/KtV.csv')
df_index = 100000
df_len = len(df_all)  # bug fix: was len(df), which is not defined at this point

for i in range(df_len // df_index + 1):
    lower_bound = i * df_index
    higher_bound = min(lower_bound + df_index, df_len)

    # splitting/slicing df (make sure to copy(), otherwise it will be a view)
    df = df_all[lower_bound:higher_bound].copy()

    '''
    write your iteration over the sliced df here
    using iterrows() or itertuples() or ...
    '''

    # writing into csv files
    df.to_csv('C:/KtV_prep_' + str(i) + '.csv')

# Step 2 (Merging)
filename = 'C:/KtV_prep_'
df = (pd.read_csv(f) for f in [filename + str(i) + '.csv' for i in range(df_len // df_index + 1)])
df_prep_all = pd.concat(df)
df_prep_all.to_csv('C:/KtV_prep_all.csv')
Reference:
Efficient way of iteration over dataframe
Concatenate CSV files into one Pandas DataFrame
As the accepted answer states, the fastest way to apply a function over rows is to use a vectorized function, the so-called NumPy ufuncs (universal functions).
But what should you do when the function you want to apply isn't already implemented in NumPy?
Well, using the vectorize decorator from numba, you can easily create ufuncs directly in Python like this:
from numba import vectorize, float64

@vectorize([float64(float64)])
def f(x):
    # x is one value from the column; do something with it and return a float
    return x * 2.0  # example body: double each value
The documentation for this function is here: Creating NumPy universal functions
Probably the most elegant solution (but certainly not the most efficient):
for row in df.values:
    c2 = row[1]
    print(row)
    # ...

for c1, c2 in df.values:
    # ...
Note that:
- the documentation explicitly recommends using .to_numpy() instead
- the produced 2D array has a single dtype for all columns, in the worst case object
Still, I think this option should be included here, as a straightforward solution to a (one should think) trivial problem.
A better way is to convert the dataframe into a dictionary by using zip, creating a key-value pairing, and then accessing the row values by key. My answer shows how to use a dictionary as an alternative to Pandas. Some people think dictionaries and tuples are more efficient. You can easily replace the dictionary with a namedtuple list.
inp = [{'c1':10, 'c2':100}, {'c1':11, 'c2':110}, {'c1':12, 'c2':120}]
df = pd.DataFrame(inp)
print(df)
for row in inp:
    for (k, v) in zip(row.keys(), row.values()):
        print(k, v)
Output:
c1 10
c2 100
c1 11
c2 110
c1 12
c2 120
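A minimal sketch of that namedtuple variant (Row is a hypothetical name):

from collections import namedtuple

Row = namedtuple('Row', ['c1', 'c2'])
rows = [Row(**d) for d in inp]
for r in rows:
    print(r.c1, r.c2)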