Python Pandas Dataframe - Row-Level Operations

Question

I need to do a large number of row-level operations (a few pages of code) on a table of data.

E.g. if row.Col_A == 'X': row.Col_B = 'Y'

I believe iterrows isn't appropriate for altering table values, so I've converted the table to a list of DotMap dictionaries. With this I can loop over the list and, for each dictionary (row), write the code as above, and the alterations are saved.

Is it possible to do this with the data as a DataFrame?

There is a lot of logic and I think it's clearest written this way, so I'd prefer not to use map or apply functions.

Answer 1

Let's take the following example dataframe:

import pandas as pd
import numpy as np

some_data = pd.DataFrame({
    'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
    'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z']
})

We want to create a new column based on the values of one (or more) of the existing columns.

If you have only two options, I would suggest using numpy.where like this:

# note: the else branch covers col_a >= 3, so 3 itself is also labelled 'greater_than_3'
some_data['np_where_example'] = np.where(some_data.col_a < 3, 'less_than_3', 'greater_than_3')
print(some_data)
>>>
   col_a col_b np_where_example
0      1     a      less_than_3
1      2     b      less_than_3
2      1     c      less_than_3
3      2     c      less_than_3
4      3     a   greater_than_3
5      4     b   greater_than_3
6      3     z   greater_than_3
7      4     z   greater_than_3

# multiple conditions
some_data['np_where_multiple_conditions'] = np.where(((some_data.col_a >= 3) & (some_data.col_b == 'z')),
                                                     'is_true',
                                                     'is_false')
print(some_data)
>>>
   col_a col_b np_where_multiple_conditions
0      1     a                     is_false
1      2     b                     is_false
2      1     c                     is_false
3      2     c                     is_false
4      3     a                     is_false
5      4     b                     is_false
6      3     z                      is_true
7      4     z                      is_true
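To connect this back to the question's `if row.Col_A == 'X': row.Col_B = 'Y'`, here is a minimal sketch of the same np.where call used to update an existing column while leaving the other rows untouched. The frame and its values are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Col_A / Col_B are the names used in the question.
df = pd.DataFrame({
    'Col_A': ['X', 'Q', 'X', 'R'],
    'Col_B': ['a', 'b', 'c', 'd'],
})

# Where Col_A is 'X' write 'Y'; everywhere else keep the current Col_B value.
df['Col_B'] = np.where(df['Col_A'] == 'X', 'Y', df['Col_B'])

print(df['Col_B'].tolist())
```

Passing the existing column as the third argument is what makes this behave like an `if` without an `else`.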

If you have many options, the pandas Series.map method would be better:

some_data['map_example'] = some_data.col_b.map({
    'b': 'BBB',
    'z': 'ZZZ'
})
print(some_data)
>>>
   col_a col_b map_example
0      1     a         NaN
1      2     b         BBB
2      1     c         NaN
3      2     c         NaN
4      3     a         NaN
5      4     b         BBB
6      3     z         ZZZ
7      4     z         ZZZ

As you can see, with map the values that have no entry in the mapping evaluate to NaN. (numpy.where always needs an explicit value for the "otherwise" case, so it does not produce NaN on its own.)
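If the NaN values are unwanted, one possible approach (a sketch, not part of the original answer) is to chain fillna after map so that unmapped rows fall back to their original value. The column name `map_with_fallback` is made up for this example.

```python
import pandas as pd

some_data = pd.DataFrame({
    'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
    'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z'],
})

mapping = {'b': 'BBB', 'z': 'ZZZ'}

# map() yields NaN for keys missing from the dict; fillna() then
# substitutes the original col_b value at those positions.
some_data['map_with_fallback'] = (
    some_data['col_b'].map(mapping).fillna(some_data['col_b'])
)

print(some_data['map_with_fallback'].tolist())
```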

Answer 2

You can use the apply function with a lambda in the following way:

df['Col_B'] = df['Col_A'].apply(lambda a: 'Y' if a == 'X' else 'N')

This creates the column Col_B on the dataframe df by looking at Col_A and giving the value 'Y' if Col_A is 'X', and 'N' otherwise.

If your function is a bit more complex, you can define it beforehand and pass it to apply as follows:

def yes_or_no(x):
    if x == 'X':
        return 'Y'
    else:
        return 'N'
# the function can be passed directly; wrapping it in a lambda is not needed
df['Col_B'] = df['Col_A'].apply(yes_or_no)
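When the logic depends on more than one column, the same idea can be applied to whole rows with `DataFrame.apply(..., axis=1)`. The following is only a sketch: the column `Col_C`, the sample values and the function name `new_col_b` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data; Col_C is invented to show a two-column condition.
df = pd.DataFrame({
    'Col_A': ['X', 'Q', 'X'],
    'Col_B': ['a', 'b', 'c'],
    'Col_C': [1, 2, 3],
})

def new_col_b(row):
    # row is a Series holding one row, so plain if/else logic works here.
    if row['Col_A'] == 'X' and row['Col_C'] > 1:
        return 'Y'
    return row['Col_B']  # otherwise keep the existing value

# axis=1 calls the function once per row instead of once per column.
df['Col_B'] = df.apply(new_col_b, axis=1)

print(df['Col_B'].tolist())
```

This keeps the row-by-row style the question asks for, although it still calls a Python function per row and is therefore slower than the vectorized alternatives.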

Answer 3

A possible way to iterate over a dataframe by rows and change column values is:

make sure that there are no duplicated values in the index (if there are, just use reset_index to get an acceptable index)
iterate over the index and access the individual values with at
```
for ix in df.index:
    if df.at[ix, 'A'] == ...:
        df.at[ix, 'B'] = z
```
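The two steps above can be sketched as a runnable example. The data, the duplicated index labels and the condition `== 'X'` are made up for illustration, and `drop=True` is an assumption about how one would typically call reset_index here.

```python
import pandas as pd

# Hypothetical frame whose index deliberately contains a duplicate label.
df = pd.DataFrame(
    {'A': ['X', 'Q', 'X'], 'B': ['a', 'b', 'c']},
    index=[10, 10, 20],
)

# Step 1: .at needs unique labels to address a single cell.
if not df.index.is_unique:
    df = df.reset_index(drop=True)  # drop=True discards the old labels

# Step 2: walk the index and read/write single cells with .at
for ix in df.index:
    if df.at[ix, 'A'] == 'X':
        df.at[ix, 'B'] = 'Y'

print(df['B'].tolist())
```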

Alternatively, if you can access the columns by their positions instead of their names, you can use the even more efficient iat:

for i in range(len(df)):
    if df.iat[i, index_col_A] == ... :
        df.iat[i, index_col_B] = z

Because you access the individual elements directly, you avoid the overhead of iterrows creating a Series per row, and you can make changes. AFAIK, it is the least bad way when you cannot use the vectorized Pandas or numpy methods.
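For the iat variant, the column positions can be looked up once with `df.columns.get_loc`. This is a minimal sketch with invented data; it assumes the column names are unique, so that get_loc returns a single integer.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'A': ['X', 'Q', 'X'], 'B': ['a', 'b', 'c']})

# Resolve the integer positions of the columns once, outside the loop.
index_col_A = df.columns.get_loc('A')
index_col_B = df.columns.get_loc('B')

for i in range(len(df)):
    if df.iat[i, index_col_A] == 'X':
        df.iat[i, index_col_B] = 'Y'

print(df['B'].tolist())
```

Because iat is purely positional, it does not depend on the index labels being unique.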

3 answers

Answer 1
1 2020-06-24 06:33:07

Answer 2
0 2020-06-24 06:14:42

Answer 3
0 2020-06-24 06:29:56
