简体   繁体   中英

Python Pandas Dataframe - row level operations

I need to do a large number of row level operations (a few pages of code) on a table of data.

Eg if row.Col_A == 'X': row.Col_B = 'Y'

I believe iterrows isn't appropriate for altering table values. So I've converted the table to a list of DotMap dictionaries. With this I can loop over the list and for each dictionary (row), write the code as above and the alterations are saved.

Is it possible to do this with the data as a DataFrame?

There is a lot of logic and I think its clearest written this way so I'd prefer not to use map or apply functions.

Let's have the following example dataframe:

import pandas as pd
import numpy as np

some_data = pd.DataFrame({
    'col_a': [1, 2, 1, 2, 3, 4, 3, 4],
    'col_b': ['a', 'b', 'c', 'c', 'a', 'b', 'z', 'z']
})

We want to create a new column based on one (or more) of the existing columns' values.

In case you have only two options, I would suggest using numpy.where like this:

some_data['np_where_example'] = np.where(some_data.col_a < 3, 'less_than_3', 'greater_than_3')
print(some_data)
>>>
   col_a col_b           col_c map_example np_where_example  \
0      1     a     less_than_3         NaN      less_than_3   
1      2     b     less_than_3         BBB      less_than_3   
2      1     c     less_than_3         NaN      less_than_3   
3      2     c     less_than_3         NaN      less_than_3   
4      3     a  greater_than_3         NaN   greater_than_3   
5      4     b  greater_than_3         BBB   greater_than_3   
6      3     z  greater_than_3         ZZZ   greater_than_3   
7      4     z  greater_than_3         ZZZ   greater_than_3 

# multiple conditions
some_data['np_where_multiple_conditions'] = np.where(((some_data.col_a >= 3) & (some_data.col_b == 'z')),
                                                     'is_true',
                                                     'is_false')
print(some_data)
>>>
   col_a col_b np_where_multiple_conditions
0      1     a                     is_false
1      2     b                     is_false
2      1     c                     is_false
3      2     c                     is_false
4      3     a                     is_false
5      4     b                     is_false
6      3     z                      is_true
7      4     z                      is_true

In case you have many options, then pandas.map would be better:

some_data['map_example'] = some_data.col_b.map({
    'b': 'BBB',
    'z': 'ZZZ'
})
print(some_data)
>>>
   col_a col_b map_example
0      1     a         NaN
1      2     b         BBB
2      1     c         NaN
3      2     c         NaN
4      3     a         NaN
5      4     b         BBB
6      3     z         ZZZ
7      4     z         ZZZ

As you see, in all cases the values for which a condition is not specified evaluate to NaN .

You can use the apply function with a lambda in the following way:

df['Col_B'] = df['Col_A'].apply(lambda a: 'Y' if a == 'X' else 'N')

This creates the column Col_B on the dataframe df by looking at Col_A and giving either the values 'Y' if Col_A is 'X' and 'N' otherwise.

if your function is a bit more complex you can define it beforehand and call it in the apply function as follows:

def yes_or_no(x):
    if x == 'X':
        return 'Y'
    else:
        return 'N'
df['Col_B'] = df['Col_A'].apply(lambda a: yes_or_no(a))

A possible way to iterate over a dataframe by rows and change column values is:

  1. make sure that there are no duplicated values in index (if there are, just use reset_index to get an acceptable index)

  2. iterate over the index and access the individual values with at

     for ix in df.index: if df.at[ix, 'A'] ==...: df.at[ix, 'B'] = z

Alternatively, if you can access the columns by their positions instead of their names, you can use the even more efficient iat :

for i in range(len(df)):
    if df.iat[i, index_col_A] == ... :
        df.iat[i, index_col_B] = z

As you access directly the individual elements, you avoid the overhead of iterrows creating a Series per row, and can perform changes. AFAIK, it is the less bad way when you cannot use the vectorized Pandas or numpy methods.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM