
How to count changing values across various columns - Pandas Dataframe

I am trying to track a specific feature over time for different unique ids, which I have stored in a Pandas DataFrame.

Here is an example with code to replicate:

import pandas as pd

d = {'id': ['adam', 'john'], 't1': ['A', 'A'], 't2': ['A', 'B'], 't3': ['A', 'B'], 't4': ['B', 'A']}
df = pd.DataFrame(data=d)
df

     id t1 t2 t3 t4
0  adam  A  A  A  B
1  john  A  B  B  A

I would like to count the number of times the t* value changes for a specific id. Example:

John's value starts at A, moves to B in t2 (first change), stays at B in t3 (no change), then moves back to A in t4 (second change), so there are two total changes.

Expected output is as follows:

     id t1 t2 t3 t4  total_change
0  adam  A  A  A  B             1
1  john  A  B  B  A             2
Select the t columns, shift them one position to the right along the rows, backfill the first column, and count where the shifted values differ from the originals:

s = df[df.columns[1:]]
df.assign(total_change=s.ne(s.shift(axis=1).bfill(axis=1)).sum(axis=1))

Output:

     id t1 t2 t3 t4  total_change
0  adam  A  A  A  B             1
1  john  A  B  B  A             2
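
To see why this counts changes, it can help to print the intermediate frames. This is just an illustrative sketch, reconstructing the example df from the question:

import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'], 't1': ['A', 'A'],
                   't2': ['A', 'B'], 't3': ['A', 'B'], 't4': ['B', 'A']})
s = df[df.columns[1:]]           # just the t1..t4 columns

shifted = s.shift(axis=1)        # each row moved one column to the right; t1 becomes NaN
filled = shifted.bfill(axis=1)   # the leading NaN is filled from the next column, so t1 never counts as a change
mask = s.ne(filled)              # True wherever a value differs from the previous column
print(mask)
#       t1     t2     t3     t4
# 0  False  False  False   True
# 1  False   True  False   True
print(mask.sum(axis=1))          # row-wise count of changes: 1 for adam, 2 for john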

This will be slower than its NumPy equivalent:

df = pd.concat([df]*10000)
s = df[df.columns[1:]]
v = df.filter(regex=r'^t\d+').values

%%timeit
df.assign(total_change=s.ne(s.shift(axis=1).bfill(axis=1)).sum(axis=1))
21.2 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.assign(total_change=(v[:, 1:] != v[:, :-1]).sum(1))
1.9 ms ± 8.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Same idea, but using NumPy:

v = df.filter(regex=r'^t\d+').values
df.assign(total_change=(v[:, 1:] != v[:, :-1]).sum(axis=1))

     id t1 t2 t3 t4  total_change
0  adam  A  A  A  B             1
1  john  A  B  B  A             2
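
The speed comes from comparing each t column with its neighbour in one vectorised boolean operation. A small illustrative sketch, rebuilding the example data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'], 't1': ['A', 'A'],
                   't2': ['A', 'B'], 't3': ['A', 'B'], 't4': ['B', 'A']})

v = df.filter(regex=r'^t\d+').values   # 2-D object array of the t columns
changed = v[:, 1:] != v[:, :-1]        # compare t2..t4 with t1..t3 element-wise
print(changed)
# [[False False  True]
#  [ True False  True]]
print(changed.sum(axis=1))             # [1 2]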

You can use:

df.merge((df.set_index('id').shift(1, axis=1).bfill(axis=1) != df.set_index('id')).sum(axis=1)
            .rename('total_change')
            .to_frame(),
          left_on='id',
          right_index=True)

Output:

     id t1 t2 t3 t4  total_change
0  adam  A  A  A  B             1
1  john  A  B  B  A             2
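
The right-hand side of that merge is just a one-column frame of change counts indexed by id. A quick sketch of that intermediate, again assuming the example data from the question:

import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'], 't1': ['A', 'A'],
                   't2': ['A', 'B'], 't3': ['A', 'B'], 't4': ['B', 'A']})

# the part inside merge(): change counts indexed by id
counts = (df.set_index('id').shift(1, axis=1).bfill(axis=1)
          != df.set_index('id')).sum(axis=1).rename('total_change')
print(counts)
# id
# adam    1
# john    2
# Name: total_change, dtype: int64

Merging this frame back with left_on='id' and right_index=True attaches the count to each row of the original df.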

You could filter the columns, iterate over the values in each row, and check for a change:

columns_needed = [col for col in df.columns.values if col.startswith('t')]
df['total_change'] = df[columns_needed].apply(lambda row: sum([1 for val, val2
                                          in zip(row, row[1:]) if val != val2]), axis=1)

which results in:

     id t1 t2 t3 t4  total_change
0  adam  A  A  A  B             1
1  john  A  B  B  A             2

The lambda expression in apply is equivalent to:

def check_change(row):
    # collect a 1 for every position whose value differs from the previous one
    changes = [1 for val, val2 in zip(row, row[1:]) if val != val2]
    return sum(changes)
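
For completeness, a small usage sketch of that named function with apply (same result as the lambda version; df, columns_needed and check_change are the ones defined above):

# apply the named function row-wise instead of the inline lambda
df['total_change'] = df[columns_needed].apply(check_change, axis=1)
print(df[['id', 'total_change']])   # adam -> 1, john -> 2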
