
Create new columns with counts of specific values across all columns (e.g. similar to COUNTIF)?

I have a dataset like so:

import pandas as pd

df = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
                   'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
                   'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
                   'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})

I'd like to append columns to this dataset, each holding the number of times a specific value appears in that row. I've been able to produce this in Excel with COUNTIF, like so:


+--------+--------+--------+--------+----------+-----------+-----------+
| Type.1 | Type.2 | Type.3 | Type.4 | ES_count | STR_count | RRH_count |
+--------+--------+--------+--------+----------+-----------+-----------+
| ES     | ES     | ES     | ES     |        4 |         0 |         0 |
| STR    | STR    | ES     | ES     |        2 |         2 |         0 |
| RRH    | ES     | STR    | STR    |        1 |         2 |         1 |
| ES     | ES     | STR    | STR    |        2 |         2 |         0 |
| STR    | STR    | ES     | ES     |        2 |         2 |         0 |
| STR    | STR    | ES     | ES     |        2 |         2 |         0 |
| STR    | ES     | ES     | ES     |        3 |         1 |         0 |
+--------+--------+--------+--------+----------+-----------+-----------+

What would be the best way to do this in Python? I think it would look something like the code below, but it isn't working.

for i in range(8):
    def function(row):
        if row[f"Type.{i-1}"] == 'ES':
            row['ES'] = row['ES'] + 1
        elif row[f"Type.{i-1}"] == 'RRH':
            row['RRH'] = row['RRH'] + 1
        elif row[f"Type.{i-1}"] == 'STR':
            row['STR'] = row['STR'] + 1
        elif row[f"Type.{i-1}"] == 'PSH':
            row['PSH'] = row['PSH'] + 1
        elif row[f"Type.{i-1}"] == 'TH':
            row['TH'] = row['TH'] + 1

df = df.apply(function, axis=1)  

Thank you!!

Here's another option:

df_out = pd.get_dummies(df, prefix='', prefix_sep='')
df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
df.join(df_out)

Output:

  Type.1 Type.2 Type.3 Type.4  ES_count  RRH_count  STR_count
0     ES     ES     ES     ES         4          0          0
1    STR    STR     ES     ES         2          0          2
2    RRH     ES    STR    STR         1          1          2
3     ES     ES    STR    STR         2          0          2
4    STR    STR     ES     ES         2          0          2
5    STR    STR     ES     ES         2          0          2
6    STR     ES     ES     ES         3          0          1
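
A note for newer pandas: to my knowledge, the `axis=1` argument of `DataFrame.groupby` is deprecated as of pandas 2.1, so the `groupby` line above may warn or fail on recent versions. Here is a minimal sketch of the same idea that avoids it, using the question's sample data. Grouping the transposed frame by its index labels and passing `dtype=int` to `get_dummies` are my substitutions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'Type.1': ['ES', 'STR', 'RRH', 'ES', 'STR', 'STR', 'STR'],
                   'Type.2': ['ES', 'STR', 'ES', 'ES', 'STR', 'STR', 'ES'],
                   'Type.3': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES'],
                   'Type.4': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES']})

# One 0/1 indicator column per (source column, value) pair. With an empty
# prefix and separator, the indicator columns are named after the bare value,
# so the same name ('ES', 'STR', ...) appears several times.
dummies = pd.get_dummies(df, prefix='', prefix_sep='', dtype=int)

# Transpose so the duplicated names become index labels, sum the rows that
# share a label, then transpose back to get one count column per value.
counts = dummies.T.groupby(level=0).sum().T.add_suffix('_count')

result = df.join(counts)
print(result)
```

This should give the same `ES_count`, `RRH_count` and `STR_count` columns as the output shown above.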

Timings:

%%timeit
df_out = pd.get_dummies(df, prefix='', prefix_sep='')
df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
df.join(df_out)

6.98 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df2 = df.apply(pd.Series.value_counts, axis=1)
df_out = pd.concat([df,df2],axis=1).fillna(0)
df_out

9.51 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing on larger DataFrames:

from timeit import timeit

import pandas as pd

df  = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
              'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
              'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
              'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})

def getdummy(d):
    df_out = pd.get_dummies(d, prefix='', prefix_sep='')
    df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
    return pd.concat([d, df_out], axis=1)
    
def applyvc(d):
    df2 = d.apply(pd.Series.value_counts, axis=1)
    return pd.concat([d,df2],axis=1).fillna(0)

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000],
    columns='getdummy applyvc'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setup = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setup, number=100)

# res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
res.plot(loglog=True);

[Image: log-log plot of the timings in res for getdummy and applyvc]
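
The `from __main__ import ...` setup string only resolves when the benchmark runs as a top-level script or notebook. Below is a hedged sketch of an equivalent harness that passes callables to `timeit` instead. The `funcs` dictionary, the smaller `sizes` list, the reduced `number=5`, and the transposed `groupby` inside `getdummy` are my own assumptions to keep the example quick and version-tolerant; they are not the original benchmark settings.

```python
from timeit import timeit

import pandas as pd

df = pd.DataFrame({'Type.1': ['ES', 'STR', 'RRH', 'ES', 'STR', 'STR', 'STR'],
                   'Type.2': ['ES', 'STR', 'ES', 'ES', 'STR', 'STR', 'ES'],
                   'Type.3': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES'],
                   'Type.4': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES']})

def getdummy(d):
    # Indicator columns, then sum columns that share a name (via transpose).
    out = pd.get_dummies(d, prefix='', prefix_sep='', dtype=int)
    out = out.T.groupby(level=0).sum().T.add_suffix('_count')
    return pd.concat([d, out], axis=1)

def applyvc(d):
    # Row-wise value_counts; values absent from a row come back as NaN.
    counts = d.apply(pd.Series.value_counts, axis=1)
    return pd.concat([d, counts], axis=1).fillna(0)

funcs = {'getdummy': getdummy, 'applyvc': applyvc}  # hypothetical registry
sizes = [10, 30]                                    # repetitions of the 7-row frame

res = pd.DataFrame(index=sizes, columns=list(funcs), dtype=float)
for n in sizes:
    d = pd.concat([df] * n, ignore_index=True)
    for name, func in funcs.items():
        # timeit accepts a zero-argument callable, so no setup string is needed.
        res.at[n, name] = timeit(lambda: func(d), number=5)

print(res)
```

The timings will differ from the figures quoted earlier, since both the sizes and the repeat count are smaller here.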

The code below should work. It builds a second DataFrame holding the per-row count of each value, then concatenates it with the original.

df2 = df.apply(pd.Series.value_counts, axis=1)
df = pd.concat([df,df2],axis=1).fillna(0)
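
As written, the new columns are named `ES`, `RRH` and `STR`, and the counts come out as floats because values absent from a row start as `NaN`. The sketch below is a small variation that matches the `_count` layout from the question, using the question's sample data. The `fillna(0).astype(int)` and `add_suffix('_count')` steps are my additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Type.1': ['ES', 'STR', 'RRH', 'ES', 'STR', 'STR', 'STR'],
                   'Type.2': ['ES', 'STR', 'ES', 'ES', 'STR', 'STR', 'ES'],
                   'Type.3': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES'],
                   'Type.4': ['ES', 'ES', 'STR', 'STR', 'ES', 'ES', 'ES']})

# Count each value within every row; a value missing from a row yields NaN,
# so fill with 0 and cast back to integers before renaming the columns.
counts = (df.apply(pd.Series.value_counts, axis=1)
            .fillna(0)
            .astype(int)
            .add_suffix('_count'))

# Filling only the count frame leaves any NaN in the original columns untouched.
result = pd.concat([df, counts], axis=1)
print(result)
```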
