I have a dataset like so:
pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
              'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
              'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
              'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})
I'm hoping to append columns to this dataset containing the count of each value across the Type columns (which I've been able to produce in Excel with COUNTIF, like so):
+--------+--------+--------+--------+----------+-----------+-----------+
| Type.1 | Type.2 | Type.3 | Type.4 | ES_count | STR_count | RRH_count |
+--------+--------+--------+--------+----------+-----------+-----------+
| ES | ES | ES | ES | 4 | 0 | 0 |
| STR | STR | ES | ES | 2 | 2 | 0 |
| RRH | ES | STR | STR | 1 | 2 | 1 |
| ES | ES | STR | STR | 2 | 2 | 0 |
| STR | STR | ES | ES | 2 | 2 | 0 |
| STR | STR | ES | ES | 2 | 2 | 0 |
| STR | ES | ES | ES | 3 | 1 | 0 |
+--------+--------+--------+--------+----------+-----------+-----------+
What would be the best method to do this in Python? I think it would look something like this, but it isn't working:
for i in range(8):
    def function(row):
        if row[f"Type.{i-1}"] == 'ES':
            row['ES'] = row['ES'] + 1
        elif row[f"Type.{i-1}"] == 'RRH':
            row['RRH'] = row['RRH'] + 1
        elif row[f"Type.{i-1}"] == 'STR':
            row['STR'] = row['STR'] + 1
        elif row[f"Type.{i-1}"] == 'PSH':
            row['PSH'] = row['PSH'] + 1
        elif row[f"Type.{i-1}"] == 'TH':
            row['TH'] = row['TH'] + 1
    df = df.apply(function, axis=1)
Thank you!!
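One way to fix the idea in the loop above: instead of mutating the row per column, compare the Type columns against each category and sum across the row. A minimal sketch, with the three categories hard-coded from the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
                   'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
                   'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
                   'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})

# For each category, compare every Type column to it (boolean frame)
# and sum the booleans row-wise -- a vectorised COUNTIF.
for cat in ['ES', 'STR', 'RRH']:
    df[f'{cat}_count'] = df.filter(like='Type').eq(cat).sum(axis=1)
```

`df.filter(like='Type')` keeps only the columns whose name contains `Type`, so the newly added `*_count` columns don't feed back into later iterations.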
Here's another option:
df_out = pd.get_dummies(df, prefix='', prefix_sep='')
df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
df.join(df_out)
Output:
Type.1 Type.2 Type.3 Type.4 ES_count RRH_count STR_count
0 ES ES ES ES 4 0 0
1 STR STR ES ES 2 0 2
2 RRH ES STR STR 1 1 2
3 ES ES STR STR 2 0 2
4 STR STR ES ES 2 0 2
5 STR STR ES ES 2 0 2
6 STR ES ES ES 3 0 1
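Note that `groupby(..., axis=1)` is deprecated in pandas 2.1+. The same column-wise grouping can be done on the transposed frame; a sketch of the equivalent (same idea, just avoiding the deprecated keyword):

```python
import pandas as pd

df = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
                   'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
                   'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
                   'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})

# One dummy column per (Type column, value) pair; duplicate names collapse
# once we group by the column label on the transposed frame.
dummies = pd.get_dummies(df, prefix='', prefix_sep='')
counts = dummies.T.groupby(level=0).sum().T.add_suffix('_count')
out = df.join(counts)
```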
%%timeit
df_out = pd.get_dummies(df, prefix='', prefix_sep='')
df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
df.join(df_out)
6.98 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df2 = df.apply(pd.Series.value_counts, axis=1)
df_out = pd.concat([df,df2],axis=1).fillna(0)
df_out
9.51 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
from timeit import timeit
df = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})
def getdummy(d):
    df_out = pd.get_dummies(d, prefix='', prefix_sep='')
    df_out = df_out.groupby(df_out.columns, axis=1).sum().add_suffix('_count')
    return pd.concat([d, df_out], axis=1)

def applyvc(d):
    df2 = d.apply(pd.Series.value_counts, axis=1)
    return pd.concat([d, df2], axis=1).fillna(0)

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000],
    columns='getdummy applyvc'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=100)
# res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);
res.plot(loglog=True);
The code below should work. It builds a second dataframe with the per-row counts of each value, then concatenates the two:
df2 = df.apply(pd.Series.value_counts, axis=1)
df = pd.concat([df,df2],axis=1).fillna(0)
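To match the `ES_count`-style column names from the question and keep the counts as integers (the NaNs introduced by `value_counts` otherwise upcast everything to float), a small variation on the above — a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Type.1': ['ES','STR','RRH','ES','STR','STR','STR'],
                   'Type.2': ['ES','STR','ES','ES','STR','STR','ES'],
                   'Type.3': ['ES','ES','STR','STR','ES','ES','ES'],
                   'Type.4': ['ES','ES','STR','STR','ES','ES','ES']})

# Per-row value counts, missing categories filled with 0 and cast back
# to int before the float upcast can stick; suffix matches the question.
counts = (df.apply(pd.Series.value_counts, axis=1)
            .fillna(0)
            .astype(int)
            .add_suffix('_count'))
df = pd.concat([df, counts], axis=1)
```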