I've been trying to turn this
| row_id | col_id |
|--------|--------|
| 1 | 23 |
| 4 | 45 |
| ... | ... |
| 1 | 23 |
| ... | ... |
| 4 | 45 |
| ... | ... |
| 4 | 45 |
| ... | ... |
Into this
| row_id | col_id | count |
|--------|--------|---------|
| 1 | 23 | 2 |
| 4 | 45 | 3 |
| ... | ... | ... |
So the occurrences of each (row_id, column_id) pair are summed into the 'count' column. Note that neither row_id nor column_id is unique in either case.
No success so far, at least not if I want to stay efficient. I can iterate over each pair and add up the occurrences, but there has to be a simpler way in pandas, or numpy for that matter.
Thanks!
EDIT 1:
As @j-bradley suggested, I tried the following
# I use django-pandas
rdf = Record.objects.to_dataframe(['row_id', 'column_id'])
_ = rdf.groupby(['row_id', 'column_id'])['row_id'].count().head(20)
_.head(10)
And that outputs
row_id  column_id
1       108          1
        168          1
        218          1
        398          2
        422          1
10      35           2
        355          1
        489          1
100     352          1
        366          1
Name: row_id, dtype: int64
This seems ok. But it's a Series object and I'm not sure how to turn this into a dataframe with the required three columns. Pandas noob, as it seems. Any tips?
Thanks again.
You can group by columns A and B and call count on the groupby object:
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})
df.groupby(['A', 'B'])['A'].count()
returns:
A  B
1  23    2
4  45    3
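One caveat worth sketching: `count()` counts the non-null values of the selected column, while `size()` counts rows. For the example above both give the same numbers, but they can differ once the counted column has missing values. In the sketch below, column `C` is hypothetical (it is not in the original data) and only exists to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column C is an assumption added for illustration
# and contains one missing value.
df = pd.DataFrame({
    'A': [1, 4, 1, 4, 4],
    'B': [23, 45, 23, 45, 45],
    'C': [0.5, np.nan, 1.5, 2.5, 3.5],
})

grouped = df.groupby(['A', 'B'])
non_null_c = grouped['C'].count()  # non-null values of C per (A, B) pair
row_counts = grouped.size()        # rows per (A, B) pair, NaN or not
```

If the goal is simply "how many times does each pair occur", `size()` is probably the safer choice.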
Edited to make the answer more explicit
To turn the series back to a dataframe with a column named count:
_ = df.groupby(['A','B'])['A'].count()
The name of the series becomes the column name, so set it first:
_.name = 'Count'
Resetting the index promotes the multi-index levels to columns and turns the series into a dataframe:
df = _.reset_index()
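The same steps can likely be collapsed into a single expression. This is a minimal sketch, assuming the same toy data as above: `Series.reset_index` accepts a `name` argument for the value column, so the separate rename is not needed. The variable name `result` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})

# Count rows per (A, B) pair, then promote the index levels to columns
# and name the value column 'count' in one step.
result = df.groupby(['A', 'B']).size().reset_index(name='count')
print(result)
```

With the django-pandas frame from the question, the grouping keys would be `['row_id', 'column_id']` instead of `['A', 'B']`.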