I have a dataset consisting of an identifier column and several flag columns marking characteristics of each row, for example:
In [86]: frame = pd.DataFrame({"key": [1,2,3,4,5,6,7,8,9], "flag1": [0,1,0,1,0,1,0,1,1], "flag2": [0,0,1,1,0,0,1,1,0], "flag3": [0,0,0,0,1,1,1,1,1]}, columns=['key','flag1','flag2','flag3'])
In [87]: frame
Out[87]:
   key  flag1  flag2  flag3
0    1      0      0      0
1    2      1      0      0
2    3      0      1      0
3    4      1      1      0
4    5      0      0      1
5    6      1      0      1
6    7      0      1      1
7    8      1      1      1
8    9      1      0      1
I'd like to produce a pivot-style table that counts, for each pair of flags, the rows where both flags are set, for example:
   flags  flag1  flag2  flag3
0  flag1      5      2      3
1  flag2      2      4      2
2  flag3      3      2      5
I think I'll have to iterate over frame.keys()[1:] in two nested loops, but I don't know how to populate this second dataset. I need to imitate the behavior of this Google Sheet, but my actual dataset is too large for Sheets/Excel to be usable (about 2 million rows and 60 columns): https://docs.google.com/spreadsheets/d/1emEm9RtxPAFceUgalCVbzr0mGNoZEMFjWwqSjrxyAuE/edit?usp=sharing
Let's remove key, since we don't need it. After that, the solution is pretty much a matrix dot product:
v = frame.drop(columns='key')
v.T.dot(v)
       flag1  flag2  flag3
flag1      5      2      3
flag2      2      4      2
flag3      3      2      5
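To see why the dot product gives the requested counts: entry (i, j) of the product is the sum over all rows of flag_i * flag_j, and that product is 1 only in rows where both flags are 1. The sketch below compares the matrix form with the explicit double loop the question had in mind, using the sample data from the question. The names flags, looped and dotted are made up for illustration, and it assumes every flag column holds only 0 and 1.

```python
import pandas as pd

# Sample data from the question.
frame = pd.DataFrame({
    "key":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "flag1": [0, 1, 0, 1, 0, 1, 0, 1, 1],
    "flag2": [0, 0, 1, 1, 0, 0, 1, 1, 0],
    "flag3": [0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Keep only the flag columns (assumed to contain just 0 and 1).
flags = frame.drop(columns="key")

# Explicit double loop: for each pair (a, b), count rows where both are 1.
looped = pd.DataFrame({
    b: {a: int(((flags[a] == 1) & (flags[b] == 1)).sum()) for a in flags.columns}
    for b in flags.columns
})

# Matrix form: (flags^T . flags)[a, b] = sum over rows of flags[a] * flags[b].
dotted = flags.T.dot(flags)

print(looped)
print(dotted)
```

Both tables should hold the same counts. The diagonal is simply the number of rows where each individual flag is set.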
Or, more efficiently, using del to drop the key column (note that this modifies frame in place):
del frame['key']
frame.T.dot(frame)
       flag1  flag2  flag3
flag1      5      2      3
flag2      2      4      2
flag3      3      2      5
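If the original frame still needs its key column afterwards, the same computation can be wrapped so that nothing is mutated. The following is only a sketch: cooccurrence is a hypothetical helper name, and the cast to int64 is a precaution for flags stored as bool or a small integer type, where the products would otherwise stay boolean or could overflow on large inputs. The last step reshapes the result into the layout from the question, with a flags column.

```python
import numpy as np
import pandas as pd

def cooccurrence(df, id_col="key"):
    """Hypothetical helper: pairwise co-occurrence counts of 0/1 flag columns."""
    flag_cols = [c for c in df.columns if c != id_col]
    # Cast to int64 so the counts do not overflow or collapse to booleans.
    m = df[flag_cols].to_numpy(dtype=np.int64)
    counts = m.T @ m  # (n_flags x n_rows) @ (n_rows x n_flags)
    return pd.DataFrame(counts, index=flag_cols, columns=flag_cols)

# Sample data from the question.
frame = pd.DataFrame({
    "key":   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "flag1": [0, 1, 0, 1, 0, 1, 0, 1, 1],
    "flag2": [0, 0, 1, 1, 0, 0, 1, 1, 0],
    "flag3": [0, 0, 0, 0, 1, 1, 1, 1, 1],
})

result = cooccurrence(frame)

# Match the layout asked for in the question: a 'flags' column plus a 0..n-1 index.
table = result.rename_axis("flags").reset_index()
print(table)
```

For 60 flag columns the result is only a 60 x 60 table, whatever the number of rows in the input.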