How can I make a seaborn plot in Python with displot that counts unique values in one field rather than the total number of rows?
I have a dataframe that contains about 60,000 rows. All 60,000 of them have unique record identifiers, but they also have separate sessionIds, of which about 12,000 are unique.
I am trying to use seaborn displot to make figures from these values, but when displot does the aggregation, I can only get it to count the number of records. I cannot get it to aggregate over the number of unique sessionIds.
Here is an example dataframe.
import pandas as pd
import seaborn as sns

temp_df = pd.DataFrame([['d7d1b050-0e48-4c00-8061-c78817155b72',
'42773088-e38f-4578-bc2a-69d1797a90eb',
11,
'groupA'],
['962c397d-a8f8-4f1c-a589-ecf74a7da62d',
'b5baafb0-f6d4-4b4e-bc76-1287614b985d',
10,
'groupA'],
['a90fde40-9b9f-466e-bd5e-a40325b5fc9d',
'b3fba007-aef5-4a5f-a53b-94eb0705d953',
11,
'groupB'],
['22ebb056-603c-4f66-8240-8c54e8043509',
'b780fa66-addd-48c0-8db4-d755ebd351b8',
10,
'groupC'],
['52ffd64c-a5c1-4cd5-89c8-c1dcb8bd24b2',
'37482cb7-c354-4b4b-92b6-2aaa62811e5b',
10,
'groupA'],
['55524169-f159-4c31-b939-bb00e1cba804',
'34a9ff63-ea75-473d-ab89-9a92c3f4a8d9',
10,
'groupB'],
['2027d9d0-1e29-4d1f-969a-995a47f12052',
'875488ea-85a2-47cb-b1ea-62003bbce80a',
10,
'groupA'],
['10d9c9fb-b5dd-4581-b148-a6198abecec1',
'3f4b0604-513a-424b-98a3-e788ab3daa97',
11,
'groupD'],
['1c1e183b-6459-41bd-99aa-5f89b375006a',
'53dd2ffd-c9b0-49c3-9275-190716c78799',
10,
'groupB'],
['31030ded-64a7-4854-8042-585605141e71',
'f0514527-2d7b-4cad-a36f-f21e3425093c',
10,
'groupD'],
['cdfd5a0c-dd8c-4546-ba31-c2f021fb4859',
'1ed007fe-d4f7-41bc-8f3c-b163c57f8a1f',
11,
'groupE'],
['66bd16a5-b514-4d8a-ad7a-afb8921f7dd2',
'a2e9f137-bba5-46ec-8b13-7b17821de735',
10,
'groupB'],
['3cdb21d9-be3c-4723-bf28-0a7769d492b4',
'9a6f1516-54a0-4dda-83d7-e05311e87ff5',
10,
'groupE'],
['d25f4cb2-3bf7-4898-a8a3-91d9e1b58576',
'716a7732-6bcd-478d-87f9-c13cd83eaf66',
11,
'groupA'],
['e95134fd-7ce2-4e88-808c-e5abf13a4892',
'c021c21b-7bab-4e1f-9ff0-4dfc584263b8',
11,
'groupE'],
['e13da005-1033-466f-b984-48fdfa0988f2',
'5bcc0651-0775-4fa5-b521-ac90e0a33b1c',
10,
'groupB'],
['b60ee53d-e4fc-4e37-aa1c-df67f66e304e',
'592adca4-6fa6-48c3-be97-2357250d736d',
10,
'groupD'],
['c1d47246-838f-418a-a92d-7b5150122775',
'ff5d180c-cca9-474a-974e-e18c35cab912',
10,
'groupA'],
['fc129686-f7cd-407a-aca3-68f86c52af41',
'a18dfc3a-2ce6-43f7-a21f-4c7371cff2b6',
11,
'groupE'],
['191af645-cb9e-408a-af2e-b6826f7177b9',
'd430610b-b7da-42cb-aa93-c7f94774093c',
10,
'groupA']])
temp_df.columns = ['clickId', 'sessionId', 'month','group']
sns.displot(data=temp_df, x='month', hue='group')
Conceptually, I guess what I want to do is take the dataframe and eliminate all duplicate rows at the sessionId level, but I don't know how to do that.
Can someone help me?
Thanks, Brad
The answer is surprisingly simple.
When I drew the original plot, I was calling
sns.displot(temp_df, x='month', hue='group')
which included every row of the data, so it counted unique row identifiers. Since I wanted to count by sessionId only, the solution I found was
sns.displot(temp_df[['sessionId', 'month', 'group']].drop_duplicates(), x='month', hue='group')
and that works.
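To check that the deduplicated frame really counts sessions rather than rows, something like the following sketch may help. The toy `clicks` frame and its values are made up for illustration (they are not from the data above); only `drop_duplicates`, `groupby`, `size`, and `nunique` are standard pandas calls.

```python
import pandas as pd

# Hypothetical toy data: 7 click rows, 4 distinct sessions.
# Session s1 appears in both month 10 and month 11.
clicks = pd.DataFrame(
    {
        "clickId": ["c1", "c2", "c3", "c4", "c5", "c6", "c7"],
        "sessionId": ["s1", "s1", "s2", "s3", "s3", "s4", "s1"],
        "month": [10, 10, 10, 11, 11, 11, 11],
        "group": ["groupA", "groupA", "groupB", "groupA", "groupA", "groupB", "groupA"],
    }
)

# Same approach as above: one row per distinct (sessionId, month, group).
unique_sessions = clicks[["sessionId", "month", "group"]].drop_duplicates()

# Rows per (month, group) after deduplication ...
rows_per_bar = unique_sessions.groupby(["month", "group"]).size()
# ... compared with distinct sessions per (month, group) in the raw data.
sessions_per_bar = clicks.groupby(["month", "group"])["sessionId"].nunique()

# Stricter variant: keep only the first row seen for each sessionId,
# so a session that spans two months is counted once overall.
per_session = clicks.drop_duplicates(subset="sessionId")

print(len(clicks), len(unique_sessions), len(per_session))
```

Either deduplicated frame can be passed to sns.displot in place of the original one. Which variant is right depends on whether a session that shows up in two months should be counted in both months or only once.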
Hopefully this helps someone else.