How to represent each user by a unique row (Python)?
I have data like this:
UserId Date Part_of_day Apps Category Frequency Duration_ToT
1 2020-09-10 evening Settings System tool 1 3.436
1 2020-09-11 afternoon Calendar Calendar 5 9.965
1 2020-09-11 afternoon Contacts Phone_and_SMS 7 2.606
2 2020-09-11 afternoon Facebook Social 15 50.799
2 2020-09-11 afternoon clock System tool 2 5.223
3 2020-11-18 morning Contacts Phone_and_SMS 3 1.726
3 2020-11-18 morning Google Productivity 1 4.147
3 2020-11-18 morning Instagram Social 1 0.501
.......................................
67 2020-11-18 morning Truecaller Communication 1 1.246
67 2020-11-18 night Instagram Social 3 58.02
I'm trying to reduce the dimensionality of my dataframe to prepare the input for k-means. Is it possible to represent each user by a single row? What do you think about using an embedding? How can I do this, please? I can't find any solution.
This depends on how you want to aggregate the values. Here is a small example of how to do it with groupby and agg.
First I create some sample data.
import pandas as pd
import random
df = pd.DataFrame({
    "id": [i // 3 for i in range(20)],
    "val1": [random.random() for _ in range(20)],
    "val2": [str(int(random.random()*100)) for _ in range(20)]
})
>>> df.head()
id val1 val2
0 0 0.174553 49
1 0 0.724547 95
2 0 0.369883 3
3 1 0.243191 64
4 1 0.575982 16
>>> df.dtypes
id int64
val1 float64
val2 object
dtype: object
Then we group by the id and aggregate the values according to the functions you specify in the dictionary you pass to agg. In this example I sum up the float values and join the strings with a double-underscore separator. You could also, for example, pass the list function to store the values in a list.
>>> df.groupby("id").agg({"val1": "sum", "val2": "__".join})
val1 val2
id
0 1.268984 49__95__3
1 0.856992 64__16__54
2 2.186370 30__59__21
3 1.486925 29__47__77
4 1.523898 19__78__99
5 0.855413 59__74__73
6 0.201787 63__33
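The list variant mentioned above could look roughly like the following. This is a minimal sketch on a small hand-written frame (the values are made up for illustration and are not the random sample from above); it assumes you want one list of strings per id.

```python
import pandas as pd

# Small hypothetical frame with the same columns as the sample data.
df = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "val1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "val2": ["49", "95", "3", "64", "16"],
})

# Sum the floats and collect the strings of each group into a Python list.
per_id = df.groupby("id").agg({"val1": "sum", "val2": list})
print(per_id)
```

Each cell of val2 then holds an ordinary Python list, which is convenient for inspection but would still need to be encoded numerically before being passed to k-means.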
EDIT regarding the comment "But how can we make val2 contain the top 5 applications according to the duration of the application?":
The agg method is restricted in the sense that you cannot access other attributes while aggregating. To do that, you should use the apply method. You pass it a function that processes the whole group and returns a row as a Series object.
In this example I still use the sum for val1, but for val2 I return the val2 of the row with the highest val1. This should make clear how to make the aggregation depend on other attributes.
def apply_func(group):
    return pd.Series({
        "id": group["id"].iat[0],
        "val1": group["val1"].sum(),
        "val2": group["val2"].iat[group["val1"].argmax()]
    })
>>> df.groupby("id").apply(apply_func)
id val1 val2
id
0 0 1.749955 95
1 1 0.344372 65
2 2 2.019035 70
3 3 2.444691 36
4 4 2.573576 92
5 5 1.453769 72
6 6 1.811516 94
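For the top-N request from the comment, one possible approach on the question's own columns is sketched below. The helper name top_apps and its parameter n are hypothetical, the rows are a subset copied from the question's table, and the sketch assumes that "top" means the largest total Duration_ToT per user and app.

```python
import pandas as pd

# A few rows copied from the question's table.
usage = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3, 3, 3],
    "Apps": ["Settings", "Calendar", "Contacts", "Facebook", "clock",
             "Contacts", "Google", "Instagram"],
    "Duration_ToT": [3.436, 9.965, 2.606, 50.799, 5.223, 1.726, 4.147, 0.501],
})


def top_apps(usage, n=5):
    """Return, per user, a list of the n apps with the largest total duration."""
    # Total duration per (user, app), so an app used on several days counts once.
    totals = usage.groupby(["UserId", "Apps"])["Duration_ToT"].sum().reset_index()
    # Order each user's apps from longest to shortest total duration.
    totals = totals.sort_values(["UserId", "Duration_ToT"], ascending=[True, False])
    # Keep the first n rows of each user, then collect the app names in a list.
    top = totals.groupby("UserId").head(n)
    return top.groupby("UserId")["Apps"].agg(list)


print(top_apps(usage, n=2))
```

This avoids apply by sorting first and then taking the head of each group. The result has one row per user, so it can be joined back onto other per-user aggregates.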