Is there any way to get relationships from an unlabeled dataset?
I have some data. The dataset includes features such as device id (int), phone (int), name (string), etc., but none of the data is labeled. My task is to get the probability that a person is using multiple ids or multiple devices.
I have no idea how to do it. Does anyone have an idea?
To be clear, here is an example. The dataset looks like this:
name    id     phone     device_id
Jason   123    12345     12341231    ......
James   1345   312312    312312312   ......
Jason   123    53523     23115124    ......
So we can see that Jason has 2 phone numbers.
How do I get this probability using a machine learning or deep learning method?
One possible way to do this is to compute the similarity between users.
As I understand it, device similarity for a user is your end goal.
For starters, combine the name and id fields, which together uniquely identify a user. Then generate a feature vector, as an array, from all the remaining fields.
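As a rough sketch of that first step (not a definitive implementation), the snippet below groups rows by a combined (name, id) key and collects the remaining fields per user. The function name `build_user_features` and the use of plain sets instead of numeric arrays are my own assumptions; the rows are the three from the example table, with the elided columns left out.

```python
from collections import defaultdict

# Rows copied from the example table: (name, id, phone, device_id).
rows = [
    ("Jason", 123, 12345, 12341231),
    ("James", 1345, 312312, 312312312),
    ("Jason", 123, 53523, 23115124),
]

def build_user_features(rows):
    """Group rows by the (name, id) key and collect the other fields per user.

    Hypothetical helper: sets are used here as a simple stand-in for a
    feature vector, which is an assumption, not part of the original answer.
    """
    features = defaultdict(lambda: {"phones": set(), "devices": set()})
    for name, user_id, phone, device_id in rows:
        key = (name, user_id)  # combined key identifying one user
        features[key]["phones"].add(phone)
        features[key]["devices"].add(device_id)
    return dict(features)

user_features = build_user_features(rows)
for key, feats in user_features.items():
    print(key, len(feats["phones"]), "phone(s),", len(feats["devices"]), "device(s)")
```

With the table above, the ("Jason", 123) key ends up with two phones and two devices, which is the observation made in the question.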
Afterwards you can just run a nested for loop comparing every user against all the others. This will give you the closest match for each user, and you can then set a threshold, or use kNN to do that.
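A minimal sketch of the pairwise comparison with a threshold might look like the following. Several things here are assumptions on my part: Jaccard overlap as the similarity measure, the 0.3 threshold, the function names, and the second "Jason" entry (id 999), which is invented purely to show a match. Note that the score is a similarity, not a calibrated probability.

```python
from itertools import combinations

# Hypothetical per-user feature sets keyed by (name, id).
# The ("Jason", 999) entry is made up for illustration only.
users = {
    ("Jason", 123): {("phone", 12345), ("phone", 53523),
                     ("device", 12341231), ("device", 23115124)},
    ("Jason", 999): {("phone", 12345), ("device", 12341231)},
    ("James", 1345): {("phone", 312312), ("device", 312312312)},
}

def jaccard(a, b):
    """Jaccard similarity: shared features divided by all distinct features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(users, threshold):
    """Compare every pair of users once (same effect as a nested for loop)."""
    matches = []
    for (key_a, feats_a), (key_b, feats_b) in combinations(users.items(), 2):
        score = jaccard(feats_a, feats_b)
        if score >= threshold:
            matches.append((key_a, key_b, score))
    # Highest-scoring (closest) pairs first.
    return sorted(matches, key=lambda m: m[2], reverse=True)

matches = similar_pairs(users, threshold=0.3)
for key_a, key_b, score in matches:
    print(key_a, key_b, round(score, 2))
```

Comparing all pairs is quadratic in the number of users, so for a large dataset a nearest-neighbour index (the kNN option) would likely scale better than the exhaustive loop.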
Take a look at this: Convert Nested dictionary to Pyspark Dataframe