
Is there any way to find relationships in an unlabeled dataset?

I have some data. The dataset includes features such as device id (int), phone (int), name (string), etc., but none of the data is labeled. My task is to get the probability that a person is using multiple ids or multiple devices. I have no idea how to do it. Does anyone have an idea?

To be clear, here is an example. The dataset looks like this:

 name    id     phone    device_id
 Jason   123    12345    12341231    ......
 James   1345   312312   312312312   ......
 Jason   123    53523    23115124    ......

So we can see that Jason has 2 phone numbers.
How do I get the probability using a machine-learning or deep-learning method?

One possible way to do this is to compute the similarity between users.

As I understand it, device similarity for a user is your end goal.

For starters, combine the name and id fields, which together uniquely identify a user. Then generate a feature vector for each user from all the remaining fields, stored as an array.
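A minimal sketch of that grouping step, assuming the rows look like the example table in the question. The function name `build_user_features` and the choice of sets as the per-user "feature vector" are my own assumptions, not something the data dictates:

```python
from collections import defaultdict

# Rows mirror the example table in the question: (name, id, phone, device_id).
rows = [
    ("Jason", 123, 12345, 12341231),
    ("James", 1345, 312312, 312312312),
    ("Jason", 123, 53523, 23115124),
]

def build_user_features(rows):
    """Group rows by the (name, id) key and collect the remaining fields."""
    features = defaultdict(lambda: {"phones": set(), "devices": set()})
    for name, user_id, phone, device_id in rows:
        key = (name, user_id)  # combined key that identifies a user
        features[key]["phones"].add(phone)
        features[key]["devices"].add(device_id)
    return dict(features)

user_features = build_user_features(rows)
for key, feats in user_features.items():
    print(key, len(feats["phones"]), "phones,", len(feats["devices"]), "devices")
```

Counting the distinct phones and devices per key already tells you which users have more than one of each, before any similarity step.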

Afterwards, you can run a nested for loop that compares every user against every other user. This gives you the closest match for each user; you can then set a similarity threshold, or use kNN to pick the nearest neighbours.
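Here is a rough sketch of the pairwise comparison with a threshold. The per-user sets below are hypothetical data made up for illustration (including the second "Jason" with id 456), and Jaccard similarity is just one measure I picked; `similar_pairs` and the 0.5 threshold are assumptions as well. `itertools.combinations` does the same job as the nested loop without comparing a pair twice:

```python
from itertools import combinations

# Hypothetical per-user feature sets (not from the question's data): each user
# key maps to the set of phone and device values seen for that user.
user_features = {
    ("Jason", 123): {"phone:12345", "phone:53523", "device:12341231"},
    ("James", 1345): {"phone:312312", "device:312312312"},
    ("Jason", 456): {"phone:12345", "device:12341231"},
}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(features, threshold):
    """Compare every user with every other user; keep pairs at or above threshold."""
    pairs = []
    for (u1, f1), (u2, f2) in combinations(features.items(), 2):
        score = jaccard(f1, f2)
        if score >= threshold:
            pairs.append((u1, u2, score))
    # Highest similarity first
    return sorted(pairs, key=lambda p: p[2], reverse=True)

matches = similar_pairs(user_features, threshold=0.5)
for u1, u2, score in matches:
    print(u1, u2, round(score, 2))
```

Note that the score is a similarity, not a calibrated probability. Without labels you can only rank or threshold the candidate pairs, so the threshold has to be chosen by inspecting the results.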

Take a look at this: Convert Nested dictionary to Pyspark Dataframe
