I have a dataset with features such as id (int), phone (int), name (string), device id, etc. None of the data is labeled. My task is to get the probability that a person is using multiple ids or multiple devices. I have no idea how to approach this. Does anyone have an idea?
To be clear, here is an example. The dataset looks like this:
name id phone device_id
Jason 123 12345 12341231 ......
James 1345 312312 312312312 ......
Jason 123 53523 23115124 ......
Here we can see that Jason has 2 phone numbers.
How do I get this probability using a machine-learning or deep-learning method?
One possible way to do this is to compute the similarity between users.
As I understand it, device similarity for a user is your end goal.
For starters, combine the name and id fields into a key that uniquely identifies a user. Then generate a feature vector (an array) from all the remaining fields.
Afterwards, you can run a nested for loop that compares every user against every other user. This gives you the closest match for each user, and you can either set a similarity threshold or use kNN to pick the matches.
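The steps above could be sketched roughly as follows, in plain Python. This is only a minimal illustration under some assumptions of mine: the rows are the three from the question, the "feature vector" is represented as a set of (field, value) pairs, Jaccard overlap is used as the similarity measure, and the 0.5 threshold is arbitrary. The function names (`build_user_features`, `distinct_counts`, `similar_pairs`) are made up for this example.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows mirroring the example in the question: (name, id, phone, device_id)
rows = [
    ("Jason", 123, 12345, 12341231),
    ("James", 1345, 312312, 312312312),
    ("Jason", 123, 53523, 23115124),
]

def build_user_features(rows):
    """Group rows by the (name, id) key and collect the remaining fields as a set."""
    features = defaultdict(set)
    for name, user_id, phone, device_id in rows:
        key = (name, user_id)
        features[key].add(("phone", phone))
        features[key].add(("device_id", device_id))
    return dict(features)

def distinct_counts(features, field):
    """Count how many distinct values of one field each user has."""
    return {
        user: sum(1 for f, _ in values if f == field)
        for user, values in features.items()
    }

def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the answer)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(features, threshold=0.5):
    """Compare every user with every other user; keep pairs above the threshold."""
    pairs = []
    for (u, fu), (v, fv) in combinations(features.items(), 2):
        score = jaccard(fu, fv)
        if score >= threshold:
            pairs.append((u, v, score))
    return pairs

features = build_user_features(rows)
phones_per_user = distinct_counts(features, "phone")
pairs = similar_pairs(features, threshold=0.5)
```

With the sample rows, `phones_per_user` shows two phone numbers for ("Jason", 123) and one for ("James", 1345), and `pairs` is empty because the two users share no phone or device. The pairwise loop is quadratic in the number of users, so for a large dataset you would probably want a kNN index or some blocking on shared values instead.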
Take a look at this: Convert Nested dictionary to Pyspark Dataframe