This is my first question here on Stack Overflow.
I am new to Python and I am trying to implement a link prediction problem.
I have a list like this:
list_pos = [('alpha', 'beta'),
            ('beta', 'gama'),
            ('alpha', 'lamda'),
            ('gama', 'lamda'),
            ('euphor', 'tuphor')]
I am able to generate negative examples, i.e. pairs that do not already exist in the list, as follows:
from itertools import combinations
elements = list(set([e for l in list_pos for e in l])) # find all unique elements
complete_list = list(combinations(elements, 2)) # generate all possible combinations
# convert to sets so that the order within a pair is ignored
set1 = [set(l) for l in list_pos]
complete_set = [set(l) for l in complete_list]
# find sets in `complete_set` but not in `set1`
list_neg = [list(l) for l in complete_set if l not in set1]
The output is here:
list_neg =
[['gama', 'tuphor'],
['gama', 'alpha'],
['gama', 'euphor'],
['lamda', 'tuphor'],
['alpha', 'tuphor'],
['beta', 'tuphor'],
['euphor', 'lamda'],
['lamda', 'beta'],
['euphor', 'alpha'],
['euphor', 'beta']]
However, this means that for 5 positive examples I get 10 negative examples.
With more items in the original list, I will end up with a highly unbalanced dataset with far more negative examples than positive ones, which will affect my model's training scores.
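To make the imbalance concrete, here is a rough sketch of the arithmetic: every unordered pair of unique elements that is not a known link becomes a negative example. The helper name `count_negatives` and the second set of numbers are made up for illustration only.

```python
from math import comb

def count_negatives(n_nodes, n_positives):
    # All unordered pairs of nodes, minus the pairs that are known links.
    return comb(n_nodes, 2) - n_positives

# The example above: 6 unique elements, 5 positive pairs.
print(count_negatives(6, 5))        # 15 - 5 = 10

# Hypothetical larger graph: 1000 elements, 5000 positive pairs.
print(count_negatives(1000, 5000))  # 499500 - 5000 = 494500
```

The number of candidate pairs grows quadratically with the number of elements, while the number of positive pairs usually grows much more slowly.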
My question is: how can I train on such an unbalanced dataset and still get good accuracy?
To build my final dataset, I am using the following:
import pandas as pd
dflp = pd.DataFrame(list_pos, columns=['user1','user2'])
dflp['link'] = 1
dfln = pd.DataFrame(list_neg, columns=['user1','user2'])
dfln['link'] = 0
df_n = pd.concat([dflp, dfln])
df_n.head()
This way I have a dataset suitable for applying logistic regression.
If the dataset is large enough, you should try deleting some of the negative examples (undersampling) so that the dataset is balanced.
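A minimal sketch of that undersampling step, assuming you simply want as many negatives as positives and that a uniform random choice is acceptable. The seed value and the name `list_neg_balanced` are arbitrary choices for illustration, not part of the question.

```python
import random

list_pos = [('alpha', 'beta'), ('beta', 'gama'), ('alpha', 'lamda'),
            ('gama', 'lamda'), ('euphor', 'tuphor')]
list_neg = [['gama', 'tuphor'], ['gama', 'alpha'], ['gama', 'euphor'],
            ['lamda', 'tuphor'], ['alpha', 'tuphor'], ['beta', 'tuphor'],
            ['euphor', 'lamda'], ['lamda', 'beta'], ['euphor', 'alpha'],
            ['euphor', 'beta']]

rng = random.Random(42)  # fixed seed (arbitrary) so the sample is reproducible

# Keep as many negatives as there are positives, drawn without replacement.
n_keep = min(len(list_pos), len(list_neg))
list_neg_balanced = rng.sample(list_neg, n_keep)

print(len(list_pos), len(list_neg_balanced))  # 5 5
```

If the negatives are already in the `dfln` DataFrame from the question, `dfln.sample(n=len(dflp), random_state=0)` should give an equivalent random subset.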
If the dataset is not large enough, you can still delete some of the negative examples and use cross-validation methods such as leave-one-out or the jackknife. These methods are used when the training dataset is small (fewer than 100 rows).
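To illustrate what leave-one-out does, here is a small pure-Python sketch of the splitting only: each row is held out once and the remaining rows form the training set. The function name `leave_one_out` and the four sample rows are hypothetical; fitting and scoring a model on each split is left out.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_index), holding out one row at a time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out

# Hypothetical rows in the (user1, user2, link) layout used in the question.
rows = [('alpha', 'beta', 1), ('beta', 'gama', 1),
        ('gama', 'tuphor', 0), ('euphor', 'beta', 0)]

splits = list(leave_one_out(len(rows)))
for train_idx, test_idx in splits:
    # A model would be fit on the rows in train_idx and evaluated on rows[test_idx].
    print(train_idx, test_idx)
```

With n rows this produces n splits, so n models are trained, which is why the approach is practical mainly for small datasets. If scikit-learn is available, `sklearn.model_selection.LeaveOneOut` provides the same kind of splits.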