
How to train on highly unbalanced data for link prediction using logistic regression

This is my first question here on Stack Overflow.

I am new to Python and I am trying to implement a link prediction problem.

I have a list like this:

list_pos = [('alpha', 'beta'),
            ('beta', 'gama'),
            ('alpha', 'lamda'),
            ('gama', 'lamda'),
            ('euphor', 'tuphor')]

And I am able to generate negative examples, i.e. tuple pairs that do not already exist in the list, as follows:

from itertools import combinations
elements = list(set([e for l in list_pos for e in l])) # find all unique elements

complete_list = list(combinations(elements, 2)) # generate all possible combinations

# convert to sets so that pair order does not matter

set1 = [set(l) for l in list_pos]
complete_set = [set(l) for l in complete_list]

# find sets in `complete_set` but not in `set1`
list_neg = [list(l) for l in complete_set if l not in set1]

The output is here:

list_neg = 
[['gama', 'tuphor'],
 ['gama', 'alpha'],
 ['gama', 'euphor'],
 ['lamda', 'tuphor'],
 ['alpha', 'tuphor'],
 ['beta', 'tuphor'],
 ['euphor', 'lamda'],
 ['lamda', 'beta'],
 ['euphor', 'alpha'],
 ['euphor', 'beta']]

However, this leads to the following: for 5 positive examples, I have 10 negative examples.

With more items in the original list, I will eventually end up with a highly unbalanced dataset containing a lot of negative examples, which will affect my model's training scores.

My question is: how do I train on such unbalanced datasets with good accuracy?

To generate my final dataset, I am using the following:

import pandas as pd

dflp = pd.DataFrame(list_pos, columns=['user1','user2'])
dflp['link'] = 1
dfln = pd.DataFrame(list_neg, columns=['user1','user2'])
dfln['link'] = 0
df_n = pd.concat([dflp, dfln])
df_n.head()

This way I have a dataset suitable for applying logistic regression.
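For context, fitting logistic regression on `df_n` could be sketched as follows. The one-hot encoding of the `user1`/`user2` columns via `pd.get_dummies` is my assumption; the question does not show which features are actually fed to the model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# the question's positive and generated negative pairs
list_pos = [('alpha', 'beta'), ('beta', 'gama'), ('alpha', 'lamda'),
            ('gama', 'lamda'), ('euphor', 'tuphor')]
list_neg = [['gama', 'tuphor'], ['gama', 'alpha'], ['gama', 'euphor'],
            ['lamda', 'tuphor'], ['alpha', 'tuphor'], ['beta', 'tuphor'],
            ['euphor', 'lamda'], ['lamda', 'beta'], ['euphor', 'alpha'],
            ['euphor', 'beta']]

dflp = pd.DataFrame(list_pos, columns=['user1', 'user2'])
dflp['link'] = 1
dfln = pd.DataFrame(list_neg, columns=['user1', 'user2'])
dfln['link'] = 0
df_n = pd.concat([dflp, dfln], ignore_index=True)

# one-hot encode the node names so logistic regression can consume them
X = pd.get_dummies(df_n[['user1', 'user2']])
y = df_n['link']
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # training accuracy on this toy dataset
```

On a real graph you would use structural features (common neighbors, Jaccard similarity, etc.) rather than raw node identities, but the imbalance issue below is the same either way.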

If the dataset is large enough, you should try to delete some of the negative examples in order to have a balanced dataset.

If the dataset is not large enough, you can still delete some of the negative examples and try cross-validation methods like Leave-One-Out / jackknife. These methods are used when the training dataset is small (train dataset < 100 rows).
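A minimal sketch of the undersampling suggestion above, using the question's data. Drawing the subset with `random.sample` and fixing a seed are my choices, not part of the answer:

```python
import random
from itertools import combinations

import pandas as pd

list_pos = [('alpha', 'beta'), ('beta', 'gama'), ('alpha', 'lamda'),
            ('gama', 'lamda'), ('euphor', 'tuphor')]

# regenerate the negatives as in the question
elements = list(set(e for pair in list_pos for e in pair))
pos_sets = [set(p) for p in list_pos]
list_neg = [list(c) for c in combinations(elements, 2)
            if set(c) not in pos_sets]

# undersample: keep only as many negatives as there are positives
random.seed(0)  # fixed seed for reproducibility
sampled_neg = random.sample(list_neg, len(list_pos))

dflp = pd.DataFrame(list_pos, columns=['user1', 'user2'])
dflp['link'] = 1
dfln = pd.DataFrame(sampled_neg, columns=['user1', 'user2'])
dfln['link'] = 0
df_balanced = pd.concat([dflp, dfln], ignore_index=True)
print(df_balanced['link'].value_counts())  # 5 positives, 5 negatives
```

With larger graphs the same idea applies: sample the negative pairs down to (roughly) the size of the positive set before training.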

