
Pearson's Correlation in Python

I have some Python code that computes the similarity between users with Pearson's correlation, and I want to analyze the calculation step by step because I'm a beginner with Python. When I calculate the value manually and compare it with the result of this program, the two are always different. I'm wondering whether I'm making a mistake in the manual calculation. The code is like this:

# A dictionary of movie critics and their ratings of a small set of movies


critics={'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
  'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0}
}


from math import sqrt

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # If there are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r


def main():
    z = sim_pearson(critics, 'User 1','User 2')
    print(z)

if __name__ == "__main__":
    main()

I want to calculate the similarity between User 1 and User 2, but I'm confused by this part:

([prefs[p1][it] for it in si])

What is the meaning of [it]?

The similarity this program returns is 0.755928946018.

Does this code, ([prefs[p1][it] for it in si]), multiply the ratings of User 1 together, like 1*2*4? Or does it multiply them with the ratings of User 2, like (1*2)+(1*3)+(4*3)?

I'm confused by [p1][it]. I hope you can help me; thanks in advance.

Let's take the line you're confused by and break it down:

sum1=sum([prefs[p1][it] for it in si])

Let's work from the outside in. The outermost part is an assignment statement, so we're computing some value to assign to sum1:

sum1 = ...

Now, let's see what is being assigned:

sum1 = sum(...)

The builtin sum function expects to be passed an iterable object (such as a list). Let's see what the argument is:

sum1 = sum([... for it in si])

The outer square brackets tell us we're getting a list. The for it in si syntax means that this is a list comprehension: Python runs a for loop to produce the items of the list. I'm pretty sure the variable name it stands for "item". The loop is over the keys of the dictionary si (which really should have been created as a list, since you never care about its values).
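To make the loop inside the comprehension visible, here is a small sketch that builds the same list both ways. The data is copied from the question; the names ratings and ratings_loop are only for illustration, and si is written out by hand rather than computed.

```python
# Example data copied from the question.
prefs = {
    'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
    'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0},
}
p1 = 'User 1'

# The movies both users rated, as sim_pearson would build it.
si = {'Spiderman': 1, 'Batman Begins': 1, 'Superman': 1}

# The list comprehension from the question...
ratings = [prefs[p1][it] for it in si]

# ...produces the same list as this explicit loop.
ratings_loop = []
for it in si:  # iterating over a dictionary yields its keys
    ratings_loop.append(prefs[p1][it])

print(ratings)
print(ratings_loop)
```

Both lists hold User 1's ratings for the three shared movies, so the comprehension only collects values; it does not multiply anything.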

It's worth noting that you could actually leave out the square brackets inside the sum(...) call. Rather than a list comprehension that builds the full list up front, it would then be a "lazy" generator expression, which creates an iterable generator object that computes each value one by one as it is requested.
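A minimal sketch of that difference, using a hypothetical ratings dictionary with User 1's values from the question. Both calls give the same total; only the way the values are produced differs.

```python
# User 1's ratings from the question (the variable name is illustrative).
ratings = {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0}

# List comprehension: the whole list is built first, then summed.
total_from_list = sum([ratings[it] for it in ratings])

# Generator expression: sum() pulls the values one at a time,
# and no intermediate list is created.
total_from_gen = sum(ratings[it] for it in ratings)

# A generator can also be stepped through by hand with next().
gen = (ratings[it] for it in ratings)
first_value = next(gen)

print(total_from_list, total_from_gen)
```

For three items the difference is negligible; the generator form mainly saves memory when the sequence is large.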

Anyway, let's find out what the items of the list comprehension are:

sum1 = sum([...[it] for it in si])

The square brackets in [it] are indexing syntax, since they're immediately to the right of another expression. You can index lists and tuples with integers, and dictionaries with any kind of hashable object (such as strings). In this case, the key is it, which is our loop variable in the list comprehension. Let's see what we're indexing:

sum1 = sum([...[p1][it] for it in si])

The [it] indexing is being applied to the result of a previous indexing, this time with [p1], which is an argument to the function. Let's see what this indexing is done on:

sum1 = sum([prefs[p1][it] for it in si])

The indexing is being done on prefs, which is the function's first parameter; in main() the global critics dictionary is passed in for it. The keys of that dictionary are reviewers (which is what p1 must be), and the values are nested dictionaries mapping movie names to ratings.
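The chained lookup prefs[p1][it] can be split into its two steps. This is only an illustrative sketch with the question's data; the intermediate name user_ratings is not in the original code.

```python
# Example data copied from the question.
prefs = {
    'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
    'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0},
}
p1 = 'User 1'
it = 'Superman'

# First index: pick one reviewer's inner dictionary.
user_ratings = prefs[p1]

# Second index: pick that reviewer's rating for one movie.
rating = user_ratings[it]

# The chained form does both lookups in one expression.
chained = prefs[p1][it]

print(user_ratings)
print(rating, chained)
```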

So, the statement adds up one reviewer's scores for a specific set of movies and assigns the total to sum1. With your example data, this will be 1.0 + 2.0 + 4.0, so sum1 will be assigned 7.0.
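For checking a hand calculation against the program, it may help to print every intermediate value that sim_pearson computes. The sketch below reuses the formulas and variable names from the question on its example data; the lists r1 and r2 are a shorthand introduced here, not part of the original function.

```python
from math import sqrt

# Ratings for the three shared movies, in the same movie order
# (Spiderman, Batman Begins, Superman).
r1 = [1.0, 2.0, 4.0]  # User 1
r2 = [2.0, 3.0, 3.0]  # User 2
n = len(r1)

sum1 = sum(r1)                            # 1 + 2 + 4
sum2 = sum(r2)                            # 2 + 3 + 3
sum1Sq = sum(x * x for x in r1)           # 1 + 4 + 16
sum2Sq = sum(y * y for y in r2)           # 4 + 9 + 9

# The only place ratings are multiplied with each other:
# each movie's two ratings are paired, so 1*2 + 2*3 + 4*3.
pSum = sum(x * y for x, y in zip(r1, r2))

num = pSum - (sum1 * sum2 / n)
den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
r = num / den

print(sum1, sum2, sum1Sq, sum2Sq, pSum)
print(num, den, r)
```

Note that the product sum pairs ratings for the same movie, so it is (1*2)+(2*3)+(4*3) = 20 rather than (1*2)+(1*3)+(4*3). With these intermediate values, r comes out at about 0.7559, which matches the program's result.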


 