
Pearson's Correlation in Python

I have some Python code that computes the similarity between users with Pearson's correlation, and I want to walk through the calculation step by step because I'm a beginner with Python. When I calculate it manually and compare with the program's output, the results are always different, so I suspect I'm making a mistake in the manual calculation. The code is:

# A dictionary of movie critics and their ratings of a small set of movies


critics={'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
  'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0}
}


from math import sqrt

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # If there are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r


def main():
    z = sim_pearson(critics, 'User 1','User 2')
    print(z)

if __name__ == "__main__":
    main()

I want to calculate the similarity between User 1 and User 2, but I'm confused by this part:

([prefs[p1][it] for it in si])

What is the meaning of [it]?

The similarity this program returns is: 0.755928946018

Does ([prefs[p1][it] for it in si]) mean multiplying the ratings of User 1 together, like 1*2*4? Or does it multiply them with the ratings of User 2, like (1*2)+(1*3)+(4*3)?

I'm confused by the [p1][it] part. I hope you can help me. Thanks in advance.

Let's take the line you're confused by and break it down:

sum1=sum([prefs[p1][it] for it in si])

Let's work from the outside in. The outermost part is an assignment statement, so we're computing some value to assign to sum1:

sum1 = ...

Now, let's see what is being assigned:

sum1 = sum(...)

The builtin sum function expects to be passed an iterable object (such as a list). Let's see what the argument is:

sum1 = sum([... for it in si])

The outer square brackets tell us we're getting a list. The for it in si syntax means that this is a list comprehension: Python runs a for loop to produce the items of the list. I'm pretty sure the variable name it stands for "item". The loop is over the keys of the dictionary si (which really should have been created as a list, since its values are never used).
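As a rough illustration, the list comprehension behaves like the explicit loop below. The prefs, p1, and si values are copied from the question's data; the name ratings is only introduced here for the example.

```python
prefs = {
    'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
    'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0},
}
p1 = 'User 1'
# si as the function builds it: every movie both users rated
si = {'Spiderman': 1, 'Batman Begins': 1, 'Superman': 1}

# Explicit loop doing the same work as [prefs[p1][it] for it in si]
ratings = []
for it in si:                      # iterating over a dict yields its keys
    ratings.append(prefs[p1][it])  # look up p1's rating for that movie

sum1 = sum(ratings)                # add the collected ratings together
print(sum1)
```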

It's worth noting that you could actually leave out the square brackets from the code here, and rather than being a list comprehension that produces a full list all up front, it would be a "lazy" generator expression which creates an iterable generator object that computes each value one by one, as they are requested.
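A small sketch of the two forms side by side, assuming Python 3 and using User 1's ratings from the question (the names user1, list_total, and gen_total are made up for this example):

```python
user1 = {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0}
si = ['Spiderman', 'Batman Begins', 'Superman']

# List comprehension: the whole list is built before sum() sees it
list_total = sum([user1[it] for it in si])

# Generator expression: values are handed to sum() one at a time
gen_total = sum(user1[it] for it in si)

print(list_total, gen_total)
```

Both forms give the same total; the difference only matters for memory use when the sequence is large.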

Anyway, let's find out what the items of the list comprehension are:

sum1 = sum([...[it] for it in si])

The square brackets in [it] are indexing syntax, since they're immediately to the right of another expression. You can index lists and tuples with integers, and dictionaries with any kind of hashable object (such as a string). In this case, the key is it, which is our loop variable in the list comprehension. Let's see what we're indexing:

sum1 = sum([...[p1][it] for it in si])

The [it] indexing is being applied to the result of a previous indexing, this time with [p1], which is an argument to the function. Let's see what this indexing is done on:

sum1 = sum([prefs[p1][it] for it in si])

The indexing is being done on prefs, which is the function's first parameter; in your call it refers to the global critics dictionary. The keys of that dictionary are reviewers (which is what p1 must be), and the values are nested dictionaries mapping movie names to ratings.
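The two-step lookup can be tried on its own. This sketch uses the critics data from the question; user_ratings, rating, and same_rating are illustrative names only.

```python
critics = {
    'User 1': {'Spiderman': 1.0, 'Batman Begins': 2.0, 'Superman': 4.0},
    'User 2': {'Spiderman': 2.0, 'Batman Begins': 3.0, 'Superman': 3.0},
}

# First index: pick one reviewer's inner dictionary
user_ratings = critics['User 1']

# Second index: pick one movie's rating from that inner dictionary
rating = user_ratings['Superman']

# Chaining the two lookups in one expression gives the same value
same_rating = critics['User 1']['Superman']

print(rating, same_rating)
```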

So, the statement adds up one reviewer's scores for a specific set of movies and assigns the total to sum1. With your example data, this will be 1.0 + 2.0 + 4.0, so sum1 will be assigned 7.0. No multiplication is involved, and User 2's ratings play no part in this line.
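To check a manual calculation, it may help to print every intermediate value. The sketch below repeats the function's arithmetic on the question's data with the ratings written out as plain lists (u1 and u2 are names made up here). Products only appear in pSum, where each of User 1's ratings is paired with User 2's rating for the same movie: 1*2 + 2*3 + 4*3.

```python
from math import sqrt

# Ratings in the order Spiderman, Batman Begins, Superman
u1 = [1.0, 2.0, 4.0]   # User 1
u2 = [2.0, 3.0, 3.0]   # User 2
n = len(u1)

sum1 = sum(u1)                             # 1 + 2 + 4
sum2 = sum(u2)                             # 2 + 3 + 3
sum1Sq = sum(x ** 2 for x in u1)           # 1 + 4 + 16
sum2Sq = sum(y ** 2 for y in u2)           # 4 + 9 + 9
pSum = sum(x * y for x, y in zip(u1, u2))  # 1*2 + 2*3 + 4*3

num = pSum - (sum1 * sum2 / n)
den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
r = num / den

print(sum1, sum2, sum1Sq, sum2Sq, pSum)
print(num, den, r)
```

The final value of r agrees with the 0.755928946018 reported in the question, so a differing manual result most likely comes from one of the intermediate sums.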

