
Looking for an efficient way to build a matrix from the Yelp review dataset in Python

I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.

The matrix should look like this:

4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|

Specifically, the columns are business_id values and the rows are user_id values:

      |bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|

I'm using the Yelp review dataset stored in MongoDB. Each review document looks like this:

_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"

My approach is to build lists of the unique business_id and user_id values from the review collection, and then to query the review collection again for every (user, business) pair.

Because of this brute-force approach, it takes a long time to build even a small matrix like the one shown earlier.

Here is a snippet of my code:

def makeBisnisArray(cityNameParam):
    arrayBisnis = []

    #Append business id filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        #if the business id is not in array, then append it to the array
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser

    #find review filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        #if the user id is not already in array, append it to the array
        if(not(review['user_id'] in arrayUser)):
            arrayUser.append(review['user_id'])


def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            #find one instance from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})

            #if there's none, then just write the rating as 0
            if x is None :
                f.write('0|')
            #if found, write the star value
            else:
                f.write((str(x['stars'])+"|"))
        print()
        f.write('\n')


def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser) 


arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)

Can anyone suggest a more efficient way to build the rating matrix?

There are several general approaches you can take to speed this up.

  1. Use sets or dictionaries to establish the unique businesses and users; set and dict lookups are much faster than list searches.
  2. Process the Yelp data one entry at a time, in a single pass.
  3. Use something like numpy or pandas to build your matrix.

Something like this:


import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = []     # (row, column, stars) triples; a list, so .append() works

for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))

# Start from an all-zero matrix, then fill in the known ratings
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]
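
Point 3 also mentions pandas. A minimal sketch of that route, assuming pandas is installed and that the entries carry the user_id, business_id and stars fields shown in the question; the sample records and the choice of aggfunc="max" for repeated reviews are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample standing in for the review documents; only the three
# fields needed for the matrix are included.
yelp_entries = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
    {"user_id": "u3", "business_id": "b3", "stars": 3},
]

df = pd.DataFrame(yelp_entries, columns=["user_id", "business_id", "stars"])

# Rows are users, columns are businesses; missing pairs are filled with 0.
# aggfunc="max" is an arbitrary choice for a user who reviewed a business twice.
rating_table = df.pivot_table(
    index="user_id",
    columns="business_id",
    values="stars",
    aggfunc="max",
    fill_value=0,
)

# Plain array of ratings; the ids remain available as
# rating_table.index (users) and rating_table.columns (businesses).
matrix = rating_table.to_numpy()
```

The pivot keeps the row and column labels, which makes it easy to map a matrix position back to a user or business. For a large city the dense table may not fit in memory, so this is best suited to small subsets.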

I adapted @sirlark's code to my needs. With ratings declared as {} (as in the code originally posted), I could not call .append() on it or iterate over it with for r in ratings as intended, so I changed the code to key the dict by position instead:

import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = {}     # position -> (row, column, stars)

# yelp_entries holds the reviews already queried for the businesses of interest

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))

matrix = numpy.tile(0, (len(users), len(businesses)))

for ind in range(len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
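
The snippets above leave yelp_entries undefined. One way it might be produced from the collections in the question is sketched below. It assumes, as the question's code does, that a business document's _id matches the business_id stored in its reviews; load_city_reviews is a hypothetical helper name, and the collections are passed in as arguments so that the function only relies on the standard find(filter, projection) call.

```python
def load_city_reviews(col_business, col_review, city_name):
    """Yield the reviews of every business in city_name using two queries."""
    # Query 1: ids of the businesses in the city (projection keeps only _id).
    business_ids = [
        b["_id"] for b in col_business.find({"city": city_name}, {"_id": 1})
    ]

    # Query 2: all reviews of those businesses at once, instead of one
    # find_one() call per (user, business) pair.
    projection = {"_id": 0, "user_id": 1, "business_id": 1, "stars": 1}
    cursor = col_review.find({"business_id": {"$in": business_ids}}, projection)
    for review in cursor:
        yield review
```

With the question's variables this would be called as yelp_entries = load_city_reviews(colBisnis, colReview, 'Pointe-Aux-Trembles'). The number of queries no longer grows with the number of users and businesses, and an index on business_id in the review collection would likely help the second query.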

Later I found out that, instead of the tile method, we can also use SciPy's coo_matrix, which is slightly faster than the method above, but the code needs to be modified a bit:

from scipy.sparse import coo_matrix

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
row = []
col = []
data = []

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))

# Build the sparse matrix from (values, (rows, columns)), then densify it
matrix = coo_matrix((data, (row, col))).toarray()
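
Since most (user, business) pairs have no review, it may be worth keeping the matrix sparse instead of calling .toarray(). A minimal sketch, using hypothetical row, col and data lists of the kind the loop above produces:

```python
from scipy.sparse import coo_matrix

# Hypothetical (row, column, value) lists as built by the loop above.
row = [0, 1, 1, 2]
col = [0, 0, 1, 2]
data = [4, 5, 2, 3]
n_users, n_businesses = 3, 3  # len(users), len(businesses) in the real code

# Passing shape explicitly avoids relying on the largest index seen.
# CSR supports fast row slicing, which COO does not.
sparse_ratings = coo_matrix(
    (data, (row, col)), shape=(n_users, n_businesses)
).tocsr()

# Only the stored ratings take memory; a single user's row can still be
# turned into a dense vector on demand.
second_user = sparse_ratings[1].toarray().ravel()
```

One caveat: when a COO matrix is converted to CSR or to a dense array, duplicate (row, column) entries are summed, so a user who reviewed the same business twice would get the sum of both ratings unless the duplicates are resolved first.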

Note: I later found out why I couldn't call .append() or .add() on the ratings variable. The literal

ratings = {}

creates an empty dict, not a set. To declare an empty set (which supports .add()), use this instead:

ratings = set()

For .append(), ratings has to be a list, declared as ratings = [].
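
A quick way to check the difference in an interpreter (the variable names here are only for illustration):

```python
empty_braces = {}
print(type(empty_braces).__name__)       # dict
print(hasattr(empty_braces, "append"))   # False

ratings_list = []                  # a list keeps order and duplicates
ratings_list.append((0, 0, 4))

ratings_set = set()                # a set drops duplicates and has no order
ratings_set.add((0, 0, 4))
ratings_set.add((0, 0, 4))         # same triple again, ignored
```

For the rating triples a list is usually the better fit, since the order of insertion is kept and nothing is silently dropped.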
