Looking for an efficient way to build a matrix from the Yelp review dataset in Python

Question

I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.

The matrix should look like this:

4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|

Specifically, the columns are business_id values and the rows are user_id values:

      |bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|

I'm currently using this Yelp review dataset, stored in MongoDB:

_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"

My approach is to build lists of unique business_id and user_id values from the review table, and then query the review table again for each combination of those values.

I've included my code below. Because of this brute-force approach, it takes a long time to build even a small matrix like the one shown above.

Here is a snippet of my code:

def makeBisnisArray(cityNameParam):
    arrayBisnis = []

    #Append business id filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        #if the business id is not in array, then append it to the array
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser

    #find review filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        #if the user id is not already in array, append it to the array
        if(not(review['user_id'] in arrayUser)):
            arrayUser.append(review['user_id'])


def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            #find one instance from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})

            #if there's none, then just write the rating as 0
            if x is None :
                f.write('0|')
            #if found, write the star value
            else:
                f.write((str(x['stars'])+"|"))
        print()
        f.write('\n')


def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser) 


arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)

Can anyone suggest a more efficient way to build the rating matrix?

Answer 1

There are several general approaches you can take to speed this up.

Use sets or dictionaries to establish the unique sets of businesses and users; set/dict lookups are much faster than list searches.
Process the Yelp data one entry at a time, in a single pass.
Use something like numpy or pandas to build your matrix.

Something like this:


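import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = []     # (row, column, stars) tuples; a list, because {} would be a dict with no .append()

# yelp_entries: any iterable of review records, e.g. a MongoDB cursor
for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))

# Zero-filled matrix with one row per user and one column per business
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]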
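
The third suggestion above mentions pandas. As a rough sketch of that route (not part of the original answer), the same user-by-business table can be produced with a pivot. The sample records and the helper name ratings_frame are hypothetical, and keeping only the last review per (user, business) pair is an assumption about how repeat reviews should be handled.

```python
import pandas as pd

# Hypothetical sample records with the same fields as the Yelp reviews.
reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
    {"user_id": "u3", "business_id": "b3", "stars": 3},
]


def ratings_frame(entries):
    """Return a user x business DataFrame of stars, with 0 meaning 'no review'."""
    df = pd.DataFrame(list(entries), columns=["user_id", "business_id", "stars"])
    # Assumption: if a user reviewed a business more than once, keep the last one seen.
    df = df.drop_duplicates(subset=["user_id", "business_id"], keep="last")
    wide = df.pivot(index="user_id", columns="business_id", values="stars")
    # Missing (user, business) pairs come out as NaN; replace them with 0.
    return wide.fillna(0).astype(int)


matrix = ratings_frame(reviews)
print(matrix)
```

The row and column labels stay attached to the data, so matrix.loc[user_id, business_id] looks up a rating without a separate index dictionary.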

Answer 2

I adapted @sirlark's code to my needs. With ratings declared as {}, I could not call append on it or iterate over it with for r in ratings, so I changed the code like this:

import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = {}     # running counter -> (row, column, stars)

#Query all reviews matching the business_ids first and store them in yelp_entries

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))

matrix = numpy.tile(0, (len(users), len(businesses)))

for ind in range(len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
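
The loop above assumes yelp_entries already holds the reviews of interest. One possible way to get them out of MongoDB in two queries, rather than one find_one() per (user, business) pair, is sketched below. The function name fetch_city_reviews is hypothetical, and it assumes, as the question's code does, that a business document's _id is the value stored in a review's business_id field. Only the standard find(filter, projection) call and the $in operator are used.

```python
def fetch_city_reviews(col_business, col_review, city_name):
    """Return an iterable of the reviews for every business in one city."""
    # Assumption (taken from the question's code): a business document's "_id"
    # is the value stored in a review's "business_id" field.
    business_ids = [
        doc["_id"] for doc in col_business.find({"city": city_name}, {"_id": 1})
    ]
    # A single $in query fetches all matching reviews at once.
    # The projection asks the server for only the three fields the matrix needs.
    projection = {"_id": 0, "user_id": 1, "business_id": 1, "stars": 1}
    return col_review.find({"business_id": {"$in": business_ids}}, projection)
```

With the collections from the question, this would be called as yelp_entries = fetch_city_reviews(colBisnis, colReview, 'Pointe-Aux-Trembles').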

Later I found that, instead of numpy.tile, we can also use SciPy's coo_matrix, which is slightly faster than the numpy.tile approach, but the code needs a small modification:

from scipy.sparse import coo_matrix

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
row = []         # row index of each rating
col = []         # column index of each rating
data = []        # star value of each rating

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))

matrix = coo_matrix((data, (row, col))).toarray()
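
Two details of coo_matrix are worth keeping in mind here: duplicate (row, col) entries are summed when the matrix is converted, and .toarray() allocates the full dense table. The sketch below is one way to handle both, under the assumption that the last review seen for a (user, business) pair should win. The function name build_sparse_ratings is hypothetical.

```python
from scipy.sparse import coo_matrix


def build_sparse_ratings(entries):
    """Return (csr_matrix, users, businesses) built from review records."""
    users = {}       # user_id -> row index
    businesses = {}  # business_id -> column index
    cells = {}       # (row, col) -> stars; a later review overwrites an earlier one
    for entry in entries:
        r = users.setdefault(entry["user_id"], len(users))
        c = businesses.setdefault(entry["business_id"], len(businesses))
        cells[(r, c)] = int(entry["stars"])

    rows = [rc[0] for rc in cells]
    cols = [rc[1] for rc in cells]
    data = list(cells.values())
    shape = (len(users), len(businesses))
    # Stay sparse (CSR) instead of calling .toarray(), so unrated cells cost no memory.
    return coo_matrix((data, (rows, cols)), shape=shape).tocsr(), users, businesses


# Hypothetical sample data; u1 reviewed b1 twice.
sample = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u1", "business_id": "b1", "stars": 3},
    {"user_id": "u3", "business_id": "b2", "stars": 2},
]
matrix, users, businesses = build_sparse_ratings(sample)
print(matrix.toarray())
```

Calling .toarray() on the result still gives the dense table when it is small enough to fit in memory.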

Note: I later found out why I couldn't call .append() or .add() on the ratings variable:

ratings = {}

creates an empty dict, not a set. To declare an empty set (which supports .add()), use this instead:

ratings = set()

For .append(), use a list (ratings = []) instead.
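
A quick check of the three empty literals makes the difference concrete. This is only an illustration, and the variable names are made up for the example.

```python
empty_braces = {}
print(type(empty_braces).__name__)  # dict

ratings_list = []                   # a list supports .append()
ratings_list.append((0, 1, 4))

ratings_set = set()                 # a set supports .add()
ratings_set.add((0, 1, 4))

# A dict has neither method, which is why the original call failed.
print(hasattr(empty_braces, "append"), hasattr(empty_braces, "add"))  # False False
```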

Answer 1
0 2020-01-30 16:29:16

Answer 2
0 Accepted 2020-02-01 03:31:24
