Looking for an efficient way to build a matrix from the Yelp review dataset in Python

I'm looking for an efficient way to build a rating matrix for a recommendation system in Python.

The matrix should look like this:

4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|

Specifically, the columns are business_id values and the rows are user_id values:

      |bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|

I'm using the Yelp review dataset, stored in MongoDB. A review document looks like this:

_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"

My approach is to build lists of unique business_id and user_id values from the review data, then query the review collection again for each (user, business) pair.

My code is included below. Because of this brute-force approach, it takes a long time to build even a small matrix like the one shown earlier.

Here's a snippet of my code:

def makeBisnisArray(cityNameParam):
    arrayBisnis = []

    #Append business id filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        #if the business id is not in array, then append it to the array
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser

    #find review filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        #if the user id is not already in array, append it to the array
        if(not(review['user_id'] in arrayUser)):
            arrayUser.append(review['user_id'])


def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            #find one instance from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})

            #if there's none, then just write the rating as 0
            if x is None :
                f.write('0|')
            #if found, write the star value
            else:
                f.write((str(x['stars'])+"|"))
        print()
        f.write('\n')


def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser) 


arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)

Can anyone suggest a more efficient way to build the rating matrix?

There are several general approaches you can take to speed this up.

  1. Use sets or dictionaries to establish the unique set of businesses and the unique set of users; set/dict lookups are much faster than list searches.
  2. Process the Yelp file one entry at a time, in a single pass.
  3. Use something like numpy or pandas to build your matrix.

Something like this:


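import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = []     # (row, column, stars) tuples; a list, because {} would create a dict, which has no append

for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))

# Start from an all-zero matrix, then fill in the known ratings
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]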
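
If the goal is the pipe-delimited text layout shown in the question rather than a numpy array, the same single-pass idea can be sketched with plain dictionaries. This is only an illustration: build_rating_rows and the sample entries are hypothetical, and the input is assumed to be any iterable of dicts carrying user_id, business_id and stars.

```python
def build_rating_rows(entries):
    """Return one 'stars|stars|...|' string per user, in first-seen order."""
    businesses = {}  # business_id -> column index
    by_user = {}     # user_id -> {column index: stars}

    # Single pass over the reviews: no per-cell database lookups.
    for entry in entries:
        col = businesses.setdefault(entry["business_id"], len(businesses))
        by_user.setdefault(entry["user_id"], {})[col] = entry["stars"]

    n_cols = len(businesses)
    rows = []
    for user_ratings in by_user.values():
        # Missing (user, business) pairs are written as 0, as in the question.
        cells = [str(user_ratings.get(col, 0)) for col in range(n_cols)]
        rows.append("|".join(cells) + "|")
    return rows


# Hypothetical sample data, not taken from the Yelp dataset.
sample_entries = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
    {"user_id": "u3", "business_id": "b2", "stars": 4},
]
rating_rows = build_rating_rows(sample_entries)
```

Each string can then be written out as one line of the output file. Rows and columns appear in first-seen order, which relies on dicts preserving insertion order (Python 3.7+).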

I modified @sirlark's code to fit my needs. With ratings declared as {}, I could not call append on it or iterate over it with for r in ratings, so I changed the code to key each rating by its index:

import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = {}     # running index -> (row, column, stars)

# yelp_entries holds the reviews already queried for the businesses of interest

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))

# Start from an all-zero matrix, then fill in the known ratings
matrix = numpy.tile(0, (len(users), len(businesses)))

for ind in range(len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
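
The code above assumes yelp_entries already holds the reviews for the chosen city. One possible way to obtain them from MongoDB with two queries in total, instead of one find_one per matrix cell, is sketched below. The function name fetch_city_reviews is hypothetical; the collections are passed in as parameters and are assumed to behave like the colBisnis and colReview collections from the question, with a business's _id matching the review's business_id as the question's code implies.

```python
def fetch_city_reviews(col_business, col_review, city_name):
    """Yield the review documents for every business in one city."""
    # First query: ids of all businesses in the city (only _id is requested).
    business_ids = [
        doc["_id"]
        for doc in col_business.find({"city": city_name}, {"_id": 1})
    ]

    # Second query: all reviews for those businesses at once, restricted
    # to the three fields the rating matrix needs.
    cursor = col_review.find(
        {"business_id": {"$in": business_ids}},
        {"_id": 0, "user_id": 1, "business_id": 1, "stars": 1},
    )
    for review in cursor:
        yield review


# Assumed usage with the collections from the question:
# yelp_entries = fetch_city_reviews(colBisnis, colReview, 'Pointe-Aux-Trembles')
```

Because this is a generator, the result can be consumed only once, which fits the single-pass loop above. An index on business_id in the review collection may help the second query, though whether it does depends on the deployment.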

Later I found that, instead of the tile method, we can also use SciPy's coo_matrix, which is slightly faster than the method above, though the code needs a small change:

from scipy.sparse import coo_matrix

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
row = []
col = []
data = []

for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))

# Build the sparse matrix from (row, col, value) triples, then convert it to a dense array
matrix = coo_matrix((data, (row, col))).toarray()
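
Two details of coo_matrix may matter here. Without an explicit shape, the dimensions are inferred from the largest indices, and when the matrix is converted (for example with toarray or tocsr), repeated (row, col) pairs are summed, so a user who reviewed the same business twice would end up with the total of both ratings. Below is a hedged sketch that keeps only the last rating per pair and leaves the result sparse. The sample entries are made up, and the keep-the-last policy is an assumption rather than something from the original posts.

```python
from scipy.sparse import coo_matrix

# Hypothetical sample; real entries would come from the review collection.
sample_reviews = [
    {"user_id": "u1", "business_id": "b1", "stars": 4},
    {"user_id": "u2", "business_id": "b1", "stars": 5},
    {"user_id": "u2", "business_id": "b2", "stars": 2},
    {"user_id": "u2", "business_id": "b2", "stars": 3},  # repeat review
]

user_index = {}      # user_id -> row index
business_index = {}  # business_id -> column index
cells = {}           # (row, col) -> stars; a later review overwrites an earlier one

for entry in sample_reviews:
    r = user_index.setdefault(entry["user_id"], len(user_index))
    c = business_index.setdefault(entry["business_id"], len(business_index))
    cells[(r, c)] = int(entry["stars"])

rows = [rc[0] for rc in cells]
cols = [rc[1] for rc in cells]
stars = list(cells.values())

# Explicit shape, and CSR format so the matrix stays sparse for later row access.
sparse_ratings = coo_matrix(
    (stars, (rows, cols)),
    shape=(len(user_index), len(business_index)),
).tocsr()
```

Staying sparse avoids allocating a cell for every (user, business) pair, most of which are zero in a review dataset; call toarray() only if a dense array is really needed.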

Note: I later found out why I couldn't call .append() or .add() on the ratings variable. The declaration

ratings = {}

creates a dict, not a set. To create an empty set, use this instead:

ratings = set()

A set supports .add() but not .append(); for .append(), ratings needs to be a list (ratings = []).
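
A quick way to see the difference between the three empty containers is sketched below. The variable names are illustrative only.

```python
empty_dict = {}      # braces alone create a dict
empty_set = set()    # an empty set needs the set() constructor
empty_list = []      # a list is the type that provides append

rating = (0, 1, 4)   # (row, column, stars)

empty_list.append(rating)             # lists grow with append
empty_set.add(rating)                 # sets grow with add
empty_dict[len(empty_dict)] = rating  # dicts need a key for every value

# Neither a dict nor a set has an append method.
dict_has_append = hasattr(empty_dict, "append")
set_has_append = hasattr(empty_set, "append")
```

Note that a set silently drops duplicate tuples and does not preserve insertion order, so a list is usually the closer match for collecting ratings one by one.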


 