Looking for efficient way to build matrix from yelp review dataset in python
I am currently looking for an efficient way to build a rating matrix for a recommendation system in Python.
The matrix should look like this:
4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|
Specifically, the columns are business_id and the rows are user_id:
|bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|
I am currently using this Yelp review dataset, stored in MongoDB:
_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"
My approach is to build lists of unique business_id and user_id values from the review table, and then query the review table again for those values.
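I have included my code here. As you can see, because of the brute-force approach, it takes a long time to build even a small matrix like the one shown above.
Here are some snippets of my code: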
def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    # Append business ids filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        # if the business id is not in the array, then append it to the array
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser
    # find reviews filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        # if the user id is not already in the array, append it to the array
        if review['user_id'] not in arrayUser:
            arrayUser.append(review['user_id'])

def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            # find one instance from the database by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})
            # if there's none, then just write the rating as 0
            if x is None:
                f.write('0|')
            # if found, write the star value
            else:
                f.write(str(x['stars']) + "|")
        print()
        f.write('\n')
    f.close()

def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)

arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)
Can anyone suggest a more efficient way to build this rating matrix?
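There are several general approaches you can take to speed this up. The main ones are to make a single pass over the reviews instead of running one query per user/business pair, and to use dictionaries rather than lists for the uniqueness checks, since a dictionary lookup takes constant time on average while a list search is linear.
Something like this: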
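import numpy

users = {}       # user_id -> row index
businesses = {}  # business_id -> column index
ratings = []     # (row, column, stars) tuples; must be a list, since a dict ({}) has no append()
# yelp_entries is assumed to be any iterable of review documents
for entry in yelp_entries:
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    ratings.append((
        users[entry['user_id']],
        businesses[entry['business_id']],
        entry['stars']
    ))
# zero-filled matrix with one row per user and one column per business
matrix = numpy.tile(0, (len(users), len(businesses)))
for r in ratings:
    matrix[r[0]][r[1]] = r[2]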
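To make the single-pass idea easier to try out, here is a minimal self-contained sketch of the same technique. The function name build_rating_matrix and the sample reviews are made up for illustration; the only assumption about the input is that each entry has user_id, business_id and stars keys, as in the dataset shown in the question.

```python
import numpy as np

def build_rating_matrix(entries):
    """Build a dense user x business matrix in one pass (hypothetical helper)."""
    users = {}       # user_id -> row index
    businesses = {}  # business_id -> column index
    ratings = []     # (row, column, stars) tuples
    for entry in entries:
        # setdefault assigns the next free index the first time an id is seen
        u = users.setdefault(entry["user_id"], len(users))
        b = businesses.setdefault(entry["business_id"], len(businesses))
        ratings.append((u, b, int(entry["stars"])))
    matrix = np.zeros((len(users), len(businesses)), dtype=int)
    for u, b, stars in ratings:
        matrix[u, b] = stars
    return matrix, users, businesses

# Made-up sample data, not taken from the real dataset
sample_reviews = [
    {"user_id": "user-1", "business_id": "bus-1", "stars": 4},
    {"user_id": "user-2", "business_id": "bus-1", "stars": 5},
    {"user_id": "user-2", "business_id": "bus-2", "stars": 2},
    {"user_id": "user-3", "business_id": "bus-3", "stars": 3},
]
matrix, users, businesses = build_rating_matrix(sample_reviews)
print(matrix)
```

The returned users and businesses dictionaries let you map a row or column index back to the original ids, which the matrix alone does not record.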
I adapted @sirlark's code to fit my needs, but for some reason I could not call append on ratings or iterate over it with for r in ratings, so I had to change the code like this:
import numpy

users = {}
businesses = {}
ratings = {}  # dict used as an indexed store: position -> (row, column, stars)
# Query the yelp_entries for all reviews matching business_id and store it in businesses first
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(0, len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
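As a possible refinement, the final Python loop that fills the matrix can be replaced by a single NumPy indexed assignment. This is only a sketch: the ratings list below is hand-written sample data in the same (row, column, stars) layout, not output from the real dataset.

```python
import numpy as np

# Hand-written (row, column, stars) tuples, standing in for the collected ratings
ratings = [(0, 0, 4), (1, 0, 5), (1, 1, 2), (2, 2, 3)]

# Split the tuples into three parallel index/value arrays
rows, cols, stars = (np.array(values) for values in zip(*ratings))

# Size the matrix from the largest indices seen, then fill it in one step
matrix = np.zeros((rows.max() + 1, cols.max() + 1), dtype=int)
matrix[rows, cols] = stars
print(matrix)
```

In the code above you would normally size the matrix with len(users) and len(businesses) rather than the maximum index; the maximum is used here only to keep the example standalone.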
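Later I found that, instead of the tile method, we can also use SciPy's coo_matrix, which is slightly faster than the approach above, but the code needs a small change: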
from scipy.sparse import coo_matrix

users = {}
businesses = {}
row = []   # row (user) index of each rating
col = []   # column (business) index of each rating
data = []  # star value of each rating
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
# Note: duplicate (row, col) pairs are summed when the COO matrix is converted
matrix = coo_matrix((data, (row, col)),
                    shape=(len(users), len(businesses))).toarray()
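One thing to watch with coo_matrix is that repeated (row, col) pairs are added together on conversion, so a user who reviewed the same business twice would end up with the sum of both ratings. Below is a rough sketch of one way around that, which also keeps the result sparse (CSR) instead of calling toarray(). The helper name build_sparse_ratings and the sample data are invented for illustration, and "the last review seen wins" is an assumption about what you want, based on iteration order rather than the date field.

```python
from scipy.sparse import coo_matrix

def build_sparse_ratings(entries):
    """Return a CSR user x business matrix (hypothetical helper)."""
    users = {}
    businesses = {}
    latest = {}  # (row, col) -> stars; a later entry overwrites an earlier one
    for entry in entries:
        r = users.setdefault(entry["user_id"], len(users))
        c = businesses.setdefault(entry["business_id"], len(businesses))
        latest[(r, c)] = int(entry["stars"])
    rows = [r for r, _ in latest]
    cols = [c for _, c in latest]
    data = list(latest.values())
    shape = (len(users), len(businesses))
    return coo_matrix((data, (rows, cols)), shape=shape).tocsr()

# Made-up sample: user-1 reviews bus-1 twice (4 stars, then 2 stars)
sample_reviews = [
    {"user_id": "user-1", "business_id": "bus-1", "stars": 4},
    {"user_id": "user-2", "business_id": "bus-1", "stars": 5},
    {"user_id": "user-2", "business_id": "bus-2", "stars": 3},
    {"user_id": "user-1", "business_id": "bus-1", "stars": 2},
]
sparse = build_sparse_ratings(sample_reviews)
print(sparse.toarray())
```

Keeping the matrix sparse matters mostly for large cities, where most user/business pairs have no review and a dense array would be almost entirely zeros.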
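Note: I later found that the reason I could not .append() or .add() to the ratings variable is that
ratings = {}
creates a dict. To declare a set instead (which supports .add()), you should use the line below; for .append() you need a list, declared with ratings = [].
ratings = set()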
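A quick way to check this for yourself is shown below; the variable names are just for illustration.

```python
as_dict = {}      # empty dict: items are stored by key
as_list = []      # empty list: supports .append()
as_set = set()    # empty set: supports .add()

print(type(as_dict).__name__, type(as_list).__name__, type(as_set).__name__)

rating = (0, 1, 4)  # sample (row, column, stars) tuple
as_list.append(rating)
as_set.add(rating)
as_dict[len(as_dict)] = rating  # the workaround used in the modified code above

# A dict has neither method, which is why the original calls failed
dict_has_append = hasattr(as_dict, "append")
dict_has_add = hasattr(as_dict, "add")
print(dict_has_append, dict_has_add)
```

Note that a set drops duplicates and has no guaranteed order, so a list is usually the better fit for collecting ratings.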