Error in Python: index 0 is out of bounds for axis 0 with size 0
Today I wanted to learn how to code content-based filtering in Python, so I searched for some code and applied it. I have a simple hotel dataset containing the name, address, and description of each hotel. When I ran the code, it failed at the very end with 'index 0 is out of bounds for axis 0 with size 0'. Here's the code:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
import random
data = pd.read_csv('hotel.csv')
data.head()
The output:
   nama                               alamat                                             deskripsi
0  Capital O 253 Topas Galeria Hotel  Jl. Dr. Djundjunan No. 153, 40173 Bandung, Ind...  Berjarak 10 menit berkendara dari Bandara Inte...
1  Sheraton Bandung Hotel & Towers    Jl. Ir H Juanda 390, 40135 Bandung, Indonesia      Sheraton Hotel & Towers menawarkan akomodasi b...
2  OYO 794 Ln 9 Bandung Residence     Jalan Lemahnendeut No 9, Sukajadi, 40164 Bandu...  Berlokasi nyaman di Sukajadi, Bandung, OYO 794...
3  OYO 226 LJ hotel                   Jl. Malabar No.2, Malabar, Lengkong, Dago, Asi...  OYO 226 LJ hotel di Bandung, Jawa Barat, tepat...
4  OYO 230 Maleo Residence            JI. Dangeur Indah II No. 15, Sukagalih, Sukaja...  OYO 230 Maleo Residence menawarkan akomodasi b...
data.describe()
data.info()
The output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nama 90 non-null object
1 alamat 90 non-null object
2 deskripsi 90 non-null object
dtypes: object(3)
memory usage: 2.2+ KB
clean_spcl = re.compile('[/(){}\[\]\|@,;]')
clean_symbol = re.compile('[^0-9a-z #+_]')
stopworda = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()
    text = clean_spcl.sub(' ', text)
    text = clean_symbol.sub('', text)
    text = ' '.join(word for word in text.split() if word not in stopworda)  # remove stopwords from the description column
    return text
data['deskripsi_new'] = data['deskripsi'].apply(clean_text)
def pt_desc(index):
    example = data[data.index == index][['deskripsi_new', 'nama', 'alamat']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Nama:', example[1])
        print('Alamat:', example[2])
data.set_index('nama', inplace=True)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(data['deskripsi_new'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim
The output:
array([[1. , 0.07106689, 0.03075961, ..., 0.07474134, 0.0732575 ,
0.01680878],
[0.07106689, 1. , 0.03508807, ..., 0.05947269, 0.08705608,
0.01986701],
[0.03075961, 0.03508807, 1. , ..., 0.09113962, 0.05879732,
0.06808138],
...,
[0.07474134, 0.05947269, 0.09113962, ..., 1. , 0.06321301,
0.02205802],
[0.0732575 , 0.08705608, 0.05879732, ..., 0.06321301, 1. ,
0.02245328],
[0.01680878, 0.01986701, 0.06808138, ..., 0.02205802, 0.02245328,
1. ]])
indices = pd.Series(data.index)
indices[:50]
def rekomendasi(nama, cos_sim = cos_sim):
    rec = []
    idx = indices[indices == nama].index[0]
    score_series = pd.Series(cos_sim[idx]).sort_values(ascending = False)
    top_10_indexes = list(score_series.iloc[1:11].index)
    for i in top_10_indexes:
        recommended_news.append(list(data.index)[i])
    return rec
rekomendasi('Hotel')  # this call raises 'index 0 is out of bounds for axis 0 with size 0'
What went wrong here?
From what I understand, you are trying to build a kind of search engine that, given a search query, returns the 10 best-matching results.
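For context on the error itself: the original rekomendasi looks the name up with an exact equality test, indices[indices == nama]. The error message implies that no row is named exactly 'Hotel', so the selection is empty and .index[0] has nothing to return. A minimal sketch of that failure, assuming a made-up three-name Series in place of the real nama column:

```python
import pandas as pd

# Hypothetical stand-in for the 'nama' column; not the real hotel.csv
indices = pd.Series(["Mutiara Hotel", "Hotel Cemerlang", "Padma Hotel Bandung"])

# Exact equality: no entry is literally 'Hotel', so the mask selects nothing
matches = indices[indices == "Hotel"]
print(len(matches))  # 0

raised = False
try:
    idx = matches.index[0]  # same lookup as in the original function
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```

Passing a full, exact name (for example 'Mutiara Hotel' in this toy Series) would make the lookup succeed; a free-text query like 'Hotel' needs a different approach.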
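If a search engine is what you want, you'll need to modify your rekomendasi function so that it cleans the query text, transforms it into a TF-IDF vector, computes its similarity to every description in the corpus, and returns the 10 best matches.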
I've modified your code to do that:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
import random
data = pd.read_csv('../../../../Downloads/hotel.csv')
clean_spcl = re.compile('[/(){}\[\]\|@,;]')
clean_symbol = re.compile('[^0-9a-z #+_]')
stopworda = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()
    text = clean_spcl.sub(' ', text)
    text = clean_symbol.sub('', text)
    text = ' '.join(word for word in text.split() if word not in stopworda)  # remove stopwords from the description column
    return text
data['deskripsi_new'] = data['deskripsi'].apply(clean_text)
def pt_desc(index):
    example = data[data.index == index][['deskripsi_new', 'nama', 'alamat']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Nama:', example[1])
        print('Alamat:', example[2])
data.set_index('nama', inplace=True)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(data['deskripsi_new'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim
def rekomendasi(nama, cos_sim=cos_sim):
    # First preprocess the query text (i.e. nama) and transform it into a TF-IDF vector
    nama = clean_text(nama)
    nama_vector = tf.transform([nama])
    # Next compute similarity scores between the query text (nama) and the corpus (tfidf_matrix)
    similarity_scores = cosine_similarity(nama_vector, tfidf_matrix).squeeze()
    # Take the indices of the 10 highest scores, best match first
    top_10_indices = similarity_scores.argsort()[-10:][::-1]
    rec = data.index[top_10_indices].tolist()
    return rec
Example:
rekomendasi('Hotel')
['The Trans Luxury Hotel Bandung', 'M Premiere Hotel Dago Bandung', 'Mutiara Hotel', 'éL Hotel Royale Bandung', 'The Jayakarta Suites Bandung, Hotel & Spa', 'Hotel Cemerlang', "OYO 167 Dago's Hill Hotel", "OYO 167 Dago's Hill Hotel", 'HARRIS Hotel & Conventions Ciumbuleuit – Bandung', 'Padma Hotel Bandung']
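If the goal was instead the original one, recommending hotels similar to a hotel that is already in the dataset, the exact-name lookup can stay as long as it is guarded against names that do not exist. The following is only a sketch under assumptions: the four-row toy DataFrame, its descriptions, the function name rekomendasi_by_name, and the parameter k are all hypothetical and stand in for hotel.csv and the original rekomendasi.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for hotel.csv (already "cleaned" text)
toy = pd.DataFrame({
    "nama": ["Mutiara Hotel", "Hotel Cemerlang", "Padma Hotel Bandung", "Maleo Residence"],
    "deskripsi_new": [
        "city centre hotel near train station",
        "budget hotel near train station",
        "hillside resort with pool and spa",
        "quiet residence with pool",
    ],
})

toy_matrix = TfidfVectorizer().fit_transform(toy["deskripsi_new"])
toy_sim = cosine_similarity(toy_matrix)  # item-to-item similarity matrix
names = pd.Series(toy["nama"])


def rekomendasi_by_name(nama, k=2):
    """Return up to k hotels most similar to the hotel named exactly `nama`."""
    matches = names[names == nama]
    if matches.empty:
        # Unknown name: return nothing instead of raising IndexError
        return []
    idx = matches.index[0]
    # Drop the hotel itself, then rank the rest by similarity
    scores = pd.Series(toy_sim[idx]).drop(idx).sort_values(ascending=False)
    return names.iloc[scores.index[:k].tolist()].tolist()


print(rekomendasi_by_name("Hotel"))          # [] because no hotel has exactly this name
print(rekomendasi_by_name("Mutiara Hotel"))  # 'Hotel Cemerlang' ranks first in this toy data
```

Returning an empty list is one possible design choice; raising a ValueError with a clearer message would work equally well if a missing name should be treated as a caller error.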