Error in Python: index 0 is out of bounds for axis 0 with size 0

Question

Today I wanted to learn how to code content-based filtering in Python, so I searched for some code and applied it. I have a simple hotel dataset containing the name, address, and description of each hotel. When I ran the code, it failed at the end with the error "index 0 is out of bounds for axis 0 with size 0". Here's the code:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
import random

data = pd.read_csv('hotel.csv')
data.head()

the output:

   nama                                 alamat                                              deskripsi
0  Capital O 253 Topas Galeria Hotel  Jl. Dr. Djundjunan No. 153, 40173 Bandung, Ind...  Berjarak 10 menit berkendara dari Bandara Inte...
1  Sheraton Bandung Hotel & Towers    Jl. Ir H Juanda 390, 40135 Bandung, Indonesia       Sheraton Hotel & Towers menawarkan akomodasi b...
2  OYO 794 Ln 9 Bandung Residence     Jalan Lemahnendeut No 9, Sukajadi, 40164 Bandu...   Berlokasi nyaman di Sukajadi, Bandung, OYO 794...
3  OYO 226 LJ hotel                   Jl. Malabar No.2, Malabar, Lengkong, Dago, Asi...   OYO 226 LJ hotel di Bandung, Jawa Barat, tepat...
4  OYO 230 Maleo Residence            JI. Dangeur Indah II No. 15, Sukagalih, Sukaja...   OYO 230 Maleo Residence menawarkan akomodasi b...

data.describe()
data.info()

the output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   nama       90 non-null     object
 1   alamat     90 non-null     object
 2   deskripsi  90 non-null     object
dtypes: object(3)
memory usage: 2.2+ KB

clean_spcl = re.compile('[/(){}\[\]\|@,;]')
clean_symbol = re.compile('[^0-9a-z #+_]')
stopworda = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() 
    text = clean_spcl.sub(' ', text)
    text = clean_symbol.sub('', text)
    text = ' '.join(word for word in text.split() if word not in stopworda) # remove stopwords from the description column
    return text
  
data['deskripsi_new'] = data['deskripsi'].apply(clean_text)

def pt_desc(index):
    example = data[data.index == index][['deskripsi_new', 'nama', 'alamat']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Nama:', example[1])
        print('Alamat:', example[2])   

data.set_index('nama', inplace=True)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(data['deskripsi_new'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim

the output:

array([[1.        , 0.07106689, 0.03075961, ..., 0.07474134, 0.0732575 ,
        0.01680878],
       [0.07106689, 1.        , 0.03508807, ..., 0.05947269, 0.08705608,
        0.01986701],
       [0.03075961, 0.03508807, 1.        , ..., 0.09113962, 0.05879732,
        0.06808138],
       ...,
       [0.07474134, 0.05947269, 0.09113962, ..., 1.        , 0.06321301,
        0.02205802],
       [0.0732575 , 0.08705608, 0.05879732, ..., 0.06321301, 1.        ,
        0.02245328],
       [0.01680878, 0.01986701, 0.06808138, ..., 0.02205802, 0.02245328,
        1.        ]])

indices = pd.Series(data.index)
indices[:50]

def rekomendasi(nama, cos_sim = cos_sim):
    
    rec = []
    
    idx = indices[indices == nama].index[0]

    score_series = pd.Series(cos_sim[idx]).sort_values(ascending = False)

    top_10_indexes = list(score_series.iloc[1:11].index)
    
    for i in top_10_indexes:
        recommended_news.append(list(data.index)[i])
        
    return rec

rekomendasi('Hotel') # when I reach this line, the error is 'index 0 is out of bounds for axis 0 with size 0'

What went wrong here?

Answer 1

From what I understand, you are trying to build a kind of search engine that, given a search query, returns the 10 best-matching results.

If this is the case, you'll need to modify your rekomendasi function so that it will:

process the input query vector
compute the similarity scores with the corpus (the corpus here means the list of your hotel descriptions)
return the 10 items with the highest similarity scores

I've modified your code to do that:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
import random

data = pd.read_csv('../../../../Downloads/hotel.csv')

clean_spcl = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
clean_symbol = re.compile('[^0-9a-z #+_]')
stopworda = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() 
    text = clean_spcl.sub(' ', text)
    text = clean_symbol.sub('', text)
    text = ' '.join(word for word in text.split() if word not in stopworda) # remove stopwords from the description column
    return text
  
data['deskripsi_new'] = data['deskripsi'].apply(clean_text)

def pt_desc(index):
    example = data[data.index == index][['deskripsi_new', 'nama', 'alamat']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Nama:', example[1])
        print('Alamat:', example[2])   

data.set_index('nama', inplace=True)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(data['deskripsi_new'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cos_sim

def rekomendasi(nama, cos_sim=cos_sim):
    
    # First preprocess the given query text (i.e. nama) and transform it into a TF-IDF vector
    nama = clean_text(nama)
    nama_vector = tf.transform([nama])
    
    # Next we compute similarity scores between the query text (nama) and the corpus (tfidf_matrix)
    similarity_scores = cosine_similarity(nama_vector, tfidf_matrix).squeeze()
    top_10_indices = similarity_scores.argsort()[-10:][::-1]
    
    rec = data.index[top_10_indices].tolist()
    return rec
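
The ranking line in rekomendasi relies on argsort returning positions in ascending order of score. Here is a small sketch of that step on its own, using made-up scores; the helper name top_k is hypothetical and not part of the code above:

```python
import numpy as np

# Made-up similarity scores for five documents (illustrative only).
scores = np.array([0.10, 0.80, 0.05, 0.40, 0.30])

def top_k(scores, k):
    """Return the positions of the k largest scores, highest first."""
    # argsort sorts ascending, so the last k positions hold the largest
    # scores; reversing them puts the best match first.
    return scores.argsort()[-k:][::-1]

top_3 = top_k(scores, 3)
print(top_3.tolist())  # [1, 3, 4]
```

Indexing data.index with such an array of positions is what turns the ranked positions back into hotel names.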

Example:

rekomendasi('Hotel')

['The Trans Luxury Hotel Bandung', 'M Premiere Hotel Dago Bandung', 'Mutiara Hotel', 'éL Hotel Royale Bandung', 'The Jayakarta Suites Bandung, Hotel & Spa', 'Hotel Cemerlang', "OYO 167 Dago's Hill Hotel", "OYO 167 Dago's Hill Hotel", 'HARRIS Hotel & Conventions Ciumbuleuit – Bandung', 'Padma Hotel Bandung']
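
As for the original error: it most likely comes from the exact-match lookup indices[indices == nama].index[0]. If no hotel is named exactly 'Hotel', the filtered Series is empty and asking for its first index label fails. Below is a minimal sketch of that situation, using three hotel names as a stand-in for the real 90-row dataset; find_position is a hypothetical helper, not part of the original code:

```python
import pandas as pd

# Stand-in for the hotel names (the real dataset has 90 rows).
indices = pd.Series(['Mutiara Hotel', 'Hotel Cemerlang', 'Padma Hotel Bandung'])

def find_position(nama, indices=indices):
    """Return the row position of an exact name match, or None if absent."""
    matches = indices[indices == nama]
    if matches.empty:
        # Guard: no exact match, so there is no first element to take.
        return None
    return matches.index[0]

print(find_position('Hotel Cemerlang'))  # exact match exists
print(find_position('Hotel'))            # no exact match, returns None

# Without the guard, the same lookup raises IndexError on the empty result.
try:
    indices[indices == 'Hotel'].index[0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)
```

Note also that the original rekomendasi appends to recommended_news, which is never defined, while it returns rec; once the lookup succeeds, that line would raise a NameError unless it is changed to rec.append(...).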

Solution 1
0 accepted 2022-11-13 16:14:09
