Unable to tokenize sentences using gensim and nltk in python
I have a list of job titles:
> print(data)
>
> 0    Manager
> 1    Electrician
> 3    Carpenter
> 4    Electrician & Carpenter
> ...
I am trying to use gensim to find the most closely related titles.
My code is:
import os
import pandas as pd
import nltk
import gensim
from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize
df = pd.read_csv('df.csv')
corpus = pd.DataFrame(df, columns=['Job Title'])
tokenized_sents = [word_tokenize(i) for i in corpus]
model = gensim.models.Word2Vec(tokenized_sents, min_count=1)
model.most_similar("Electrician")
When I run the tokenization step to tokenize each title as its own sentence (the tokenized_sents variable), it only tokenizes the column header:
> tokenized_sents
> [['Job', 'Title']]
What am I doing wrong?
When you iterate over a pd.DataFrame, it returns its column names:
In [9]: df = pd.DataFrame(np.random.randint(0,10, (10,3)), columns=list('abc'))
In [10]: df
Out[10]:
a b c
0 0 7 3
1 5 0 5
2 7 7 9
3 2 0 0
4 6 9 2
5 8 5 2
6 5 0 2
7 3 2 5
8 4 8 6
9 0 5 1
In [11]: [c for c in df]
Out[11]: ['a', 'b', 'c']
I think what you want is:
[word_tokenize(i) for i in corpus['Job Title']]
since that iterates over the values in the 'Job Title' column:
In [12]: [c for c in df['a']]
Out[12]: [0, 5, 7, 2, 6, 8, 5, 3, 4, 0]
In [13]: [c + 10 for c in df['a']]
Out[13]: [10, 15, 17, 12, 16, 18, 15, 13, 14, 10]
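Putting that fix together with the code from the question, a minimal corrected sketch of the whole pipeline might look like this (assuming the same df.csv file and 'Job Title' column as in the question; word_tokenize also needs nltk's 'punkt' data, via nltk.download('punkt')):

import pandas as pd
import gensim
from nltk.tokenize import word_tokenize

df = pd.read_csv('df.csv')

# Iterate over the column's values, not the DataFrame itself,
# so each job title becomes one tokenized "sentence".
tokenized_sents = [word_tokenize(title) for title in df['Job Title']]

model = gensim.models.Word2Vec(tokenized_sents, min_count=1)

# In older gensim, model.most_similar(...) also works (as in the
# question); model.wv.most_similar is the form that survives in
# gensim 4.x.
print(model.wv.most_similar("Electrician"))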
That said, you could most likely do away with pandas entirely, since gensim tends to work with lazy streams and, as far as I can tell, you are only using pandas to read a CSV file.
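For example, here is a minimal sketch of a pandas-free streaming corpus, assuming a df.csv whose header row contains a 'Job Title' column as in the question:

import csv
import gensim
from nltk.tokenize import word_tokenize

class TitleCorpus:
    """Lazily stream tokenized job titles from a CSV file.

    gensim iterates over the corpus several times (once to build
    the vocabulary, then once per training epoch), so the file is
    reopened on every call to __iter__.
    """
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, newline='') as f:
            for row in csv.DictReader(f):
                # 'Job Title' is the column name assumed from the question.
                yield word_tokenize(row['Job Title'])

model = gensim.models.Word2Vec(TitleCorpus('df.csv'), min_count=1)

This way only one row is held in memory at a time, which is the pattern gensim's training loop is designed around.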