Implementation of n-grams in Python code for multi-class text classification
I am new to Python and working on multi-class text classification of contract documents in the construction industry. I am having trouble implementing n-grams in my code, which I put together with help from different online sources. I want to implement unigrams, bi-grams and tri-grams in my code. Any help in this regard would be highly appreciated.
I tried bigrams and trigrams in the Tfidf part of my code, but it is not working.
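(For context, a minimal sketch of what unigram/bigram/trigram extraction means on a token list; the `ngrams` helper below is a hypothetical illustration, not part of the code in question:)

```python
def ngrams(tokens, n):
    """Return all n-grams over a token list, each as a space-joined string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the contractor shall submit".split()
print(ngrams(tokens, 1))  # unigrams: ['the', 'contractor', 'shall', 'submit']
print(ngrams(tokens, 2))  # bigrams:  ['the contractor', 'contractor shall', 'shall submit']
print(ngrams(tokens, 3))  # trigrams: ['the contractor shall', 'contractor shall submit']
```

In scikit-learn this is what `ngram_range` controls: `ngram_range=(1, 3)` extracts all three sizes in one pass.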
import pandas as pd
from nltk import tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('projectdataayes.csv')
df = df[pd.notnull(df['types'])]
my_types = ['Requirement', 'Non-Requirement']

# converting to lower case
df['description'] = df.description.map(lambda x: x.lower())

# removing the punctuation
df['description'] = df.description.str.replace(r'[^\w\s]', '')

# splitting the text into tokens
df['description'] = df['description'].apply(tokenize.word_tokenize)

# stemming
stemmer = PorterStemmer()
df['description'] = df['description'].apply(lambda x: [stemmer.stem(y) for y in x])
print(df[:10])

## This converts the list of words into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])

X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)

tfidf_vect_ngram = TfidfVectorizer(analyzer='word',
                                   token_pattern=r'\w{1,}', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(df['description'])
X_train_Tfidf = tfidf_vect_ngram.transform(X_train)
X_test_Tfidf = tfidf_vect_ngram.transform(X_test)

model = MultinomialNB().fit(X_train, y_train)
File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 328, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
File "C:\Users\fhassan\anaconda3\lib\site-packages\scipy\sparse\base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found
First, you fit the vectorizer on text:
tfidf_vect_ngram.fit(df['description'])
but then you try to apply it to counts:
counts = count_vect.fit_transform(df['description'])
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
You need to apply the vectorizer to the text, not to the counts:
X_train, X_test, y_train, y_test = train_test_split(df['description'], df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
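Put together, a minimal end-to-end sketch of the corrected pipeline (the column names follow the question; the tiny inline DataFrame is only a stand-in for the real CSV, and `ngram_range=(1, 3)` is used so unigrams, bigrams and trigrams are all extracted):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# tiny stand-in for the real CSV, just to make the sketch runnable
df = pd.DataFrame({
    'description': ['the contractor shall submit drawings',
                    'the color of the walls is grey',
                    'payment shall be made monthly',
                    'the site office is near the gate'],
    'types': ['Requirement', 'Non-Requirement',
              'Requirement', 'Non-Requirement'],
})

# split the raw text first, then fit the vectorizer on the training text only
X_train, X_test, y_train, y_test = train_test_split(
    df['description'], df['types'], test_size=0.25, random_state=39)

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams in one pass
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                   ngram_range=(1, 3), max_features=5000)
X_train_tfidf = tfidf_vect_ngram.fit_transform(X_train)
X_test_tfidf = tfidf_vect_ngram.transform(X_test)

model = MultinomialNB().fit(X_train_tfidf, y_train)
print(model.predict(X_test_tfidf))
```

Note that fitting on the training text only (rather than on the whole of `df['description']`, as in the question) avoids leaking test-set vocabulary into the model.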