Implementation of n-grams in Python code for multi-class text classification
I am new to Python and working on multi-class text classification of contract documents in the construction industry. I am having trouble implementing n-grams in my code, which I put together with help from different online sources. I want to implement unigrams, bi-grams and tri-grams in my code. Any help in this regard would be highly appreciated.
I tried bigrams and trigrams in the Tfidf part of my code, but it is not working.
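(For context, a minimal sketch of what unigram/bigram/trigram extraction means on a token list; the `ngrams` helper below is a hypothetical illustration, not part of the code in question:)

```python
def ngrams(tokens, n):
    """Return all n-grams over a token list, each as a space-joined string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the contractor shall submit".split()
print(ngrams(tokens, 1))  # unigrams: ['the', 'contractor', 'shall', 'submit']
print(ngrams(tokens, 2))  # bigrams:  ['the contractor', 'contractor shall', 'shall submit']
print(ngrams(tokens, 3))  # trigrams: ['the contractor shall', 'contractor shall submit']
```

In scikit-learn this is what `ngram_range` controls: `ngram_range=(1, 3)` extracts all three sizes in one pass.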
import pandas as pd
from nltk import tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('projectdataayes.csv')
df = df[pd.notnull(df['types'])]
my_types = ['Requirement', 'Non-Requirement']

# converting to lower case
df['description'] = df.description.map(lambda x: x.lower())

# removing the punctuation
df['description'] = df.description.str.replace(r'[^\w\s]', '')

# splitting the text into tokens
df['description'] = df['description'].apply(tokenize.word_tokenize)

# stemming
stemmer = PorterStemmer()
df['description'] = df['description'].apply(lambda x: [stemmer.stem(y) for y in x])
print(df[:10])

## This converts the list of words into space-separated strings
df['description'] = df['description'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['description'])

X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)

tfidf_vect_ngram = TfidfVectorizer(analyzer='word',
                                   token_pattern=r'\w{1,}', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(df['description'])
X_train_Tfidf = tfidf_vect_ngram.transform(X_train)
X_test_Tfidf = tfidf_vect_ngram.transform(X_test)

model = MultinomialNB().fit(X_train, y_train)
File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 328, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\fhassan\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
File "C:\Users\fhassan\anaconda3\lib\site-packages\scipy\sparse\base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found
First, you fit the vectorizer on text:
tfidf_vect_ngram.fit(df['description'])
but then you try to apply it to counts:
counts = count_vect.fit_transform(df['description'])
X_train, X_test, y_train, y_test = train_test_split(counts, df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
You need to apply the vectorizer to the text, not to the counts:
X_train, X_test, y_train, y_test = train_test_split(df['description'], df['types'], test_size=0.3, random_state=39)
tfidf_vect_ngram.transform(X_train)
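Put together, a minimal end-to-end sketch of the corrected pipeline (the column names follow the question; the tiny inline DataFrame is only a stand-in for the real CSV, and `ngram_range=(1, 3)` is used so unigrams, bigrams and trigrams are all extracted):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# tiny stand-in for the real CSV, just to make the sketch runnable
df = pd.DataFrame({
    'description': ['the contractor shall submit drawings',
                    'the color of the walls is grey',
                    'payment shall be made monthly',
                    'the site office is near the gate'],
    'types': ['Requirement', 'Non-Requirement',
              'Requirement', 'Non-Requirement'],
})

# split the raw text first, then fit the vectorizer on the training text only
X_train, X_test, y_train, y_test = train_test_split(
    df['description'], df['types'], test_size=0.25, random_state=39)

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams in one pass
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                   ngram_range=(1, 3), max_features=5000)
X_train_tfidf = tfidf_vect_ngram.fit_transform(X_train)
X_test_tfidf = tfidf_vect_ngram.transform(X_test)

model = MultinomialNB().fit(X_train_tfidf, y_train)
print(model.predict(X_test_tfidf))
```

Note that fitting on the training text only (rather than on the whole of `df['description']`, as in the question) avoids leaking test-set vocabulary into the model.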