How to ignore some words in a word cloud in Python?

Question

In Python 3 and pandas I have this program to make a word cloud from a column:

import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt

autores_atuais = pd.read_csv("deputados_autores_projetos.csv", sep=',',encoding = 'utf-8', converters={'IdAutor': lambda x: str(x), 'IdDocumento': lambda x: str(x), 'CodOriginalidade': lambda x: str(x), 'IdNatureza': lambda x: str(x), 'NroLegislativo': lambda x: str(x)})

autores_atuais.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6632 entries, 74057 to 84859
Data columns (total 10 columns):
IdAutor             6632 non-null object
IdDocumento         6632 non-null object
NomeAutor           6632 non-null object
AnoLegislativo      6632 non-null object
CodOriginalidade    5295 non-null object
DtEntradaSistema    6632 non-null object
DtPublicacao        6632 non-null object
Ementa              6632 non-null object
IdNatureza          6632 non-null object
NroLegislativo      6632 non-null object
dtypes: object(10)
memory usage: 569.9+ KB


wordcloud = WordCloud().generate(' '.join(autores_atuais['Ementa']))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

How can I ignore some words in the cloud? For example, short words ("de", "ao") and certain specific words ("Estado").

Answer 1

To drop short words (say, 2 characters or fewer), keep only the rows whose value is longer than that:

autores_atuais = autores_atuais[autores_atuais.Ementa.str.len() > 2]

To drop words in a list (say restricted = ['Estado']), you can use

autores_atuais = autores_atuais[~autores_atuais.Ementa.isin(restricted)]
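The two filters above compare whole cell values, so they only help when each Ementa row holds a single word. If the column holds full sentences (an assumption about the data, which the question does not show), a word-level filter may be closer to what is wanted. The following is a minimal sketch using only the standard library; filter_words, sample and min_length are hypothetical names introduced for illustration.

```python
import re

def filter_words(texts, restricted=(), min_length=3):
    """Drop words shorter than min_length and words in restricted (case-insensitive)."""
    blocked = {w.lower() for w in restricted}
    kept = []
    for text in texts:
        # \w+ matches runs of letters/digits, including accented characters
        for word in re.findall(r"\w+", text):
            if len(word) >= min_length and word.lower() not in blocked:
                kept.append(word)
    return " ".join(kept)

# Hypothetical sample standing in for autores_atuais['Ementa']
sample = ["Projeto de lei do Estado", "Altera ao artigo"]
cleaned = filter_words(sample, restricted=["Estado"])
print(cleaned)  # Projeto lei Altera artigo
```

The resulting string could then be passed to WordCloud().generate(...) in place of the joined column.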

Answer 2

I think you need boolean indexing: use isin to filter by a list of words, str.len to filter by word length, chain the conditions with | if necessary, and invert the combined mask with ~:

autores_atuais = pd.DataFrame({'Ementa':['Estado','another','be','de','def','bax']})

print (autores_atuais)
    Ementa
0   Estado
1  another
2       be
3       de
4      def
5      bax

m1 = autores_atuais['Ementa'].isin(['Estado','another','next'])
m2 = autores_atuais['Ementa'].str.len() < 3

s = autores_atuais.loc[~(m1 | m2), 'Ementa']
print (s)
4    def
5    bax
Name: Ementa, dtype: object

A similar alternative chains the conditions with & (AND), inverting the first condition with ~ and writing the second with >=:

m1 = ~autores_atuais['Ementa'].isin(['Estado','another','next'])
m2 = autores_atuais['Ementa'].str.len() >= 3

s = autores_atuais.loc[m1 & m2, 'Ementa']
print (s)
4    def
5    bax
Name: Ementa, dtype: object

wordcloud = WordCloud().generate(' '.join(s))

Answer 3

I think you're using amueller's wordcloud module? If so, there is a stopwords parameter which lets you specify the words to exclude. It expects a set of strings rather than a file name, but you can keep the words in a file and load them from there.

So for example, create a text file called stopwords.txt, save it in the same folder as your csv file, and put one word per line in it:

de
ao
Estado

Then read the file into a set and pass that set to WordCloud:

with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(f.read().split())
wordcloud = WordCloud(stopwords=stopwords).generate(' '.join(autores_atuais['Ementa']))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

These words should then be excluded. The default set of words to be excluded is stored in the module folder, in a file called stopwords (it is exposed in Python as wordcloud.STOPWORDS). If you frequently run into the same issue, it may be helpful to modify this default file.
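As a hedged illustration of combining a custom word list with a default one, the sketch below reads a one-word-per-line file into a lowercased set and merges it with a set of defaults. load_stopwords and its defaults argument are hypothetical helpers, not part of wordcloud, and a temporary file stands in for stopwords.txt. The final call is left as a comment so the sketch runs without the wordcloud package installed.

```python
import tempfile
from pathlib import Path

def load_stopwords(path, defaults=()):
    """Return lowercased stop words from a one-word-per-line file, merged with defaults."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    words = {line.strip().lower() for line in lines if line.strip()}
    return words | {w.lower() for w in defaults}

# Demo: a temporary file stands in for stopwords.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "stopwords.txt"
    demo_path.write_text("de\nao\nEstado\n", encoding="utf-8")
    custom = load_stopwords(demo_path, defaults=["the", "and"])

print(sorted(custom))  # ['and', 'ao', 'de', 'estado', 'the']

# With wordcloud installed, the merged set could be passed like this:
# from wordcloud import WordCloud, STOPWORDS
# wordcloud = WordCloud(stopwords=load_stopwords("stopwords.txt", STOPWORDS))
```

Merging with the defaults matters because passing your own set replaces the built-in list rather than extending it.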

Answer 1
3 2018-05-11 12:49:52

Answer 2
1 accepted 2018-05-11 12:49:30

Answer 3
1 2018-05-11 13:11:21
