In Python 3 with pandas, I have this program that builds a word cloud from a column:
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
autores_atuais = pd.read_csv("deputados_autores_projetos.csv", sep=',',encoding = 'utf-8', converters={'IdAutor': lambda x: str(x), 'IdDocumento': lambda x: str(x), 'CodOriginalidade': lambda x: str(x), 'IdNatureza': lambda x: str(x), 'NroLegislativo': lambda x: str(x)})
autores_atuais.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6632 entries, 74057 to 84859
Data columns (total 10 columns):
IdAutor 6632 non-null object
IdDocumento 6632 non-null object
NomeAutor 6632 non-null object
AnoLegislativo 6632 non-null object
CodOriginalidade 5295 non-null object
DtEntradaSistema 6632 non-null object
DtPublicacao 6632 non-null object
Ementa 6632 non-null object
IdNatureza 6632 non-null object
NroLegislativo 6632 non-null object
dtypes: object(10)
memory usage: 569.9+ KB
wordcloud = WordCloud().generate(' '.join(autores_atuais['Ementa']))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
How can I exclude some words from the cloud? For example, short words ("de", "ao") and specific words ("Estado").
To drop short words (say, 2 characters or fewer), keep only the rows whose length is greater than 2:
autores_atuais = autores_atuais[autores_atuais.Ementa.str.len() > 2]
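Note that this filters whole rows, so it assumes each row of Ementa holds a single word. Under the same assumption, to drop words that appear in a list (say restricted = ['Estado']), you can use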
autores_atuais = autores_atuais[~autores_atuais.Ementa.isin(restricted)]
I think you need boolean indexing: use isin to match a list of words and str.len to filter by word length, chain the two conditions with | (OR), and invert the combined mask with ~:
autores_atuais = pd.DataFrame({'Ementa':['Estado','another','be','de','def','bax']})
print (autores_atuais)
Ementa
0 Estado
1 another
2 be
3 de
4 def
5 bax
m1 = autores_atuais['Ementa'].isin(['Estado','another','next'])
m2 = autores_atuais['Ementa'].str.len() < 3
s = autores_atuais.loc[~(m1 | m2), 'Ementa']
print (s)
4 def
5 bax
Name: Ementa, dtype: object
A similar alternative uses & (AND), inverting the first condition with ~ and the second by switching the comparison to >=:
m1 = ~autores_atuais['Ementa'].isin(['Estado','another','next'])
m2 = autores_atuais['Ementa'].str.len() >= 3
s = autores_atuais.loc[m1 & m2, 'Ementa']
print (s)
4 def
5 bax
Name: Ementa, dtype: object
wordcloud = WordCloud().generate(' '.join(s))
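The example above has one word per row. If each row of Ementa holds a whole sentence instead, the same two masks can be applied after splitting the text into one word per row. This is only a sketch: the sample sentences are made up, restricted and min_len are illustrative names, and Series.explode needs pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample data: each row holds a full sentence.
autores_atuais = pd.DataFrame({
    'Ementa': ['Institui o Dia do Estado de Direito',
               'Altera a lei de incentivo ao esporte']
})

restricted = ['Estado']  # words to drop (exact, case-sensitive match)
min_len = 3              # keep words with at least this many characters

# Split each text on whitespace, then put one word per row.
words = autores_atuais['Ementa'].str.split().explode()

m1 = ~words.isin(restricted)      # not in the restricted list
m2 = words.str.len() >= min_len   # long enough
s = words[m1 & m2]

text = ' '.join(s)
print(text)
```

The resulting text can then be passed to WordCloud().generate(text) as before. Punctuation stays attached to the words here, so "Estado," would not match "Estado"; stripping punctuation first would be needed for real data.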
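I think you're using amueller's wordcloud module? If so, there is a stopwords parameter that lets you specify the words to exclude. As far as I know it expects a set of strings rather than a file name, so a word list kept in a file has to be read into a set first.
So, for example, create a text file called stopwords.txt, save it in the same folder as your csv file, and put this in it:
de
ao
Estado
And then change to:
# Read the whitespace-separated words from the file into a set
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(f.read().split())
wordcloud = WordCloud(stopwords=stopwords).generate(' '.join(autores_atuais['Ementa']))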
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
These words should then be excluded. The default list of excluded words is stored in the module folder, in a file that should be called stopwords (the built-in list is English, so it will not cover Portuguese words such as "de" or "ao"). If you run into the same issue frequently, it may help to modify that default file.
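As an alternative to editing the file inside the installed package (which would be overwritten on upgrade), a reusable word list can be loaded and merged with the defaults at run time. This is a minimal sketch, assuming the stopwords parameter takes a set of strings; load_stopwords is a hypothetical helper, not part of wordcloud, and the temporary file only stands in for a real stopwords.txt.

```python
import os
import tempfile


def load_stopwords(path):
    """Hypothetical helper: read one stopword per line into a lowercase set."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}


# Demo: a temporary file stands in for stopwords.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('de\nao\nEstado\n')
    custom = load_stopwords(path)

print(sorted(custom))

# With wordcloud installed, the custom set could be merged with the defaults:
# from wordcloud import WordCloud, STOPWORDS
# wordcloud = WordCloud(stopwords=STOPWORDS | custom).generate(text)
```

Lowercasing the entries is a design choice made here so the list is case-insensitive on its own side; the text passed to generate() is left untouched.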