简体   繁体   中英

How to generate wordcloud of bangla text in python?

I tried the code below :

!pip install python-bidi
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from bidi.algorithm import get_display

text="""মুস্তাফিজ"""

bidi_text = get_display(text)
print(bidi_text)
# https://github.com/amueller/word_cloud/issues/367
# https://stackoverflow.com/questions/54063438/create-wordcloud-in-python-for-foreign-language-hebrew
# https://www.omicronlab.com/bangla-fonts.html
rgx = r"[\u0980-\u09FF]+"
wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)

#wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

then i get this error :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-87-56d899c0de07> in <module>()
     12 # https://www.omicronlab.com/bangla-fonts.html
     13 rgx = r"[\u0980-\u09FF]+"
---> 14 wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)
     15 
     16 #wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)

2 frames
/usr/local/lib/python3.6/dist-packages/wordcloud/wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    381         if len(frequencies) <= 0:
    382             raise ValueError("We need at least 1 word to plot a word cloud, "
--> 383                              "got %d." % len(frequencies))
    384         frequencies = frequencies[:self.max_words]
    385 

ValueError: We need at least 1 word to plot a word cloud, got 0.

this line is not picking bangla words : wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)

i tried almost all the fonts from here for bangla language : https://www.omicronlab.com/bangla-fonts.html

nothing works

You didn't change regexp with your defined one in the word cloud. While processing the text in the word cloud, it couldn't match the pattern and returned an empty list. Passing rgx variable while creating a word cloud object will solve your issue.

wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf',regexp=rgx).generate(bidi_text)

Here is the full snippet of the code.

!pip install python-bidi
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from bidi.algorithm import get_display

text="""মুস্তাফিজ"""

bidi_text = get_display(text)
print(bidi_text)
# https://github.com/amueller/word_cloud/issues/367
# https://stackoverflow.com/questions/54063438/create-wordcloud-in-python-for-foreign-language-hebrew
# https://www.omicronlab.com/bangla-fonts.html
rgx = r"[\u0980-\u09FF]+"
wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf', 
regexp=rgx).generate(bidi_text)

#wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

I have generated a word cloud in Bangla using the following code. You can try it out:

def generate_Word_cloud(self,author_post, vocabularyWordnumber, img_file, stop_word_root_path):

stop_word_file = stop_word_root_path+'/stopwords-bn.txt'
print(stop_word_file)
f = open(stop_word_file, "r", encoding="utf8")
stop_word = f.read().split("\n")
print(stop_word)

final_text = " ".join(author_post)
print(final_text)
wordcloud = WordCloud(stopwords = stop_word, font_path='/usr/share/fonts/truetype/freefont/kalpurush.ttf',
    width = 600, height = 500,max_font_size=300, max_words=vocabularyWordnumber,
                      min_word_length=4, background_color="black").generate(final_text)
wordcloud.to_file(img_file)

I followed this comment and could solve the problem in Ubuntu eventually.

Step 1 : !sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev gtk-doc-tools

Step 2 : !wget -O raqm-0.7.0.tar.gz https://raw.githubusercontent.com/python-pillow/pillow-depends/master/raqm-0.7.0.tar.gz

Now the raqm-0.7.0.tar.gz file should be in your downloads section.

Step 3 : !tar -xzvf raqm-0.7.0.tar.gz

Step 4 : !cd raqm-0.7.0

Step 5 : !./configure --prefix=/usr && make -j4 && sudo make -j4 install

Step 6 : Now you just have to reinstall the Pillow library. Activate the correct environment. Then run the following commands:

python3 -m pip install --upgrade pip python3 -m pip install --upgrade Pillow

That's it! Now you have a working Pillow library that can produce proper Bengali and other Indic fonts in the image.

Also, as suggested by @Farzana Eva in her comment, you need to pass the rgx variable in the wordcloud object.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM