
How to generate a word cloud of Bangla text in Python?

I tried the code below:

!pip install python-bidi
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from bidi.algorithm import get_display

text="""মুস্তাফিজ"""

bidi_text = get_display(text)
print(bidi_text)
# https://github.com/amueller/word_cloud/issues/367
# https://stackoverflow.com/questions/54063438/create-wordcloud-in-python-for-foreign-language-hebrew
# https://www.omicronlab.com/bangla-fonts.html
rgx = r"[\u0980-\u09FF]+"
wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)

#wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Then I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-87-56d899c0de07> in <module>()
     12 # https://www.omicronlab.com/bangla-fonts.html
     13 rgx = r"[\u0980-\u09FF]+"
---> 14 wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)
     15 
     16 #wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)

2 frames
/usr/local/lib/python3.6/dist-packages/wordcloud/wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    381         if len(frequencies) <= 0:
    382             raise ValueError("We need at least 1 word to plot a word cloud, "
--> 383                              "got %d." % len(frequencies))
    384         frequencies = frequencies[:self.max_words]
    385 

ValueError: We need at least 1 word to plot a word cloud, got 0.

This line does not pick up the Bangla words: wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf').generate(bidi_text)

I tried almost all of the Bangla fonts from here: https://www.omicronlab.com/bangla-fonts.html

Nothing works.

You defined rgx but never passed it to WordCloud, so the default regexp was used. While processing the text, the default pattern matched nothing in the Bangla input and returned an empty list, which is why you got "got 0". Passing the rgx variable when creating the WordCloud object will solve your issue.

wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf',regexp=rgx).generate(bidi_text)

Here is the full snippet of the code.

!pip install python-bidi
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from bidi.algorithm import get_display

text="""মুস্তাফিজ"""

bidi_text = get_display(text)
print(bidi_text)
# https://github.com/amueller/word_cloud/issues/367
# https://stackoverflow.com/questions/54063438/create-wordcloud-in-python-for-foreign-language-hebrew
# https://www.omicronlab.com/bangla-fonts.html
rgx = r"[\u0980-\u09FF]+"
wordcloud = WordCloud(font_path='/content/Siyamrupali.ttf',
                      regexp=rgx).generate(bidi_text)

#wordcloud = WordCloud(font_path='/content/FreeSansBold.ttf').generate(bidi_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
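To see why the default tokenizer finds nothing, the two patterns can be compared directly with the re module, without a font or the wordcloud package. This is only a sketch: it assumes the default pattern is r"\w[\w']+", which is what the wordcloud documentation lists for regexp=None; the installed version may differ. In Python's re, \w does not match Bengali vowel signs or the hasanta (they are combining marks), so in this word every consonant is followed by a character the default pattern rejects.

```python
import re

text = "মুস্তাফিজ"

# Assumed default: the pattern the wordcloud docs list when regexp is None.
default_pattern = r"\w[\w']+"
# Pattern from the question: one or more code points in the Bengali block.
bangla_pattern = r"[\u0980-\u09FF]+"

default_tokens = re.findall(default_pattern, text)
bangla_tokens = re.findall(bangla_pattern, text)

print(default_tokens)  # no tokens found with the assumed default
print(bangla_tokens)   # the whole word is kept as a single token
```

The first list is empty, which is the "got 0" case from the traceback; the second contains the full word.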

I have generated a word cloud in Bangla using the following code. You can try it out:

from wordcloud import WordCloud

def generate_Word_cloud(self, author_post, vocabularyWordnumber, img_file, stop_word_root_path):
    # Load the Bangla stop words, one per line.
    stop_word_file = stop_word_root_path + '/stopwords-bn.txt'
    print(stop_word_file)
    with open(stop_word_file, "r", encoding="utf8") as f:
        stop_word = f.read().split("\n")
    print(stop_word)

    # Join all posts into a single string and render the cloud to an image file.
    final_text = " ".join(author_post)
    print(final_text)
    wordcloud = WordCloud(stopwords=stop_word, font_path='/usr/share/fonts/truetype/freefont/kalpurush.ttf',
                          width=600, height=500, max_font_size=300, max_words=vocabularyWordnumber,
                          min_word_length=4, background_color="black").generate(final_text)
    wordcloud.to_file(img_file)
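If the built-in tokenizer still splits Bangla words incorrectly, one option is to count the words yourself and hand the counts to WordCloud. The sketch below is an assumption-laden illustration, not part of the answer above: bangla_frequencies, BANGLA_WORD, and the sample posts are hypothetical names and data, and the Bengali-block pattern is borrowed from the question.

```python
import re
from collections import Counter

# One or more code points from the Bengali Unicode block (pattern from the question).
BANGLA_WORD = re.compile(r"[\u0980-\u09FF]+")

def bangla_frequencies(posts, stop_words):
    """Count Bangla tokens across all posts, skipping stop words."""
    stop = set(stop_words)
    counts = Counter()
    for post in posts:
        for token in BANGLA_WORD.findall(post):
            if token not in stop:
                counts[token] += 1
    return dict(counts)

# Hypothetical sample data for illustration only.
posts = ["আমি বাংলা ভালোবাসি", "আমি বাংলা লিখি"]
freqs = bangla_frequencies(posts, stop_words=["আমি"])
print(freqs)
```

The resulting dictionary can then be passed to WordCloud(font_path=...).generate_from_frequencies(freqs), which skips the library's own text processing.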

I followed this comment and was eventually able to solve the problem on Ubuntu.

Step 1 : !sudo apt-get install libfreetype6-dev libharfbuzz-dev libfribidi-dev gtk-doc-tools

Step 2 : !wget -O raqm-0.7.0.tar.gz https://raw.githubusercontent.com/python-pillow/pillow-depends/master/raqm-0.7.0.tar.gz

Now the raqm-0.7.0.tar.gz file should be in your current working directory.

Step 3 : !tar -xzvf raqm-0.7.0.tar.gz

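Step 4 : %cd raqm-0.7.0 (in a notebook, !cd runs in a throwaway subshell, so use the %cd magic to make the directory change persist; in a plain terminal, use cd raqm-0.7.0)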

Step 5 : !./configure --prefix=/usr && make -j4 && sudo make -j4 install

Step 6 : Now you just have to reinstall the Pillow library. Activate the correct environment, then run the following commands:

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade Pillow

That's it! Now you have a working Pillow library that can render Bengali and other Indic scripts properly in the image.

Also, as suggested by @Farzana Eva in her comment, you need to pass the rgx variable to the WordCloud object.
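To confirm that the reinstalled Pillow actually picked up Raqm, its features module can be queried. This is a small sketch, not part of the original steps: raqm_available is a hypothetical helper name, and it simply reports False when Pillow is not importable in the active environment.

```python
def raqm_available():
    """Return True if the installed Pillow reports Raqm (complex text layout) support."""
    try:
        from PIL import features
    except ImportError:
        # Pillow is not installed in this environment.
        return False
    return bool(features.check("raqm"))

has_raqm = raqm_available()
print("Raqm support:", has_raqm)
```

If this prints False after the steps above, Pillow was probably installed into a different environment or could not find the Raqm library at import time.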
