Topic modeling on Devanagari (Hindi) text using Python

Can anyone help me deal with this decoding problem in Python? I got the output below from topic modeling of Hindi text in Python, and I am not able to decode it to get the output in Devanagari (Hindi) script.

[(0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
 (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
 (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
 (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
 (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')]

The strings such as "\u0915" that are embedded in your data are Unicode escape sequences for Devanagari glyphs. These escape sequences are used to maximise the portability of the data. Python 2 displays them whenever you print a container such as a list, because each element is shown using its repr; the strings themselves already hold the Devanagari characters.
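As a small illustration of that point, here is a sketch written for Python 3 (an assumption; the code in this answer targets Python 2). It shows that an escape sequence and its glyph are the same character and that only the display form differs. The last lines cover the separate case where the text really contains backslash sequences, for example after being read from a file; the variable names are illustrative only.

```python
# Python 3 sketch: an escape sequence and its glyph are the same character.
s = "\u0915"          # escape sequence written in source code
assert s == "क"       # identical to the literal glyph
assert len(s) == 1    # one code point, not six characters

print(ascii(s))       # escaped display form
print(s)              # the glyph itself (needs a UTF-8 capable terminal)

# If the text contains literal backslash-u sequences, decode them with the
# unicode_escape codec. The raw string below keeps the backslashes as-is.
raw = r'0.573*"\u0915" + 0.360*"\u0930"'
decoded = raw.encode("ascii").decode("unicode_escape")
```

In the question's case no decoding step is needed, since the escapes only appear in the repr of the list; printing each string on its own is enough.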

Here's some Python 2 code that uses a regular expression pattern to extract the numbers and glyphs from that data.

import re

data = [
    (0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
    (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
    (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
    (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
    (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')
]

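# Match a weight, then *, then a quoted term, then an optional " + " separator
pat = re.compile(r'(.*?)\*"(.*?)"\s*\+?\s*')

for i, row in data:
    print "\nRow", i
    # Convert each (weight, term) match into a (float, unicode) pair
    t = [(float(w), s) for w, s in pat.findall(row)]
    for w, s in t:
        print w, s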

Output

Row 0
0.573 क
0.36 र
0.304 म
0.27 न
0.246 स
0.217 ल
0.189 द
0.189 त
0.184 ह
0.182 य

Row 1
-0.485 म
0.381 त
-0.359 ट
0.307 व
0.26 ब
0.229 द
0.202 ह
-0.147 स
0.133 दर
-0.126 प

Row 2
-0.378 स
-0.343 ल
-0.295 व
0.276 र
0.272 क
0.268 द
-0.253 ह
-0.192 ट
-0.163 दर
-0.148 ज

Row 3
-0.508 र
0.392 त
-0.323 स
0.296 म
0.179 ह
0.178 च
0.169 य
-0.166 ज
-0.133 ए
-0.125 प

Row 4
0.514 स
-0.308 ग
-0.28 ज
-0.256 र
0.229 ह
-0.227 य
0.208 क
-0.201 न
-0.175 ल
0.173 द

To get this output, you should set your terminal to use UTF-8 encoding.
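For readers on Python 3, the same extraction might look like the sketch below. It is an adaptation rather than part of the original answer: `print` becomes a function, the `u` prefix is dropped, and the helper name `parse_topic` is hypothetical. The data is shortened to two partial rows taken from the listing above.

```python
import re

# Two shortened rows in the same format as the data above.
data = [
    (0, '0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e"'),
    (1, '-0.485*"\u092e" + 0.381*"\u0924" + 0.133*"\u0926\u0930"'),
]

# Same pattern as the Python 2 code: weight, *, quoted term, optional " + ".
pat = re.compile(r'(.*?)\*"(.*?)"\s*\+?\s*')

def parse_topic(row):
    """Return a list of (weight, term) pairs from one topic string."""
    return [(float(w), s) for w, s in pat.findall(row)]

for i, row in data:
    print("\nRow", i)
    for w, s in parse_topic(row):
        print(w, s)
```

Returning the pairs from a function, instead of only printing them, keeps the weights and terms available for sorting or filtering later.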


FWIW, here's your data in a more user-friendly form. To use it, you need to tell your editor to save your script with UTF-8 encoding, and you must have a valid UTF-8 encoding declaration at the start of the script, e.g.

# -*- coding: utf-8 -*- 

data = [
    (0, u'0.573*"क" + 0.360*"र" + 0.304*"म" + 0.270*"न" + 0.246*"स" + 0.217*"ल" + 0.189*"द" + 0.189*"त" + 0.184*"ह" + 0.182*"य"'),
    (1, u'-0.485*"म" + 0.381*"त" + -0.359*"ट" + 0.307*"व" + 0.260*"ब" + 0.229*"द" + 0.202*"ह" + -0.147*"स" + 0.133*"दर" + -0.126*"प"'),
    (2, u'-0.378*"स" + -0.343*"ल" + -0.295*"व" + 0.276*"र" + 0.272*"क" + 0.268*"द" + -0.253*"ह" + -0.192*"ट" + -0.163*"दर" + -0.148*"ज"'),
    (3, u'-0.508*"र" + 0.392*"त" + -0.323*"स" + 0.296*"म" + 0.179*"ह" + 0.178*"च" + 0.169*"य" + -0.166*"ज" + -0.133*"ए" + -0.125*"प"'),
    (4, u'0.514*"स" + -0.308*"ग" + -0.280*"ज" + -0.256*"र" + 0.229*"ह" + -0.227*"य" + 0.208*"क" + -0.201*"न" + -0.175*"ल" + 0.173*"द"')
]
