Topic modeling on the Devanagari (Hindi) text using Python

Can anyone help me with this decoding problem in Python? I got the following output from topic modeling of Hindi text in Python, and I am not able to decode it to display the output in Devanagari (Hindi) script.

[(0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
 (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
 (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
 (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
 (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')]

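The strings such as "\u0915" that are embedded in your data are Unicode escape sequences for Devanagari characters. These escape sequences are used to maximise the portability of the data: when Python 2 prints a list, it shows the repr of each string, and that repr escapes every non-ASCII character. The strings themselves already contain the Devanagari text, so nothing needs to be decoded; you only need to print the strings rather than the list that contains them.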
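As a small illustration of that point, here is a Python 3 sketch (the variable names are mine, not from the question) showing that an escape sequence and the character typed directly are the same one-character string.

```python
import unicodedata

escaped = "\u0915"   # written as a Unicode escape sequence
literal = "क"        # the same character typed directly

# Both spellings produce the identical one-character string.
print(escaped == literal)
print(len(escaped))

# The official Unicode name of the code point U+0915.
print(unicodedata.name(escaped))

# Printing the string itself shows the character, provided the terminal can display it.
print(escaped)
```

The escape is only a way of writing the character in source code or in a repr; it is not a separate encoding of the data.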

Here's some Python 2 code that uses a Regular Expression pattern to extract the numbers and glyphs from that data.

import re

data = [
    (0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
    (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
    (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
    (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
    (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')
]

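# Each term looks like weight*"token", optionally followed by " + "
pat = re.compile(r'(.*?)\*"(.*?)"\s*\+?\s*')

for i, row in data:
    print "\nRow", i
    # findall returns (weight, token) string pairs; convert each weight to a float
    t = [(float(w), s) for w, s in pat.findall(row)]
    for w, s in t:
        print w, s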

output

Row 0
0.573 क
0.36 र
0.304 म
0.27 न
0.246 स
0.217 ल
0.189 द
0.189 त
0.184 ह
0.182 य

Row 1
-0.485 म
0.381 त
-0.359 ट
0.307 व
0.26 ब
0.229 द
0.202 ह
-0.147 स
0.133 दर
-0.126 प

Row 2
-0.378 स
-0.343 ल
-0.295 व
0.276 र
0.272 क
0.268 द
-0.253 ह
-0.192 ट
-0.163 दर
-0.148 ज

Row 3
-0.508 र
0.392 त
-0.323 स
0.296 म
0.179 ह
0.178 च
0.169 य
-0.166 ज
-0.133 ए
-0.125 प

Row 4
0.514 स
-0.308 ग
-0.28 ज
-0.256 र
0.229 ह
-0.227 य
0.208 क
-0.201 न
-0.175 ल
0.173 द

To get this output, you should set your terminal to use UTF-8 encoding.
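If you are on Python 3, a roughly equivalent sketch could look like the following. The helper name `parse_topic` and the pattern name `TERM` are my own, the data is shortened to two rows, and the pattern is a little stricter than the one above because it assumes every weight is a plain decimal number.

```python
import re

# Two shortened rows in the same format as the data above.
data = [
    (0, '0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e"'),
    (1, '-0.485*"\u092e" + 0.381*"\u0924" + 0.133*"\u0926\u0930"'),
]

# One term looks like: <optionally signed decimal>*"<token>"
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')


def parse_topic(row):
    """Return a list of (weight, token) pairs found in one topic string."""
    return [(float(w), s) for w, s in TERM.findall(row)]


for i, row in data:
    print("\nRow", i)
    for w, s in parse_topic(row):
        print(w, s)
```

In Python 3 every str is already Unicode, so the u prefix is optional and print is a function; the terminal still has to be able to display Devanagari.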


FWIW, here's your data in a more user-friendly form. To use it in Python 2, you need to tell your editor to save your script with UTF-8 encoding, and you must have a valid UTF-8 encoding declaration at the start of the script, e.g.

# -*- coding: utf-8 -*- 

data = [
    (0, u'0.573*"क" + 0.360*"र" + 0.304*"म" + 0.270*"न" + 0.246*"स" + 0.217*"ल" + 0.189*"द" + 0.189*"त" + 0.184*"ह" + 0.182*"य"'),
    (1, u'-0.485*"म" + 0.381*"त" + -0.359*"ट" + 0.307*"व" + 0.260*"ब" + 0.229*"द" + 0.202*"ह" + -0.147*"स" + 0.133*"दर" + -0.126*"प"'),
    (2, u'-0.378*"स" + -0.343*"ल" + -0.295*"व" + 0.276*"र" + 0.272*"क" + 0.268*"द" + -0.253*"ह" + -0.192*"ट" + -0.163*"दर" + -0.148*"ज"'),
    (3, u'-0.508*"र" + 0.392*"त" + -0.323*"स" + 0.296*"म" + 0.179*"ह" + 0.178*"च" + 0.169*"य" + -0.166*"ज" + -0.133*"ए" + -0.125*"प"'),
    (4, u'0.514*"स" + -0.308*"ग" + -0.280*"ज" + -0.256*"र" + 0.229*"ह" + -0.227*"य" + 0.208*"क" + -0.201*"न" + -0.175*"ल" + 0.173*"द"')
]
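One way to produce a listing like the one above is sketched here in Python 3, where repr() leaves printable non-ASCII characters alone and ascii() escapes them the way Python 2's repr does. The variable names are illustrative and the row is shortened.

```python
# One shortened row, written with escape sequences.
row = (0, '0.573*"\u0915" + 0.360*"\u0930"')

# Python 3's repr keeps printable non-ASCII characters as they are.
readable = repr(row)

# ascii() escapes every non-ASCII character, like Python 2's repr.
escaped = ascii(row)

print(readable)
print(escaped)
```

Both strings describe the same tuple; only the spelling of the Devanagari characters differs.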
