Topic modeling on the Devanagari (Hindi) text using Python

Question

Can anyone help me with this decoding problem in Python? I got the output below from topic modeling of Hindi text, but I can't work out how to decode it so that it displays in Devanagari (Hindi) script.

[(0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
 (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
 (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
 (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
 (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')]

Answer 1

The strings such as "\u0915" that are embedded in your data are Unicode escape sequences for Devanagari characters. Python 2 uses these escapes when it displays the repr of a unicode string, which keeps the output pure ASCII and so maximises the portability of the data.
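
To make that concrete, here is a minimal sketch showing that the escape and the character are the same string, and that printing a string directly (rather than printing a list that contains it) displays the character. The variable names `term` and `topics` are only for illustration, and the shortened topic string uses the first two terms from the question's row 0.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # lets the same code run on Python 2 and 3

term = u'\u0915'       # one of the escape sequences from the question
print(len(term))       # a single code point, not six characters
print(term == u'क')    # the escape and the literal character compare equal
print(term)            # displays the Devanagari letter on a UTF-8 terminal

# Printing a container shows the repr of its items (escaped on Python 2);
# printing each string on its own shows the actual characters.
topics = [(0, u'0.573*"\u0915" + 0.360*"\u0930"')]
print(topics)
for i, s in topics:
    print(i, s)
```

So no decoding step is needed as such; the data is already correct, and only the way it is displayed differs.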

Here's some Python 2 code that uses a regular expression pattern to extract the weights and characters from that data.

import re

data = [
    (0, u'0.573*"\u0915" + 0.360*"\u0930" + 0.304*"\u092e" + 0.270*"\u0928" + 0.246*"\u0938" + 0.217*"\u0932" + 0.189*"\u0926" + 0.189*"\u0924" + 0.184*"\u0939" + 0.182*"\u092f"'),
    (1, u'-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f" + 0.307*"\u0935" + 0.260*"\u092c" + 0.229*"\u0926" + 0.202*"\u0939" + -0.147*"\u0938" + 0.133*"\u0926\u0930" + -0.126*"\u092a"'),
    (2, u'-0.378*"\u0938" + -0.343*"\u0932" + -0.295*"\u0935" + 0.276*"\u0930" + 0.272*"\u0915" + 0.268*"\u0926" + -0.253*"\u0939" + -0.192*"\u091f" + -0.163*"\u0926\u0930" + -0.148*"\u091c"'),
    (3, u'-0.508*"\u0930" + 0.392*"\u0924" + -0.323*"\u0938" + 0.296*"\u092e" + 0.179*"\u0939" + 0.178*"\u091a" + 0.169*"\u092f" + -0.166*"\u091c" + -0.133*"\u090f" + -0.125*"\u092a"'), 
    (4, u'0.514*"\u0938" + -0.308*"\u0917" + -0.280*"\u091c" + -0.256*"\u0930" + 0.229*"\u0939" + -0.227*"\u092f" + 0.208*"\u0915" + -0.201*"\u0928" + -0.175*"\u0932" + 0.173*"\u0926"')
]

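# Capture each weight and the quoted term that follows it; the optional
# " + " separator after the term is consumed but not captured.
pat = re.compile(r'(.*?)\*"(.*?)"\s*\+?\s*')

for i, row in data:
    print "\nRow", i
    # Convert each weight to a float; keep each term as a unicode string
    t = [(float(w), s) for w, s in pat.findall(row)]
    for w, s in t:
        print w, s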

output

Row 0
0.573 क
0.36 र
0.304 म
0.27 न
0.246 स
0.217 ल
0.189 द
0.189 त
0.184 ह
0.182 य

Row 1
-0.485 म
0.381 त
-0.359 ट
0.307 व
0.26 ब
0.229 द
0.202 ह
-0.147 स
0.133 दर
-0.126 प

Row 2
-0.378 स
-0.343 ल
-0.295 व
0.276 र
0.272 क
0.268 द
-0.253 ह
-0.192 ट
-0.163 दर
-0.148 ज

Row 3
-0.508 र
0.392 त
-0.323 स
0.296 म
0.179 ह
0.178 च
0.169 य
-0.166 ज
-0.133 ए
-0.125 प

Row 4
0.514 स
-0.308 ग
-0.28 ज
-0.256 र
0.229 ह
-0.227 य
0.208 क
-0.201 न
-0.175 ल
0.173 द

To get this output, you should set your terminal to use UTF-8 encoding.
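
For readers on Python 3, the same approach can be sketched as below. This is a port made under assumptions, not part of the original code: `parse_topic` is a hypothetical helper name, and the sample row is just the first three terms of row 1 from the data above.

```python
import re

# Same pattern as the Python 2 version: a weight, then the quoted term.
PAT = re.compile(r'(.*?)\*"(.*?)"\s*\+?\s*')


def parse_topic(row):
    """Return a list of (weight, term) pairs parsed from one topic string."""
    return [(float(w), s) for w, s in PAT.findall(row)]


# Python 3 strings are Unicode by default, so no u'' prefix is needed.
row = '-0.485*"\u092e" + 0.381*"\u0924" + -0.359*"\u091f"'

for weight, term in parse_topic(row):
    print(weight, term)
```

In Python 3, `print` is a function and the repr of a string shows printable non-ASCII characters directly, so the escapes are less likely to show up in the first place.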

FWIW, here's your data in a more readable form. To use it, tell your editor to save the script with UTF-8 encoding, and put a valid UTF-8 encoding declaration at the start of the script, e.g.

# -*- coding: utf-8 -*- 

data = [
    (0, u'0.573*"क" + 0.360*"र" + 0.304*"म" + 0.270*"न" + 0.246*"स" + 0.217*"ल" + 0.189*"द" + 0.189*"त" + 0.184*"ह" + 0.182*"य"'),
    (1, u'-0.485*"म" + 0.381*"त" + -0.359*"ट" + 0.307*"व" + 0.260*"ब" + 0.229*"द" + 0.202*"ह" + -0.147*"स" + 0.133*"दर" + -0.126*"प"'),
    (2, u'-0.378*"स" + -0.343*"ल" + -0.295*"व" + 0.276*"र" + 0.272*"क" + 0.268*"द" + -0.253*"ह" + -0.192*"ट" + -0.163*"दर" + -0.148*"ज"'),
    (3, u'-0.508*"र" + 0.392*"त" + -0.323*"स" + 0.296*"म" + 0.179*"ह" + 0.178*"च" + 0.169*"य" + -0.166*"ज" + -0.133*"ए" + -0.125*"प"'),
    (4, u'0.514*"स" + -0.308*"ग" + -0.280*"ज" + -0.256*"र" + 0.229*"ह" + -0.227*"य" + 0.208*"क" + -0.201*"न" + -0.175*"ल" + 0.173*"द"')
]

solution1
3 2016-07-07 11:55:49
