Python - pyparsing unicode characters

Question

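:) I tried using w = Word(printables), but it does not work. How should I write the spec for this? 'w' is meant to handle Hindi characters (UTF-8).

The code specifies the grammar and parses accordingly.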

671.assess  :: अहसास  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number

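It works if there are only English characters, so the code is correct for ASCII input, but it does not work for Unicode input.

I mean that the code works when we have something of the form 671.assess :: ahsaas ::2

That is, it parses words in English format, but I am not sure how to parse and then print characters in Unicode format. I need this for English-Hindi word alignment.

The Python code looks like this: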

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables)
trans =  Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.scanString(efiledata)  # scanString yields (tokens, start, end), which the loop below indexes
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
  trans=""#translation string
  ew=""#english word
  xx=result[0]
  ew=xx[2]
  trans=xx[4]   
  edict1 = { ew:trans }
  edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2 

#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
  trans=""#translation string
  hw=""#hin word
  xx=result[0]  
  hw=xx[2]
  trans=xx[4]
  #print trans
  hdict1 = { trans:hw }
  hdict2.update(hdict1)

print len(hdict2) #no of entries in the hindi dictionary
print"hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():#ow=old word d=dict
        print ow , "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for" , ow , " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for" , ow , " = ", word
                return word
        return None
    else:
        print ow , "absent"
        return None

f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If a sample input sentence from the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 
!@#$%

the output would look like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 
!@#$%

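Explanation of the output: this achieves bidirectional alignment. That is, the first word of the English sentence, 'modern', maps to the first word of the Hindi sentence, 'AddhUnIk', and vice versa. Even punctuation characters are treated as words here, since they too are an integral part of the bidirectional mapping. So, if you look at the Hindi token '.', it has a null alignment (0): it maps to nothing in the English sentence, which has no full stop. The third line of the output is simply a delimiter, used when we are processing many sentences for which we are trying to achieve bidirectional mapping.

What modifications should I make so that this works when the Hindi sentences are in Unicode (UTF-8) format?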

Answer 1

Pyparsing's printables only covers characters in the ASCII range. You want printables in the full Unicode range, like this:

import sys

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I couldn't test against your Hindi test string, but I think this will do the trick.

If you are using Python 3, there is no separate unichr function and no xrange generator; just use:

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
                                        if not chr(c).isspace())
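
To see why the ASCII-only set fails on the question's data, here is a minimal Python 3 sketch using only the standard library. It assumes that `string.printable` minus whitespace is a fair stand-in for an ASCII printable set; the names `ascii_printables`, `unicode_printables`, and `word` are introduced here for illustration only.

```python
import string
import sys

# Stand-in for an ASCII-only printable set (assumption: this mirrors the
# ASCII range the answer describes; it is not taken from pyparsing itself).
ascii_printables = set(c for c in string.printable if not c.isspace())

# Full-range set, built the same way as in the answer (Python 3 form).
unicode_printables = set(
    chr(c) for c in range(sys.maxunicode) if not chr(c).isspace()
)

word = "अहसास"  # the Hindi translation from the question's sample line

# Every character of the word must be in the set for Word(...) to match it.
covered_ascii = all(ch in ascii_printables for ch in word)
covered_unicode = all(ch in unicode_printables for ch in word)
print(covered_ascii, covered_unicode)
```

The first check fails because no Devanagari character lies in the ASCII range, which is consistent with the grammar working only on English input.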

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
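
As a rough sketch of how this could be applied to the question's sample line, assuming Python 3 and pyparsing 2.3.0 or later: the grammar below is a trimmed version of the one in the question (it drops the trailing "." + number so that it matches the sample line as shown), and the name `entry` is introduced here for illustration.

```python
import pyparsing as pp

# Grammar adapted from the question; only the translation field changes,
# from the ASCII printables to the Devanagari printable set.
number = pp.Word(pp.nums)
src = pp.Word(pp.printables)
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)
entry = number + "." + src + "::" + trans + "::" + number

line = "671.assess  :: अहसास  ::2"  # sample line from the question
tokens = entry.parseString(line)

english_word = tokens[2]  # source word
hindi_word = tokens[4]    # translation
print(english_word, hindi_word)
```

Using the Devanagari set rather than the full `pyparsing_unicode.printables` keeps the translation field from accepting characters of unrelated scripts, which may or may not be what a given dictionary file needs.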

Answer 2

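As a general rule, do not process encoded bytestrings: turn them into proper Unicode strings (by calling their .decode method) as soon as possible, always do your processing on Unicode strings, and then, if you need to for I/O purposes, .encode them back into whatever bytestring encoding you require.

If you are talking about literals, as it seems you are in your code, "as soon as possible" means at once: use u'...' to express your literals. In the more general case, where you are forced to do I/O in encoded form, it means immediately after input (just as it means immediately before output if you need to produce output in a specific encoding).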
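
The decode-early, encode-late rule can be sketched as follows. This is a Python 3 illustration, where str is already Unicode text and the u'...' prefix is unnecessary; the in-memory bytes and the `io.BytesIO` sink stand in for real files and are assumptions made for the example.

```python
import io

# Bytes as they would arrive from a UTF-8 encoded file opened in binary mode.
raw = "671.assess  :: अहसास  ::2\n".encode("utf-8")

# Decode immediately after input...
text = raw.decode("utf-8")

# ...do all processing on Unicode text...
fields = [part.strip() for part in text.split("::")]
hindi = fields[1]

# ...and encode only right before output.
out = io.BytesIO()
out.write(hindi.encode("utf-8"))

print(len(hindi), len(out.getvalue()))
```

The two lengths differ because the word is 5 code points long but each Devanagari code point takes 3 bytes in UTF-8, which is why slicing or matching on the raw bytes gives misleading results.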

Answer 3

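I was searching for French Unicode characters and landed on this question. If you are looking for French or other Latin accented characters, with pyparsing 2.3.0 you can use: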

>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
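
A small usage sketch for that set, assuming Python 3 and pyparsing 2.3.0 or later; the input string and the name `accented_word` are made up for illustration.

```python
import pyparsing as pp

# A word made of Latin-1 letters, including accented ones.
accented_word = pp.Word(pp.pyparsing_unicode.Latin1.alphas)

# "\u00e9" is the precomposed e-acute, written as an escape to avoid
# ambiguity with the decomposed (e + combining accent) form.
tokens = accented_word.parseString("\u00e9t\u00e9 chaud")
first = tokens[0]
print(first)
```

Note that this set contains precomposed letters only, so text stored in decomposed form would need normalizing (for example with `unicodedata.normalize("NFC", ...)`) before parsing.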

3 answers

Answer 1
27 2010-02-26 09:43:50

Answer 2
7 (accepted) 2010-02-26 06:08:08

Answer 3
1 2019-11-03 22:45:54
