
Convert a unicode input to string for comparison

I am writing code that parses a table in a Word document and compares each cell to a keyword, which is an ASCII string.

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8')
tryingstring = tyring.encode('ascii')
print 'trying string' ,tryingstring

Error below:

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8','ignore')
File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 201, in __str__
    return str(self.__call__())
File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 201, in __str__
    return str(self.__call__())
UnicodeEncodeError: 'ascii' codec can't encode character u'\uf07a' in position 0: ordinal not in range(128)

It does not print anything. Shouldn't it, since tryingstring is an ASCII string by that point?

Going back to your original posting:

if tr1_find.search(str(ListTables[0].Cell(x,y))):
    print 'Found'
    value  = ListTables[0].Cell(x,y+1)

ListTables[0].Cell(x,y) returns a Cell instance from the Word document. Calling str() on it retrieves its Unicode value and tries to encode it to a byte string using the ascii codec. Since the value contains non-ASCII characters, this fails with UnicodeEncodeError.

In your later edit:

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8')
tryingstring = tyring.encode('ascii')
print 'trying string' ,tryingstring

unicode() retrieves the Unicode value, .encode('utf-8') converts it to a UTF-8 byte string, and the result is stored in tyring. The next line tries to encode that byte string again, this time to 'ascii'. That isn't valid, because only Unicode strings can be encoded, so Python 2 first attempts to convert the byte string back to a Unicode string using the default 'ascii' codec. When the byte string contains non-ASCII bytes, this raises UnicodeDecodeError (not UnicodeEncodeError).
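To make that hidden step visible, here is a minimal sketch that spells out the decode Python 2 performs implicitly when .encode() is called on a byte string. The helper name implicit_reencode is made up for illustration, and the sample character is the one from the traceback in the question.

```python
from __future__ import print_function

text = u'\uf07a'                    # the character reported in the traceback
utf8_bytes = text.encode('utf-8')   # fine: Unicode -> UTF-8 bytes


def implicit_reencode(byte_string):
    """Spell out what Python 2 does for byte_string.encode('ascii')."""
    # Hidden step: decode with the default 'ascii' codec.
    # This fails as soon as a byte >= 0x80 is present.
    as_unicode = byte_string.decode('ascii')
    return as_unicode.encode('ascii')


try:
    implicit_reencode(utf8_bytes)
    outcome = 'ok'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(outcome)
```

Pure-ASCII byte strings pass through this round trip unchanged, which is why the problem only appears once the cell contains a non-ASCII character.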

Best practice is to do all string processing in Unicode. What you are missing is the Range() method to get the value of the cell. Here's an example accessing a Word document table:

PythonWin 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import win32com.client
>>> word=win32com.client.gencache.EnsureDispatch('Word.Application')
>>> word.ActiveDocument.Tables[0].Cell(1,1).Range()
u'One\u4e00\r\x07'

Note that it is a Unicode string. Word also seems to use \r\x07 as a cell line terminator.

Now you can test the value:

>>> value = word.ActiveDocument.Tables[0].Cell(1,1).Range()
>>> value == 'One'   # NOTE: Python converts byte strings to Unicode via the default codec ('ascii' in Python 2.X)
False
>>> value == u'One'
False
>>> value == u'One马\r\x07'
False
>>> value == u'One一\r\x07'
True
>>> value == u'One\u4e00\r\x07'
True
>>> value == 'One\x92' # non-ASCII byte string fails to convert
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
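Since the comparisons above only succeed when the \r\x07 terminator is included, it may be convenient to strip it before comparing or searching. The following is a rough sketch, not part of the session above: cell_text, CELL_END and the keyword pattern are hypothetical names, and the sample value is the one Range() returned earlier.

```python
import re

CELL_END = u'\r\x07'  # terminator seen in the Range() output above


def cell_text(raw):
    """Return the cell's text without the trailing end-of-cell marker."""
    if raw.endswith(CELL_END):
        return raw[:-len(CELL_END)]
    return raw


# Hypothetical keyword; a Unicode pattern is matched against Unicode text,
# so no encoding to a byte string is involved.
keyword = re.compile(u'One', re.UNICODE)

raw_value = u'One\u4e00\r\x07'   # value shown in the session above
text = cell_text(raw_value)
found = keyword.search(text) is not None
```

Keeping both the pattern and the cell value as Unicode avoids the implicit ascii conversion entirely.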

Convert to a Unicode string, then convert it with encode():

if tr1_find.search(unicode(ListTables[0].Cell(x,y)).encode('utf-8')):

Try this, I'm wondering if it might help:

if tr1_find.search(unicode(ListTables[0].Cell(x,y)).encode('utf-8','ignore')):
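For reference, here is a small sketch of what the 'ignore' error handler actually does. It only applies to characters the target codec cannot represent, so with 'utf-8' it normally drops nothing, while with 'ascii' it silently discards non-ASCII characters. The sample string is an assumption chosen for illustration. Judging from the traceback in the question, the error is raised inside the unicode(...) conversion itself, before .encode() runs, so the handler may not help there.

```python
text = u'One\u4e00'  # sample value with one non-ASCII character

# UTF-8 can represent U+4E00, so 'ignore' has nothing to drop here.
utf8_bytes = text.encode('utf-8', 'ignore')

# ASCII cannot represent U+4E00, so 'ignore' silently removes it.
ascii_bytes = text.encode('ascii', 'ignore')
```

Silently dropping characters can make two different cells compare equal, so it is worth using with care.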

You might also find this page from Python's documentation helpful: http://docs.python.org/howto/unicode.html

It covers this exact sort of problem. 它涵盖了这种确切的问题。

Did you open the file using codecs.open()? You can specify the file encoding in that function.

http://docs.python.org/library/codecs.html
