
Convert a unicode input to string for comparison

I am writing code that parses a table in a Word document and compares each cell to a keyword, which is an ASCII string.

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8')
tryingstring = tyring.encode('ascii')
print 'trying string' ,tryingstring

ERROR BELOW:

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8','ignore')
File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 201, in __str__
    return str(self.__call__())
File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 201, in __str__
    return str(self.__call__())
UnicodeEncodeError: 'ascii' codec can't encode character u'\uf07a' in position 0: ordinal not in range(128)

Nothing is printed. Shouldn't it print, since tryingstring is an ASCII string by now?

Going back to your original posting:

if tr1_find.search(str(ListTables[0].Cell(x,y))):
    print 'Found'
    value  = ListTables[0].Cell(x,y+1)

ListTables[0].Cell(x,y) returns a Cell instance from the Word document. Calling str() on it retrieves its Unicode value and tries to encode that value to a byte string using the ascii codec. Because the value contains non-ASCII characters, the call fails with UnicodeEncodeError.
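
The failure can be reproduced without Word. This is a minimal sketch that assumes the cell text starts with u'\uf07a', the character named in the traceback; cell_value and error_name are illustrative names, not part of the original code.

```python
# Stand-in for the Unicode text held by the Word cell (an assumption).
cell_value = u'\uf07aKeyword'

try:
    # In Python 2, str() on a Unicode value performs this encode implicitly.
    cell_value.encode('ascii')
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__  # name of the exception that was raised
```

error_name ends up as 'UnicodeEncodeError', the same exception type shown in the traceback.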

In your later edit:

tyring = unicode((ListTables[0].Rows[x])).encode('utf-8')
tryingstring = tyring.encode('ascii')
print 'trying string' ,tryingstring

unicode() retrieves the Unicode value, .encode('utf-8') converts it to a UTF-8 byte string, and the result is stored in tyring. The next line tries to encode that byte string again, this time to 'ascii'. This isn't valid, because only Unicode strings can be encoded, so Python first attempts to convert the byte string back to a Unicode string using the default 'ascii' codec. This causes a UnicodeDecodeError (not a UnicodeEncodeError).
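
The hidden decode step can be written out explicitly. This is a sketch under the assumption that the cell holds u'One\u4e00' (the value used in the example session in this answer); text, utf8_bytes and error_name are illustrative names.

```python
text = u'One\u4e00'                # Unicode string, like the cell value
utf8_bytes = text.encode('utf-8')  # byte string: 'One\xe4\xb8\x80'

try:
    # Python 2 runs the equivalent of this decode behind the scenes when
    # .encode('ascii') is called on a byte string.
    utf8_bytes.decode('ascii')
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__  # name of the exception that was raised
```

Here error_name is 'UnicodeDecodeError': the byte 0xe4 is not valid ASCII, so the conversion back to Unicode fails before any encoding is attempted.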

Best practice is to do all string processing in Unicode. What you are missing is the Range() method to get the value of the cell. Here's an example accessing a Word document table:

PythonWin 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import win32com.client
>>> word=win32com.client.gencache.EnsureDispatch('Word.Application')
>>> word.ActiveDocument.Tables[0].Cell(1,1).Range()
u'One\u4e00\r\x07'

Note that it is a Unicode string. Word also seems to use \r\x07 as the cell terminator.
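
If that terminator gets in the way of comparisons, it can be removed first. The helper below is a hypothetical sketch (cell_text is not a win32com function) and assumes the marker is always the two characters \r\x07 at the very end of the value.

```python
CELL_END = u'\r\x07'  # terminator observed in the session above


def cell_text(raw):
    # Drop exactly one trailing cell marker; leave other values unchanged.
    if raw.endswith(CELL_END):
        return raw[:-len(CELL_END)]
    return raw


cleaned = cell_text(u'One\u4e00\r\x07')
```

With the marker removed, cleaned == u'One\u4e00' holds, so the keyword can be compared without spelling out the terminator each time.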

Now you can test the value:

>>> value = word.ActiveDocument.Tables[0].Cell(1,1).Range()
>>> value == 'One'   # NOTE: Python converts byte strings to Unicode via the default codec ('ascii' in Python 2.X)
False
>>> value == u'One'
False
>>> value == u'One马\r\x07'
False
>>> value == u'One一\r\x07'
True
>>> value == u'One\u4e00\r\x07'
True
>>> value == 'One\x92' # non-ASCII byte string fails to convert
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
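
The same idea applies to the regular expression search from the question: keep both the pattern and the cell value in Unicode and skip the encode step. This is only a sketch; the pattern u'One' and the name keyword_re stand in for the question's tr1_find, whose definition is not shown.

```python
import re

# Hypothetical keyword pattern, compiled from a Unicode string.
keyword_re = re.compile(u'One', re.UNICODE)

# Value as returned by Range() in the session above.
cell_value = u'One\u4e00\r\x07'

# Searching the Unicode value directly needs no str() or encode() call.
found = keyword_re.search(cell_value) is not None
```

found is True here, and no implicit ASCII conversion takes place because both operands are already Unicode.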

Convert to a Unicode string and then convert it using encode():

if tr1_find.search(unicode(ListTables[0].Cell(x,y)).encode('utf-8')):

Try this, I'm wondering if it might help:

if tr1_find.search(unicode(ListTables[0].Cell(x,y)).encode('utf-8','ignore')):
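
It may help to see what the 'ignore' handler actually does to the data. The sketch below uses a made-up cell value (u'\uf07aKeyword'); it illustrates the encode() error handlers only and says nothing about how win32com objects behave.

```python
cell_value = u'\uf07aKeyword'  # made-up text with one non-ASCII character

# With the ascii codec, 'ignore' silently drops what cannot be encoded.
ascii_lossy = cell_value.encode('ascii', 'ignore')

# UTF-8 can represent this character, so 'ignore' changes nothing here.
utf8_bytes = cell_value.encode('utf-8', 'ignore')
round_trip = utf8_bytes.decode('utf-8')
```

ascii_lossy contains only the bytes for 'Keyword', while round_trip equals the original value. Dropping characters can make a search succeed or fail for the wrong reason, so it is worth checking which codec the handler is paired with.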

You might also find this page from Python's documentation helpful: http://docs.python.org/howto/unicode.html

It covers this exact sort of problem.

Did you open the file using codecs.open()? You can specify the file encoding in that function.

http://docs.python.org/library/codecs.html
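
For a plain text file, that looks roughly like the sketch below. It assumes a UTF-8 text file created in a temporary directory purely for the demonstration; sample.txt is a hypothetical name, and a Word document read through COM is not opened this way.

```python
import codecs
import os
import tempfile

# Hypothetical text file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Write Unicode text; codecs.open encodes it as UTF-8 on the way out.
with codecs.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'One\u4e00\n')

# Read it back; codecs.open decodes the bytes to a Unicode string.
with codecs.open(path, 'r', encoding='utf-8') as handle:
    first_line = handle.readline()
```

first_line comes back as the Unicode string u'One\u4e00\n', so the rest of the processing can stay in Unicode.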
