Python: Robust way to compare strings

I have a CSV file being read into Python; I then save the reader's rows as an array (I guess).

I then compare the CSV file results against some Oracle DB results:

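import csv
import StringIO

# Parse the uploaded CSV text and keep every row in a list
readerSAP = csv.reader(StringIO.StringIO(request.POST['sap'].value), dialect=csv.excel)
readerSetSAP = list(readerSAP)

# Fetch all Person rows from the Oracle database
empsTbl = meta.Session.query(model.Person).all()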

I then use a nested loop to compare:

if i.userid != currEmp[0].strip():
    updated = True
    print "userid update"

The problem is that I often get this warning:

UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

So my question is:

What is the most robust way of comparing strings of this type in Python?

Your problem here is not "a robust way" to compare strings. A robust way to compare strings in Python is the equality operator, ==. Your problem is that your data is being converted to Unicode somewhere without your being aware of it.

You, and everyone else who writes code, should be aware that text is not ASCII, not in a post-1990 world. Even if your whole application is restricted to English and will never run in an international environment, you are bound to find some non-ASCII characters in people's names, or in words like "resumé".

Here is a Python console example of when the problem might happen:

>>> "maçã" == u"maçã"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

Python's CSV module does no automatic conversion and works with byte strings (that is, strings already encoded in some encoding), which means that the result you are fetching from the DB is in Unicode. Probably your connection is using some default.
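To illustrate why a byte string cannot be compared with a Unicode string without knowing its encoding, here is a minimal sketch. The two encodings chosen ("utf-8" and "cp1252") are assumptions for the example, not something taken from the question; the literals use escapes so the snippet does not depend on the source file's own encoding.

```python
# The same four-character text, written with escapes: "maçã"
text = u"ma\xe7\xe3"

# Encoding the text produces different byte sequences per encoding
utf8_bytes = text.encode("utf-8")     # 6 bytes: each accented letter takes two
cp1252_bytes = text.encode("cp1252")  # 4 bytes: one byte per character

# The raw bytes differ, so comparing bytes alone tells you nothing reliable
print(utf8_bytes == cp1252_bytes)

# Decoding each with the encoding it was written in recovers identical text
print(utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252"))
```

The first comparison prints False and the second prints True: only the decoded text is comparable.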

To solve that, assuming the data in your database is correctly formatted (and you have not already lost character information during the insertion), decode the string read from the CSV file using an explicit encoding, so that both values are in unicode (Python's internal, encoding-agnostic) string format:

>>> "maçã".decode("utf-8") == u"maçã"
True

So, use the "decode" method on the string read from the CSV file in order to have a proper conversion before comparing it. If the file was produced on Windows, decode with the system code page (typically "cp1252" on Western European and English-language installations; "cp1251" is the Cyrillic code page). On any other mainstream (application) OS it should be "utf-8".
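Applied to the comparison in the question, this could look roughly like the sketch below. The helper names to_text and userid_changed are hypothetical, and "utf-8" is an assumed encoding for the CSV data; substitute whatever encoding the file was actually written in.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def userid_changed(db_userid, csv_cell, encoding="utf-8"):
    """Return True when the DB value differs from the decoded, stripped CSV cell."""
    return to_text(db_userid, encoding) != to_text(csv_cell, encoding).strip()


# Hypothetical sample data: the CSV cell arrives as UTF-8 bytes with padding,
# while the database driver returns text.
csv_cell = b" ma\xc3\xa7\xc3\xa3 "
db_userid = u"ma\xe7\xe3"

print(userid_changed(db_userid, csv_cell))  # False: same text once decoded
```

Decoding at the point where the data enters the program, rather than inside every comparison, keeps the rest of the code working on text only.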

I'd advise reading this piece, which is rather useful: http://www.joelonsoftware.com/articles/Unicode.html
