Python 3.6.3 vs 2.7.3 for regular expressions: same script, different results

I am running the same script with Python versions 3.6.3 and 2.7.3. The script works fine in 2.7.3, but not in 3.6.3. The difference seems to be in the regular expression portion of my code.

Both script versions search the same external file for some strings and save the hits in lists. The len() of the resulting lists differs between the two versions.

I tried to make an MWE that reproduces the error by creating a small file to run the regexes on, but then both versions of Python produce the same output. The only solution I have is to provide the original file. It is quite a long text file, so you can download it from here: https://ufile.io/jjc56 (the file is available for 30 days). I thought this was better than pasting everything into the question.

This piece of code reproduces the error.

import re

inputfile = "opt-guess-firsttetint-r-h2o.out"
# The with-block closes the file on exit, so no explicit close() is needed.
with open(inputfile, "r") as input_file:
    input_string = input_file.read()

# Raw strings keep the regex backslashes literal; re.findall already returns a list.
match_geometry = re.findall(r'CARTESIAN COORDINATES \(ANGSTROEM\)(.*?)CARTESIAN COORDINATES \(A\.U\.\)', input_string, re.DOTALL)

match_energy = re.findall(r'FINAL SINGLE POINT ENERGY(.*?)-------------------------', input_string, re.DOTALL)

print(len(match_geometry))
print(len(match_energy))

Output with Python 3.6.3:

78
77

Output with Python 2.7.3:

188
188

For comparison:

$ grep "CARTESIAN COORDINATES (ANGSTROEM)" externalfile | wc -l
> 188

$ grep "FINAL SINGLE POINT ENERGY" externalfile | wc -l
> 188

If you need more information, please say so!

The main difference between Python 2 and Python 3 is text handling. In Python 2, text is treated as in bare C, i.e. as a sequence of bytes that happen to match printable ASCII characters in the range 32-126. That is not true in Python 3, where the bytes in your file are assumed to be in some text encoding and are decoded to proper Unicode code points before the program works with them.

Likewise, in Python 2 regexps operate by default on byte strings, and in Python 3 on text strings. (In Python 2 you can work with text as well, if both the expression and the text are 'unicode' objects rather than 'str'.)
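To make the bytes-versus-text distinction concrete, here is a minimal Python 3 sketch. The sample data is a made-up stand-in for the real output file, not taken from it. It shows that a bytes pattern on bytes data involves no decoding (close to what Python 2 does with 'str'), while a text pattern needs decoded text, and that mixing the two raises a TypeError.

```python
import re

# Hypothetical sample standing in for the contents of the real output file.
raw = b"CARTESIAN COORDINATES (ANGSTROEM)\nO 0.0 0.0 0.0\nCARTESIAN COORDINATES (A.U.)\n"

# Bytes pattern on bytes data: no decoding step, similar to Python 2 'str'.
bytes_hits = re.findall(
    rb"CARTESIAN COORDINATES \(ANGSTROEM\)(.*?)CARTESIAN COORDINATES \(A\.U\.\)",
    raw,
    re.DOTALL,
)

# Text pattern on decoded text: what Python 3 does when a file is opened in "r" mode.
text_hits = re.findall(
    r"CARTESIAN COORDINATES \(ANGSTROEM\)(.*?)CARTESIAN COORDINATES \(A\.U\.\)",
    raw.decode("utf-8"),
    re.DOTALL,
)

# Mixing a text pattern with bytes data is rejected in Python 3.
mixed_error = None
try:
    re.findall(r"ANGSTROEM", raw)
except TypeError as exc:
    mixed_error = exc

print(len(bytes_hits), len(text_hits), type(mixed_error).__name__)
```

Opening the file with mode "rb" and using bytes patterns is therefore one way to take the encoding out of the picture while diagnosing a difference like this.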

We'd need more context, but your problem likely lies in Python 3 reading your text file with an incorrect encoding. For example, your data may be UTF-8 while Python assumes Latin-1. Characters outside the ASCII range would then be read incorrectly without any error, since all byte values 0-255 are valid Latin-1, but the resulting mojibake would make the regexp fail.
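As an illustration of that failure mode, the sketch below decodes the same UTF-8 bytes once correctly and once as Latin-1. The sample string and pattern are invented for the example and are not from the file in the question.

```python
import re

text = "distance: 1.5 Å"           # hypothetical sample line
data = text.encode("utf-8")        # "Å" becomes the two bytes 0xC3 0x85

right = data.decode("utf-8")       # round-trips to the original text
wrong = data.decode("latin-1")     # no error raised, but "Å" turns into two characters

pattern = r"(\d+\.\d+) Å"
print(re.findall(pattern, right))  # ['1.5']
print(re.findall(pattern, wrong))  # []
print(ascii(wrong))                # 'distance: 1.5 \xc3\x85'
```

The Latin-1 decode succeeds silently, which is why nothing points at the encoding until a pattern that touches the affected characters stops matching.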

Just pass a proper encoding="..." argument that matches your file when opening it for reading, and you should be fine.
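A minimal sketch of that fix follows. It assumes the file is UTF-8, which is a guess; substitute whatever encoding the file really uses. To stay self-contained it writes a small made-up sample to a temporary file instead of reading the original output file.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the real output file, including a non-ASCII character.
sample = (
    "CARTESIAN COORDINATES (ANGSTROEM)\n"
    "O 0.000 0.000 0.000  (values in Å)\n"
    "CARTESIAN COORDINATES (A.U.)\n"
)
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".out", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Name the encoding explicitly instead of relying on the platform default.
# Assumption: the data is UTF-8.
with open(path, "r", encoding="utf-8") as input_file:
    input_string = input_file.read()

match_geometry = re.findall(
    r"CARTESIAN COORDINATES \(ANGSTROEM\)(.*?)CARTESIAN COORDINATES \(A\.U\.\)",
    input_string,
    re.DOTALL,
)
print(len(match_geometry))  # 1

os.remove(path)  # clean up the temporary sample file
```

When no encoding is given, Python 3 falls back to a locale-dependent default, which locale.getpreferredencoding(False) reports; checking that value on the machine where the counts differ may help confirm or rule out this explanation.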

FYI, one character that would trigger the behavior I described above is "Å", which seems quite likely to occur in this particular case.
