
python regex: regex fails when reading from file

My regex

import re

table_names = ['check_channel_types', 'qualitycheckresult']
tables_group = '|'.join(table_names)
# Capture each CREATE TABLE statement for the listed tables, up to the first ';'
pattern = re.compile(r'(CREATE TABLE "({0})"(.*?);)'.format(tables_group), re.DOTALL)
match = pattern.findall(s)

works fine with this test string:

s ="""CREATE TABLE "check_boundary_nodes" (
    "id" serial NOT NULL PRIMARY KEY,
    "test_name" varchar(120),
    "field_name" varchar(120),
    "original_pk" varchar(15),
    "check_result" varchar(255),
    "constraint" varchar(120),
    "the_geom" geometry(GEOMETRY,28992)
)
;
CREATE TABLE "check_channel_types" (
    "id" serial NOT NULL PRIMARY KEY,
    "original_pk" integer CHECK ("original_pk" >= 0) NOT NULL,
    "channel_inp_id" integer CHECK ("channel_inp_id" >= 0),
    "type" integer CHECK ("type" >= 0),
    "suggested_type" integer CHECK ("suggested_type" >= 0),
    "the_geom" geometry(LINESTRING,28992)
)
;
CREATE TABLE "qualitycheckresult" (
    "id" serial NOT NULL PRIMARY KEY,
    "qualitycheck" varchar(512) NOT NULL,
    "created" timestamp with time zone NOT NULL,
    "result" integer NOT NULL,
    "resultvalue" varchar(256) NOT NULL,
    "message" varchar(512) NOT NULL,
    "object_id" integer,
    "object_type" varchar(512) NOT NULL,
    "old_value" text NOT NULL
)
;"""  

Once I read the text from a file-like object, the regular expression fails (it does not find any matches). I assume it has to do with the quote characters, but I find it hard to debug because the string I'm reading from the 'file' is very long. What feels really strange is that internally it should make no difference whether the string is triple-quoted or not. Any help is highly appreciated.

This is how I retrieve the data in my app:

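from StringIO import StringIO  # Python 2 module
from django.core.management import call_command

content = StringIO()
# Capture the SQL that the management command writes to stdout
call_command('sql', 'my_app', database=self.alias,
             stdout=content)
content.seek(0)
a = content.getvalue()
type(a)  # <type 'str'>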

Try writing the output with UTF-8 encoding, and add the flags:

re.MULTILINE|re.DOTALL

to the regex.

f_stream = open("my_dumpfile.txt", 'w', encoding="utf-8")
call_command("dumpdata", indent=4, stdout=f_stream)

Or:

content = StringIO(content.read().decode('utf8'))

Or:

a = content.read().decode('utf8')

OK, let's wrap this up.

The reason the regex fails when reading from the StringIO object is that the strings contain ANSI escape sequences (apparently to give certain lines a different color). These invisible sequences sit between the tokens the pattern expects to be adjacent, so nothing matches. This answer shows how to remove the escape sequences using a regular expression. Then everything works as expected.
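To make the fix concrete, here is a minimal sketch of stripping the escape sequences before matching. The `strip_ansi` helper, the `ANSI_CSI` pattern, and the colorized sample string are my own illustration, not taken from the linked answer, and the exact color codes the command emits are an assumption. The pattern covers the common CSI sequences of the form `ESC [ ... letter`.

```python
import re

# Matches CSI escape sequences such as "\x1b[32;1m" (color) and "\x1b[0m" (reset).
ANSI_CSI = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')


def strip_ansi(text):
    """Return text with ANSI CSI escape sequences removed (hypothetical helper)."""
    return ANSI_CSI.sub('', text)


table_names = ['check_channel_types', 'qualitycheckresult']
tables_group = '|'.join(table_names)
pattern = re.compile(r'(CREATE TABLE "({0})"(.*?);)'.format(tables_group), re.DOTALL)

# Hypothetical colorized output; the real codes may differ.
colored = ('\x1b[33;1mCREATE TABLE\x1b[0m \x1b[32;1m"qualitycheckresult"\x1b[0m (\n'
           '    "id" serial NOT NULL PRIMARY KEY\n'
           ')\n'
           ';')

# repr() makes the otherwise invisible escape bytes show up while debugging.
print(repr(colored[:40]))

matches_before = pattern.findall(colored)             # escape codes break the match
matches_after = pattern.findall(strip_ansi(colored))  # matches once they are removed
```

Printing `repr()` of a short slice of the string is also a quick way to spot hidden characters in a very long input without reading all of it.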
