
Python RegEx: How to deal with lines

I have a huge txt file in the following format:

BadLine
      property1=a
      property2=b
BadLine2
      property1=c
      property2=d
GOODLINE1
      property1=e
      property2=f

...and many more good and bad lines.

What I need to do is extract the properties of the good lines (e and f in the example above).

I can easily find the good lines in my file, but how do I then search for the properties with other regexes only within the block that belongs to each good line?

Thanks, guys!

The following code:

import re

test = '''
BadLine
      property1=a
      property2=b
BadLine2
      property1=c
      property2=d
GOODLINE1
      property1=e
      property2=f
BadLine
      property1=a
      property2=b
BadLine2
      property1=c
      property2=d
GOODLINE2
      property1=e
      property2=f
'''

pattern = r'^(GOODLINE(?:[^\n]|\n )*)'

# print() is a function in Python 3; the output is unchanged.
print(re.compile(pattern, re.MULTILINE).findall(test))

produces these results:

['GOODLINE1\n      property1=e\n      property2=f', 'GOODLINE2\n      property1=e\n      property2=f']

The pattern matches "GOODLINE" at the beginning of a line, then greedily matches any characters that are not linefeeds, plus any linefeed that is immediately followed by a space. Because the indented property lines start with spaces, the match runs on through them and stops at the next unindented line. If your text has tabs after the linefeeds instead of spaces, change the space in the pattern to a tab. Alternatively, you can match either one by changing the pattern like this:

pattern = r'^(GOODLINE(?:[^\n]|\n[ \t])*)'

Once you have these matches, it is easy to extract the properties with the regular string split() method.
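
As a minimal sketch of that last step, assuming every property sits on its own indented line in name=value form (the helper name parse_block and the sample text are made up for illustration):

```python
import re

text = """BadLine
      property1=a
      property2=b
GOODLINE1
      property1=e
      property2=f
BadLine2
      property1=c
      property2=d
"""

pattern = re.compile(r'^(GOODLINE(?:[^\n]|\n[ \t])*)', re.MULTILINE)


def parse_block(block):
    """Split one matched block into its header line and a dict of properties."""
    header, *prop_lines = block.split('\n')
    props = {}
    for line in prop_lines:
        if '=' not in line:
            continue  # ignore indented lines that are not name=value pairs
        # Split on the first '=' only, so a value may itself contain '='.
        name, value = line.strip().split('=', 1)
        props[name] = value
    return header, props


results = [parse_block(block) for block in pattern.findall(text)]
print(results)
```

Each entry in results pairs the GOODLINE header with a dictionary of its properties, so a value can be looked up by name rather than by position.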

Alternatively, you could see whether the rson package's parsing satisfies your needs; this looks like a file it could easily parse.

The short answer is that you can use:

GOODLINE\d*\n.*property1=(.*)\n.*property2=(.*)\n?

In this case the two parenthesized groups capture the values you are looking for. If the file was created in Windows or classic Mac style, it will have different line endings: '\r\n' on Windows and '\r' on classic Mac OS. On Linux you will have '\n' only. The above pattern matches a good line at the beginning or end of your string, even without a newline at the end. The property values can be more than one character long, as well.
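
If the carriage returns actually reach your string (for example because the file was read in binary mode or with newline=''; Python's default text mode already translates them to '\n'), the pattern has to allow for them. The following is only a sketch under that assumption, and the sample string and the NL helper are made up for illustration:

```python
import re

# Hypothetical sample with Windows-style line endings, as it would look
# if the file were read without newline translation.
windows_text = (
    'BadLine\r\n      property1=a\r\n      property2=b\r\n'
    'GOODLINE1\r\n      property1=e\r\n      property2=f\r\n'
)

# NL matches any of the three conventions; [^\r\n]* replaces '.' so that a
# stray '\r' never ends up inside a captured value.
NL = r'(?:\r\n|\r|\n)'
pattern = re.compile(
    r'GOODLINE\d*' + NL +
    r'[^\r\n]*property1=([^\r\n]*)' + NL +
    r'[^\r\n]*property2=([^\r\n]*)'
)

print(pattern.findall(windows_text))

# Alternative: normalize the line endings first and keep the simpler pattern.
unix_text = windows_text.replace('\r\n', '\n').replace('\r', '\n')
print(pattern.findall(unix_text))
```

Normalizing first is usually the simpler choice; the more tolerant pattern is useful when you cannot or do not want to modify the text.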

You can test your regular expressions on Pythex, a very useful website.

The code you can try is:

import re
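# Raw string so the backslashes reach the regex engine unchanged; \d* matches
# the optional digits after GOODLINE and (.*) captures each property value.
pattern = re.compile(r'GOODLINE\d*\n.*property1=(.*)\n.*property2=(.*)\n?')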
matchRes = re.findall(pattern,'''BadLine2
      property1=c
      property2=d
GOODLINE11
      property1=e
      property2=f
BadLine2
      property1=c
      property2=d
GOODLINE11
      property1=eee34
      property2=f00
BadLine2
      property1=c
      property2=d
GOODLINE1
      property1=e
      property2=f''')

# findall returns an empty list when nothing matches.
if matchRes:
    print(matchRes)
else:
    print('No match')

and you will get the following results as a list in which each pair holds the property1 and property2 values:

[('e', 'f'), ('eee34', 'f00'), ('e', 'f')]
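
Since the question mentions a huge file, it may be preferable not to read everything into one string. The sketch below is a different technique from the single regex above: it walks the input line by line and keeps only the current block in memory. It assumes that header lines start in the first column and property lines are indented; the function name iter_good_blocks is made up, and io.StringIO stands in for a real file object.

```python
import io
import re

HEADER = re.compile(r'^GOODLINE\d*$')


def iter_good_blocks(lines):
    """Yield (header, properties) for each GOODLINE block, one line at a time."""
    header, props = None, {}
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line[:1] in (' ', '\t'):
            # Indented line: a property of the current block, if we are in one.
            if header is not None and '=' in line:
                name, value = line.strip().split('=', 1)
                props[name] = value
            continue
        # Unindented line: the previous good block (if any) is finished.
        if header is not None:
            yield header, props
        if HEADER.match(line):
            header, props = line, {}
        else:
            header = None
    if header is not None:
        yield header, props


sample = io.StringIO(
    "BadLine2\n      property1=c\n      property2=d\n"
    "GOODLINE11\n      property1=eee34\n      property2=f00\n"
    "BadLine2\n      property1=c\n      property2=d\n"
    "GOODLINE1\n      property1=e\n      property2=f"
)
blocks = list(iter_good_blocks(sample))
print(blocks)
```

With a real file, the same generator can be fed the open file object directly, since iterating over a file yields its lines one at a time.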


 