Counting lines that match a specific regex pattern using Python 3.x

Question

I have a source UTF-8 file (no BOM, Windows EOL) that looks like this:

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text
&&even_more_text_here

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~

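So there are 3 types of lines (4 if you count the blank ones). My goal is to count each non-blank type using a Python regex. It has to be a regex-based solution in Python 3.x, because I want to understand how it works.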

My Python script looks like this:

import re, codecs
pattern = re.compile(r'some_expression_here')
count = 0
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    inputFile=inputFile.read()
    lines = re.findall(pattern, inputFile)
    for match in lines:
        count +=1
print (count)

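The real problem I'm having is the regular expression itself.
~.*~ seems to match lines 1, 4, 8 and so on in the example above (counting from 1).
&&.* matches line 6.
But I can't figure out how to count the unmarked lines, i.e. lines 2, 5 and 9.
In Notepad++, the expression ^(?!(~.*~)|(&&.*)).* or simply ^(?!~|&).* works for me (even if it isn't entirely correct), but all my attempts to reproduce this in Python have failed...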

EDIT: inputFile.read() doesn't read the file the way I expected (hello, Windows EOL). That may or may not matter. Its output looks like this:

~someunicodetext_someunicodetext_someunicodetext~

some_more_unicode_text_some_more_unicode_text



~someunicodetext_someunicodetext_someunicodetext~

some_more_unicode_text_some_more_unicode_text

&&even_more_text_here
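
A note on the extra line breaks: codecs.open always opens the underlying file in binary mode and does no newline translation, so every line keeps its '\r\n', whereas the built-in open() in text mode translates '\r\n' to '\n' by default. Below is a minimal sketch of that difference. The temp file and its two-line content are stand-ins invented for illustration, not the real input file.

```python
import codecs
import os
import tempfile

# Hypothetical sample: two lines with Windows line endings, UTF-8, no BOM.
sample = "~marked~\r\nunmarked\r\n".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # codecs.open reads in binary mode underneath: '\r\n' is left untouched.
    with codecs.open(path, "r", "utf-8") as f:
        raw_text = f.read()

    # Built-in open() in text mode (newline=None) translates '\r\n' to '\n'.
    with open(path, "r", encoding="utf-8") as f:
        translated_text = f.read()
finally:
    os.remove(path)

print(repr(raw_text))         # carriage returns still present
print(repr(translated_text))  # only '\n' remains
```

If the stray '\r' characters get in the way of a pattern, reading with open(..., encoding="utf-8") is one way to avoid them; the other is to allow for '\r' in the regex.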

Answer 1

    import re

    x = "~someunicodetext_someunicodetext_someunicodetext~ \n   \n \nsome_more_unicode_text_some_more_unicode_text \n"
    pattern = re.compile(r"(\S+)")
    print(len(pattern.findall(x)))

This counts all lines except those made up only of whitespace, so blank lines are not counted. Hope this helps.
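
One caveat: \S+ matches each run of non-whitespace characters, so the result equals the number of non-blank lines only when no line contains internal spaces. A small sketch of the difference follows. The sample text and the line-oriented alternative pattern are assumptions added for illustration and are not part of the answer above.

```python
import re

# Hypothetical sample: one line without spaces, one blank line,
# and one line that contains internal spaces.
text = "no_spaces_here\n   \nhas two spaces\n"

# Counts runs of non-whitespace characters (tokens), not lines.
token_pattern = re.compile(r"\S+")

# Alternative: with re.MULTILINE, ^ anchors at each line start; a line
# counts if it contains at least one non-whitespace character.
line_pattern = re.compile(r"^.*\S", re.MULTILINE)

token_count = len(token_pattern.findall(text))
line_count = len(line_pattern.findall(text))

print(token_count)  # 4 tokens
print(line_count)   # 2 non-blank lines
```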

Answer 2

You can try the pattern ^\w.* together with the re.MULTILINE flag.

In Python 2 the re.UNICODE flag should be used as well.

Here is a complete example:

import re, codecs

with codecs.open("input.txt", "r", "UTF8") as inputFile:
    data = inputFile.read()
pattern = re.compile(r'^\w.*', flags=re.MULTILINE)
lines = re.findall(pattern, data)

>>> data   #  note windows line termination
'~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n   \t\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n&&even_more_text_here\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\n'

>>> print(lines)
['some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r']

>>> print(len(lines))
3

So the regex matches the "unmarked" non-blank lines, as required.
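
The same approach can be extended to the other two line types. Here is a minimal sketch, assuming the markers are exactly a leading and trailing ~ and a leading && as in the question. The sample string is abbreviated from the data above, and the dictionary keys are made-up labels.

```python
import re

# Abbreviated stand-in for the file contents, with Windows line endings.
data = (
    "~text~\r\n"
    "plain_text\r\n"
    "\r\n"
    "~text~\r\n"
    "plain_text\r\n"
    "&&more_text\r\n"
    "\r\n"
    "~text~\r\n"
)

# One MULTILINE pattern per line type; '\r?' tolerates the carriage
# return that precedes '\n' when the data keeps Windows EOLs.
patterns = {
    "tilde": re.compile(r"^~.*~\r?$", re.MULTILINE),
    "ampersand": re.compile(r"^&&.*", re.MULTILINE),
    "unmarked": re.compile(r"^\w.*", re.MULTILINE),
}

counts = {name: len(p.findall(data)) for name, p in patterns.items()}
print(counts)  # {'tilde': 3, 'ampersand': 1, 'unmarked': 2}
```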

Answer 3

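Here is the answer. I'm still not sure I'm handling the Windows EOLs correctly, but it seems to work. I would also have liked someone to tell me where I went wrong and why it behaves the way it does, but oh well.

Here is what it does. We match every line that is preceded by ~ plus an EOL and that ends with another EOL. At the same time, we make sure to exclude matches with 2 or more consecutive EOLs.

So this matches only the lines directly below a line marked with ~.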

import re, codecs

regex = re.compile(r'(?!~(\r\n){2,})~\r\n.*\r\n', re.MULTILINE)
count = 0

with codecs.open('input_file', 'r', 'UTF8') as inputFile:
    inputFile=inputFile.read()
    lines = re.findall(regex, inputFile)
    for match in lines:
        count +=1
print (count)
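
As for why the Notepad++ expression may have failed in Python, one plausible explanation (a guess, since the failing attempts are not shown) is that ^ only anchors at the start of the whole string unless re.MULTILINE is passed, and that with the flag the lookahead still lets blank lines through. The sketch below illustrates both effects on a made-up sample, plus one possible adjustment.

```python
import re

# Hypothetical sample with Windows line endings.
data = "~a~\r\nplain\r\n\r\n~b~\r\nplain\r\n&&extra\r\n\r\n~c~\r\n"

# Without re.MULTILINE, ^ anchors only at the very start of the string,
# where the lookahead rejects the leading '~'.
no_flag = re.findall(r"^(?!~|&).*", data)

# With re.MULTILINE, ^ matches at each line start, but blank lines
# (here just '\r') also pass the lookahead and get counted.
with_flag = re.findall(r"^(?!~|&).*", data, re.MULTILINE)

# Also rejecting a leading '\r' or '\n' and requiring at least one
# character leaves only the unmarked, non-blank lines.
unmarked = re.findall(r"^(?![~&\r\n]).+", data, re.MULTILINE)

print(no_flag)
print(with_flag)
print(len(unmarked))
```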

Answer 4

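The "unmarked" lines can be identified as lines that are not empty and that start with neither ~ nor &.

So the following regex will work:

^[^~&\s].*

Reading it: ^ = match at the start of the line, [^...] = a single character that is not one of those listed, ~&\s = the characters ~, & or a whitespace character (i.e. the first character must be none of these), .* = any characters may follow.

(I added \s just in case, since you said you had trouble with line breaks. I'm not sure it is needed.)

Also, it is better to read the file line by line. You get:

import re, codecs
pattern = re.compile(r'^[^~&\s].*')
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    count = sum(1 for line in inputFile if pattern.search(line))
print(count)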
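
The line-by-line idea can also count all three types in a single pass. Below is a minimal sketch under these assumptions: the labels, the classify_lines helper and the sample data are invented for illustration, and the unmarked pattern excludes a leading ~ as well as & and whitespace.

```python
import io
import re
from collections import Counter

# Hypothetical labels mapped to patterns; checked in order for each line.
LINE_TYPES = [
    ("tilde", re.compile(r"^~.*~\s*$")),
    ("ampersand", re.compile(r"^&&")),
    ("unmarked", re.compile(r"^[^~&\s]")),
]


def classify_lines(lines):
    """Count lines per type; lines matching no pattern (blank) are skipped."""
    counts = Counter()
    for line in lines:
        for label, pattern in LINE_TYPES:
            if pattern.search(line):
                counts[label] += 1
                break
    return counts


# io.StringIO stands in for an open file object here.
sample = io.StringIO(
    "~text~\r\nplain_text\r\n\r\n~text~\r\nplain_text\r\n&&more_text\r\n"
)
counts = classify_lines(sample)
print(dict(counts))
```

Any iterable of lines works as input, so an open file object can be passed in place of the StringIO stand-in.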

4 solutions

Solution 1
1 2014-07-22 05:36:37

Solution 2
0 2014-07-22 01:33:23

Solution 3
0 2014-07-22 05:23:00

Solution 4
0 2014-07-22 05:34:11
