Count lines that match a specific regex pattern using Python 3.x

I have a source UTF8 file (no BOM, windows EOL) that looks like this:

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text
&&even_more_text_here

~someunicodetext_someunicodetext_someunicodetext~
some_more_unicode_text_some_more_unicode_text

~someunicodetext_someunicodetext_someunicodetext~

So there are 3 types of lines (4 if you count blank lines). My goal is to count each non-blank type using a Python regex. This absolutely has to be a regex-based solution in Python 3.x, because I want to understand how it works.

My python script looks something like this:

import re, codecs

pattern = re.compile(r'some_expression_here')
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    text = inputFile.read()
# findall returns one entry per match, so its length is the count
count = len(pattern.findall(text))
print(count)

The real problem I'm having is the regex itself.
~.*~ seems to match lines like 1, 4, 8 in my example above (counting from 1).
&&.* matches line 6.
But I can't figure out how to count the non-marked lines, which are lines 2, 5, 9.
In Notepad++ the expression ^(?!(~.*~)|(&&.*)).* or simply ^(?!~|&).* works for me (even though it is not exactly correct), but all my attempts to replicate this in Python have failed...

Edit: inputFile.read() doesn't read the file the way I expect it to (hello, Windows EOL), which may or may not be important. Its output looks like this:

~someunicodetext_someunicodetext_someunicodetext~

some_more_unicode_text_some_more_unicode_text



~someunicodetext_someunicodetext_someunicodetext~

some_more_unicode_text_some_more_unicode_text

&&even_more_text_here
    import re

    x = "~someunicodetext_someunicodetext_someunicodetext~ \n   \n \nsome_more_unicode_text_some_more_unicode_text \n"
    pattern = re.compile(r"(\S+)")
    # Python 3: print is a function
    print(len(pattern.findall(x)))

This counts runs of non-whitespace characters. Since the lines here contain no internal spaces, that equals the number of non-blank lines, so blank lines don't get counted. Hope this helps.

You could try the pattern ^\w.* with the re.MULTILINE flag.

The re.UNICODE flag should also be used for Python 2.

Here is a complete example:

import re, codecs

with codecs.open("input.txt", "r", "UTF8") as inputFile:
    data = inputFile.read()
pattern = re.compile(r'^\w.*', flags=re.MULTILINE)
lines = re.findall(pattern, data)

>>> data   #  note windows line termination
'~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n   \t\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n&&even_more_text_here\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\nsome_more_unicode_text_some_more_unicode_text\r\n\r\n~someunicodetext_someunicodetext_someunicodetext~\r\n'

>>> print(lines)
['some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r', 'some_more_unicode_text_some_more_unicode_text\r']

>>> print(len(lines))
3

So the regex matches the "non-marked" non-blank lines as required.
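
One likely reason the Notepad++ expression from the question failed in Python is that ^ only matches at the very start of the string unless re.MULTILINE is set. The sketch below illustrates this on a made-up sample string (not the asker's real file) with Windows line endings. The lookahead also rejects \r, which is an assumption added here so that blank lines under \r\n endings are skipped.

```python
import re

# Hypothetical sample with Windows line endings, shaped like the question's file.
data = (
    "~aaa~\r\n"
    "plain_one\r\n"
    "\r\n"
    "~bbb~\r\n"
    "plain_two\r\n"
    "&&extra\r\n"
    "\r\n"
    "~ccc~\r\n"
    "plain_three\r\n"
    "\r\n"
    "~ddd~\r\n"
)

# Negative lookahead: at each line start, refuse lines that begin with
# "~", "&" or "\r" (a blank line under \r\n endings), then require
# at least one character.
expr = r'^(?![~&\r]).+'

# Without re.MULTILINE, ^ is tried only at position 0, where the text
# starts with "~", so nothing matches.
without_flag = re.findall(expr, data)

# With re.MULTILINE, ^ also matches right after every "\n".
with_flag = re.findall(expr, data, flags=re.MULTILINE)

# "." matches "\r", so each match keeps a trailing "\r"; strip it for display.
unmarked = [m.rstrip('\r') for m in with_flag]
print(len(without_flag), len(with_flag), unmarked)
```

The trailing \r in each raw match is the same effect visible in the printed list above.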

Here is the answer. I'm still not sure if I'm handling Windows EOL correctly, but this seems to work. I was also hoping someone would explain where my issue was and why it works the way it does, but oh well.

What this does: we match every line that has ~EOL before it and ends with another EOL. At the same time, we exclude matches that have 2 or more consecutive EOLs.

So this matches only the lines directly below the lines that are marked with ~.

import re, codecs

# (?:...) keeps the group non-capturing, so findall returns whole matches.
# re.MULTILINE is omitted because the pattern uses no ^ or $ anchors.
regex = re.compile(r'(?!~(?:\r\n){2,})~\r\n.*\r\n')

with codecs.open('input_file', 'r', 'UTF8') as inputFile:
    text = inputFile.read()
count = len(regex.findall(text))
print(count)
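
An alternative that avoids spelling out \r\n context is to anchor one pattern per line type with ^ and $ under re.MULTILINE and count each separately. This is only a sketch: the sample text and the names "tilde", "ampersand" and "unmarked" are invented for illustration.

```python
import re

# Hypothetical sample with Windows line endings, shaped like the question's file.
data = (
    "~aaa~\r\n"
    "plain_one\r\n"
    "\r\n"
    "~bbb~\r\n"
    "plain_two\r\n"
    "&&extra\r\n"
    "\r\n"
    "~ccc~\r\n"
    "plain_three\r\n"
    "\r\n"
    "~ddd~\r\n"
)

# One anchored pattern per line type. Under re.MULTILINE, ^ matches after
# each "\n" and $ matches before each "\n"; "\r?" absorbs the carriage
# return that precedes the "\n" in Windows files.
patterns = {
    "tilde": re.compile(r'^~.*~\r?$', re.MULTILINE),
    "ampersand": re.compile(r'^&&.*$', re.MULTILINE),
    # First character is neither a marker nor whitespace, so blank
    # lines (which start with "\r" or "\n") are skipped.
    "unmarked": re.compile(r'^[^~&\s].*$', re.MULTILINE),
}

counts = {name: len(p.findall(data)) for name, p in patterns.items()}
print(counts)
```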

The "non-marked" lines can be identified as the lines which aren't blank and do not start with ~ and do not start with &.

So the following regex would work:

^[^~&\s].*

Read: ^ = match at the beginning, [^...] = a single character which is not in the set, ~&\s = the character ~, the character & or a whitespace character (i.e. not one of those), .* = anything can come after that.

(The \s is what excludes blank lines: a blank line begins with a line-ending character, which counts as whitespace.)

Also, it is much better to read the file line by line. You get:

import re, codecs

# First character must not be ~, & or whitespace (blank lines start with \r or \n)
pattern = re.compile(r'^[^~&\s].*')
with codecs.open("some_input_file", "r", "UTF8") as inputFile:
    # Iterating the file yields one line at a time, so no MULTILINE flag is needed
    count = sum(1 for line in inputFile if pattern.search(line))
print(count)
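
The same line-by-line idea can be extended to count all three line types in one pass. The sketch below uses an in-memory io.StringIO as a stand-in for the file, and the rule names and helper functions (classify, count_line_types) are hypothetical. As a side note, Python 3's built-in open() in text mode translates \r\n to \n by default, whereas codecs.open does not, which may explain the stray carriage returns seen in the question.

```python
import io
import re
from collections import Counter

# Ordered rules: the first pattern that matches decides the line type.
RULES = [
    ("tilde", re.compile(r'^~.*~\s*$')),     # ~...~ plus optional line ending
    ("ampersand", re.compile(r'^&&')),       # starts with &&
    ("unmarked", re.compile(r'^[^~&\s]')),   # starts with anything else non-blank
]

def classify(line):
    """Return the type name for a line, or None for blank/unrecognised lines."""
    for name, pattern in RULES:
        if pattern.search(line):
            return name
    return None

def count_line_types(lines):
    """Count line types over any iterable of lines, such as an open file."""
    return Counter(kind for kind in map(classify, lines) if kind is not None)

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "~aaa~\nplain_one\n\n~bbb~\nplain_two\n&&extra\n\n~ccc~\nplain_three\n\n~ddd~\n"
)
counts = count_line_types(sample)
print(dict(counts))
```

With a real file, passing the object returned by open("some_input_file", encoding="utf-8") to count_line_types should work the same way.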
