
Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).

Made-up example:

words
*remove me
words
words
*remove me 

My current code fails. It follows below:

import re

program = open(program_path, "r")
program_contents = program.readlines()
program.close() 

new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)

print new_contents

This produces ['', '', '', '', '', '', ' ', '', ' ', '', '*', ''], which is no good.

I'm very much a Python novice, but I'm eager to learn. I'll eventually bundle this into a function (right now I'm just trying to figure it out in an IPython notebook).

Thanks for the help!

You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.

* is a metacharacter, so you need to escape it as \* . The . 'match any character' syntax needs a quantifier to match more than one character. Don't use re.DOTALL here; you are operating line by line and don't want to erase the newline.
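
A quick sketch of those points, assuming Python 3 (the thread's own code is Python 2) and a made-up sample line. It suggests that the escaped pattern removes the text of a starred line but leaves its newline, while re.DOTALL would let . swallow the newline as well. Note that flags is passed to re.sub by keyword here, because the fourth positional parameter of re.sub is count.

```python
import re

line = "*remove me\n"  # made-up sample line, as readlines() would return it

# "^\*" is a literal asterisk at the start; ".*" is the rest of the line.
# Without re.DOTALL, "." stops before the newline, so the newline survives.
without_dotall = re.sub(r"^\*.*", "", line)

# With re.DOTALL, "." also matches the newline, so the whole line vanishes.
# flags is given by keyword: the fourth positional parameter of re.sub is count.
with_dotall = re.sub(r"^\*.*", "", line, flags=re.DOTALL)

# A line that does not start with "*" is returned unchanged.
untouched = re.sub(r"^\*.*", "", "words\n")

print(repr(without_dotall), repr(with_dotall), repr(untouched))
```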

There is no need to test first; if there is nothing to replace, the original line is returned unchanged.

pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))

Demo:

>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me 
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
... 
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']

Your regular expression seems to be incorrect:

[^*.]

This means: match any character that isn't * or . . Inside a bracket expression, a leading ^ negates the class, and everything after it is treated as a literal character. So in your expression the . matches a literal . character; it is not a wildcard.

This is why you get "*" for lines starting with * : you're removing every character except * . Any . present in the original string would be kept as well. Since the other lines contain neither * nor . , all of their characters are removed.
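
To see this concretely, here is a small sketch (Python 3 assumed, sample strings made up for illustration) of what substituting [^*.] with an empty string does to a few lines.

```python
import re

pattern = r"[^*.]"  # negated class: any character except "*" and "."

# Every character other than "*" and "." is deleted by the substitution.
starred = re.sub(pattern, "", "*remove me\n")   # only the "*" is left
plain = re.sub(pattern, "", "words\n")          # nothing is left
dotted = re.sub(pattern, "", "v1.2 * draft\n")  # only "." and "*" are left

print(repr(starred), repr(plain), repr(dotted))
```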

If you want to match lines beginning with * :

^\*.*

What might be easier is something like this:

import re

# Matches when the first character is anything except "*".
pat = re.compile(r"^[^*]")

new_contents = []
for line in contents:
    if pat.search(line):
        new_contents.append(line)

This code just keeps any line that does not start with * .

In the pattern ^[^*] , the first ^ matches the start of the string. The expression [^*] matches any character but * . So together this pattern matches any starting character of a string that isn't * .

When using regular expressions, it is a good habit to think about what you actually need. Do you simply need to assert something about a string, do you need to change or remove characters in a string, or do you need to match substrings?

In terms of Python, you need to think about what each function gives you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.

Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of its characters when you can just skip that line entirely? There's no sense in producing an empty string when you're filtering.
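
The difference between substituting and filtering shows up on the sample data. This is only a sketch, assuming Python 3; the variable names are illustrative and not from the thread.

```python
import re

lines = ["words\n", "*remove me\n", "words\n", "words\n", "*remove me \n"]

# Substituting keeps one entry per input line; starred lines shrink to "\n".
substituted = [re.sub(r"^\*.*", "", line) for line in lines]

# Filtering drops the starred lines altogether.
keep = re.compile(r"^[^*]")
filtered = [line for line in lines if keep.search(line)]

print(substituted)
print(filtered)
```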

Most importantly: Do I really need a regex? (Here you don't!)

You don't really need a regular expression here. Since you know the size and position of your delimiter, you can simply check like this:

if line[0] != "*": 

This will be faster than a regex. Regular expressions are very powerful tools and can be neat puzzles to figure out, but for a delimiter with a fixed width and position you don't really need them. A regex is more expensive than a direct check that makes use of this information.
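
Since the asker plans to wrap this in a function, here is one possible shape for it. It is a sketch under assumptions: Python 3, a hypothetical function name remove_starred_lines, and str.startswith in place of line[0] so that an empty string cannot raise an IndexError.

```python
import os
import tempfile


def remove_starred_lines(path):
    """Rewrite the file at `path` without the lines that start with "*".

    Hypothetical helper, not from the thread; returns the kept lines.
    """
    with open(path, "r") as handle:
        lines = handle.readlines()

    # startswith() is safe on empty strings, unlike line[0].
    kept = [line for line in lines if not line.startswith("*")]

    with open(path, "w") as handle:
        handle.writelines(kept)
    return kept


# Usage on a throwaway file holding the made-up example from the question.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("words\n*remove me\nwords\nwords\n*remove me \n")

kept_lines = remove_starred_lines(sample_path)
with open(sample_path) as handle:
    result_text = handle.read()
os.remove(sample_path)

print(kept_lines)
```

The function overwrites the file in place; writing to a separate output path instead would be the safer choice if the original needs to be preserved.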

You can do:

print('\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M)))

Example:

txt='''\
words
*remove me
words
words
*remove me '''

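import io   # Python 3; in Python 2 this was the StringIO module
import re

# Wrap the sample text in a file-like object to stand in for a real file.
f = io.StringIO(txt)

print('\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M)))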
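
One caveat worth checking before relying on this pattern: a negated class such as [^*] also matches a newline, so on input containing blank lines a match can start on the blank line and run into a following starred line. The sketch below (Python 3 assumed; the sample text and the lookahead variant are additions, not part of the answer above) shows the effect and one possible way around it.

```python
import re

text = "a\n\n*b\nc"  # made-up sample: a blank line right before a starred line

# [^*] consumes the blank line's newline, then .* takes the starred line.
with_class = re.findall(r"^[^*].*$", text, re.M)

# A negative lookahead checks the first character without consuming it.
with_lookahead = re.findall(r"^(?!\*).*$", text, re.M)

print(with_class)
print(with_lookahead)
```

On this input the first list contains '\n*b', so the starred line leaks through, while the lookahead variant yields an empty string for the blank line and skips *b.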
