Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk ("*").

Made-up example:

words
*remove me
words
words
*remove me 

My current code fails. It follows below:

import re

program = open(program_path, "r")
program_contents = program.readlines()
program.close() 

new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)

print new_contents

This produces ['', '', '', '', '', '', ' ', '', ' ', '', '*', ''], which is no good.

I'm very much a python novice, but I'm eager to learn. And I'll eventually bundle this into a function (right now I'm just trying to figure it out in an ipython notebook).

Thanks for the help!

You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.

* is a meta character, so you want to escape it as \* . The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.

There is no need to test first; if there is nothing to replace, the original line is returned.

pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))

Demo:

>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me 
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
... 
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']
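To illustrate the point about re.DOTALL, here is a minimal sketch written for Python 3 (the variable names are only illustrative). The flag is passed by keyword because the fourth positional parameter of re.sub is count, not flags.

```python
import re

pattern = r"^\*.*"
line = "*remove me\n"

# Without DOTALL, "." stops at the newline, so the line ending survives.
without_dotall = re.sub(pattern, "", line)

# With DOTALL, "." also matches "\n", so the whole line is erased.
with_dotall = re.sub(pattern, "", line, flags=re.DOTALL)

# A line that does not start with "*" comes back unchanged.
untouched = re.sub(pattern, "", "words\n")

print(repr(without_dotall), repr(with_dotall), repr(untouched))
```

Either way the list keeps one entry per input line; the starred lines become '\n' or '' and are not dropped from the list.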

Your regular expression seems to be incorrect:

[^*.]

This means: match any character that isn't * or . (the leading ^ negates the class). Inside a bracket expression, the characters after that first ^ are treated as literals, so the . in your expression matches a literal . character; it is not a wildcard.

This is why you get "*" for lines starting with * : you're replacing every character but * ! You would also keep any . present in the original string. Since the other lines contain neither * nor . , all of their characters are replaced.
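A small Python 3 sketch of that behaviour, using made-up input strings, shows what survives a substitution with [^*.]:

```python
import re

pattern = r"[^*.]"  # any single character that is neither "*" nor "."

# Every character except "*" and "." is replaced with nothing,
# including the trailing newline.
starred = re.sub(pattern, "", "*remove me\n")
plain = re.sub(pattern, "", "words\n")
mixed = re.sub(pattern, "", "a.b*c")

print(repr(starred), repr(plain), repr(mixed))
```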

If you want to match lines beginning with * :

^\*.*

What might be easier is something like this:

pat = re.compile("^[^*]")

for line in contents:
    if re.search(pat, line):
        new_contents.append(line)

This code just keeps any line that does not start with * .

In the pattern ^[^*] , the first ^ matches the start of the string. The expression [^*] matches any character but * . So together this pattern matches any starting character of a string that isn't * .

It is a good habit to think about what you actually need when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, do you need to match substrings?

In terms of python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
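As a rough illustration of what each function gives you, here is a Python 3 sketch comparing three common re calls on the same line (the variable names are illustrative only). Note that re.findall returns a plain list, which has no .group() method.

```python
import re

pattern = r"^\*.*"
line = "*remove me\n"

found = re.findall(pattern, line)     # a list of the matched strings
match = re.search(pattern, line)      # a match object, or None if nothing matched
replaced = re.sub(pattern, "", line)  # a new string with the match replaced

no_match = re.search(pattern, "words\n")  # this line does not start with "*"

print(found, match.group(0), repr(replaced), no_match)
```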

Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters, when you can just skip that line entirely? There's no sense in making an empty string when you're filtering.

Most importantly: Do I really need a regex? (Here you don't!)

You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:

if line[0] != "*": 

This will be faster than a regex. Regular expressions are very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position you don't really need them. A regex is much more expensive than an approach that makes use of this information.
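A minimal sketch of the regex-free filter in Python 3 might look like the following. The function name drop_starred_lines is hypothetical, and str.startswith is swapped in for the line[0] check because it also copes with empty strings, where line[0] would raise IndexError.

```python
def drop_starred_lines(lines):
    """Return only the lines that do not start with an asterisk."""
    # startswith returns False for an empty string instead of raising.
    return [line for line in lines if not line.startswith("*")]


program_contents = ["words\n", "*remove me\n", "words\n", "words\n", "*remove me \n"]
new_contents = drop_starred_lines(program_contents)
print(new_contents)
```

Unlike the re.sub approach, the starred lines are left out of the list altogether, with no leftover '\n' entries.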

You can do:

print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))

Example:

txt='''\
words
*remove me
words
words
*remove me '''

import StringIO

f=StringIO.StringIO(txt)

import re

print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
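For Python 3, a rough equivalent of the example could look like this, assuming io.StringIO in place of the Python 2 StringIO module and f.read() in place of ''.join(f). One caveat: [^*] can also match a newline, so an empty line directly before a starred line would pull that starred line into the match.

```python
import io
import re

txt = """\
words
*remove me
words
words
*remove me """

f = io.StringIO(txt)  # stands in for an open text file

# With re.M, ^ and $ anchor at every line, so each kept line is one match.
kept = re.findall(r"^[^*].*$", f.read(), re.M)
print("\n".join(kept))
```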
