Trying to delete specific lines from file based on keyword

I have a pretty specific problem. I am trying to delete certain lines from a server configuration file based on a keyword search. In the file shown below, I want to delete the block at the bottom that has the keyword "nasdaq" in its directory line. This includes everything from the "database" line down to the last line, which reads "index termName pres,eq".

What is the best way to go about this? String.find()? What commands should I use to delete the lines above and below the keyword line?

Also, I could either delete the lines in place or just write to a new file and skip the last block. Some guidance needed!

include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/core.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/cosine.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/inetorgperson.schema
include         /home/tuatara/TuataraServer-2.0/etc/openldap/schema/tuatara.schema
pidfile         /home/tuatara/TuataraServer-2.0/var/slapd.pid
argsfile        /home/tuatara/TuataraServer-2.0/var/slapd.args

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker"
suffix          "dc=CMDB-spellchecker,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-spellchecker.medicinenet-20130106-06_20_31_PM
suffix          "o=CMDB-spellchecker.medicinenet"
suffix          "dc=CMDB-spellchecker.medicinenet,dc=com"
rootdn          "cn=admin,o=CMDB-spellchecker.medicinenet"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     1000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

database        ldbm
loglevel        0
directory       /home/tuatara/TuataraServer/var/openldap-ldbm-CMDB-nasdaq-20131127-12_37_43_PM
suffix          "o=CMDB-nasdaq"
suffix          "dc=CMDB-nasdaq,dc=com"
rootdn          "cn=admin,o=CMDB-nasdaq"
rootpw          tuatara
schemacheck     on
lastmod         off
sizelimit       100000
defaultaccess   read
dbnolocking
dbnosync
cachesize       100000
dbcachesize     100000000
dbcacheNoWsync
index           objectclass pres,eq
index           default pres,eq
index           termName pres,eq

As was already mentioned, sed is built for this kind of thing, but you could do it in Python with something like this:

with open('nasdaq.txt') as fin, open('nonasdaq.txt', 'w') as fout:
    for line in fin:
        if 'nasdaq' not in line:
            fout.write(line)

All it does is loop over the lines of the input file and copy them to the output file if they don't contain the string 'nasdaq'. Note that this drops only the lines that contain the keyword, not the rest of the block around them.
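If the whole section has to go rather than single lines, one option is to split the text on blank lines and keep only the sections that do not mention the keyword. This is a minimal sketch, assuming sections are separated by at least one blank line; the function name drop_sections is invented for illustration:

```python
import re

def drop_sections(text, keyword):
    """Return text without the blank-line-separated sections that contain keyword."""
    # A section boundary is a newline followed by one or more blank lines
    # (lines that are empty or hold only spaces and tabs).
    sections = re.split(r'\n(?:[ \t]*\n)+', text.strip())
    kept = [s for s in sections if keyword not in s]
    # Rejoin the surviving sections with exactly one blank line between them.
    return '\n\n'.join(kept) + '\n'

sample = (
    "include a\n"
    "\n"
    "database x\n"
    "directory /var/foo\n"
    "\n"
    "database y\n"
    "directory /var/nasdaq-1\n"
    "index z\n"
)
result = drop_sections(sample, 'nasdaq')
print(result)
```

To apply it to a file, read the file into a string, call the function, and write the result to a new file.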

This should fit your need, I think:

import re

pat = '(?:^(?![\t ]*\r?\n).+\n)*?'\
      '.*nasdaq.*\n'\
      '(?:^(?![\t ]*\r?\n).+\n?)*'

filename = 'to_define.txt'

# open in text mode ('r+'): in Python 3 a str pattern cannot be applied to bytes
with open(filename,'r+') as f:
    content = f.read()
    f.seek(0,0)
    f.write(re.sub(pat,'',content,flags=re.M))
    f.truncate()

It works only if the sections are separated by at least one blank line (it may be an empty line '\n' or a line such as ' \t \n' containing only blanks and tabs; it doesn't matter).

'(?:^(?![ \t]*\r?\n).+\n)*?'\
'.*nasdaq.*\n'\
'(?:^(?![ \t]*\r?\n).+\n?)*'

[\t ] means a character that can be either a tab or a blank.
[\t ]* means that such a character (a tab or a blank) is repeated 0 or more times.
(?! begins a negative lookahead assertion.
(?= begins a positive lookahead assertion.
(?![\t ]*\r?\n) means that the following sequence must not appear after this position: zero or more blanks or tabs, an optional \r character, and the newline character \n.
When I use the word 'position', I mean the location between two characters.
An assertion applies from the position where it is placed.
In the above RE, the negative lookahead assertion is placed after the symbol ^, which means the position before the first character of a line.
So the above assertion, placed where it is, means: starting from the beginning of a line, there must not be a sequence of 0 or more tabs/blanks, an optional \r, and a \n.
Note that the symbol ^ means "beginning of a line" only if the flag re.MULTILINE is activated; without it, ^ matches only at the beginning of the string.
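To see the assertion and the flag at work, here is a small sketch on an invented sample string. It lists the lines matched by the sub-pattern, first with re.M and then without it:

```python
import re

# Five lines: two of them are blank (one holds only a blank, a tab and a blank).
sample = "first\n \t \nsecond\n\nthird\n"

line_pat = r'^(?![\t ]*\r?\n).+\n'

# With re.M, ^ matches at the start of every line; the negative lookahead
# rejects the lines that hold nothing but blanks/tabs before their newline.
non_blank = re.findall(line_pat, sample, flags=re.M)

# Without re.M, ^ matches only at the very start of the string.
no_flag = re.findall(line_pat, sample)

print(non_blank)
print(no_flag)
```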

Now the partial RE (?![\t ]*\r?\n) is situated inside the following RE:
(?:^.+\n)*?
Normally, (...) defines a capturing group.
The consequence of putting ?: at the beginning between the parentheses is that they no longer define a capturing group. But (?:...) is still useful for grouping part of a RE.

Here .+\n means a succession of one or more characters (none of them \n) followed by a \n.

And ^.+\n (with the flag re.M activated) means: from the beginning of a line, a succession of characters other than a newline, followed by a newline.
Note that, as a dot . matches any character except \n, we are sure that .+ can't match a sequence that goes beyond the end of the line, which is marked by \n.
So ^.+\n in fact defines a line!

Now, what do we have?
There's a * after the non-capturing group. It means that the substrings matching (?:^.+\n) are repeated 0 or more times: that is to say, we match a succession of lines.

But not just any lines, since there's the negative lookahead assertion, whose meaning you now know.
So, what is matched by the RE (?:^(?![\t ]*\r?\n).+\n)* is: a succession of lines among which there is no blank line. A blank line is either \n, or \t\t\n, or \t \t \n, etc. (I can't show a line containing only blanks on Stack Overflow, but that is also a blank line.)

The question mark at the end of this RE makes the quantifier lazy: the regex engine, which matches such non-blank lines one after the other, must stop as soon as what comes next matches the following RE.
And the following RE is .*nasdaq.*\n, which means a line containing the word 'nasdaq'.

There are some more subtleties, but I will stop here.
I think the rest will now be easier for you to understand.
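Putting the three parts together, here is a small check of the pattern on an invented three-section string (the pattern is written with raw strings here, which is equivalent for this regex). Only the section with a line mentioning the keyword is removed, and the blank separator lines on both sides of it stay:

```python
import re

pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'    # non-blank lines before the keyword line
       r'.*nasdaq.*\n'                  # the line containing the keyword
       r'(?:^(?![\t ]*\r?\n).+\n?)*')   # non-blank lines after it

sample = (
    "database ldbm\n"
    "directory /var/one\n"
    "index a\n"
    "\n"
    "database ldbm\n"
    "directory /var/nasdaq-two\n"
    "index b\n"
    "\n"
    "database ldbm\n"
    "directory /var/three\n"
    "index c\n"
)

cleaned = re.sub(pat, '', sample, flags=re.M)
print(cleaned)
```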

EDIT

If a section is the last one in the file and its last line contains nasdaq (and does not end with a newline), it won't be caught and deleted by the above regex.
To correct this, the part .*nasdaq.*\n must be replaced with .*nasdaq.*(\n|\Z), in which \Z means the very end of the string.

I also added a part to the regex to catch the blank lines that follow the deleted section, so these lines are removed from the file as well:

pat = r'(?:^(?![\t ]*\r?\n).+\n)*?'\
      r'.*?nasdaq.*(\n|\Z)'\
      r'(?:^(?![\t ]*\r?\n).+\n?)*'\
      r'(?:[\t ]*\r?\n)*'
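
A quick check of the edge case described in this edit, on an invented sample whose last line contains the keyword and has no trailing newline. The names old_pat and new_pat are used here only to tell the two versions apart:

```python
import re

old_pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
           r'.*nasdaq.*\n'
           r'(?:^(?![\t ]*\r?\n).+\n?)*')

new_pat = (r'(?:^(?![\t ]*\r?\n).+\n)*?'
           r'.*?nasdaq.*(\n|\Z)'
           r'(?:^(?![\t ]*\r?\n).+\n?)*'
           r'(?:[\t ]*\r?\n)*')

# The keyword sits on the very last line, which is not terminated by \n.
sample = "database a\nindex a\n\ndatabase b\nsuffix nasdaq"

with_old = re.sub(old_pat, '', sample, flags=re.M)  # keyword line needs a \n: no match
with_new = re.sub(new_pat, '', sample, flags=re.M)  # \Z accepts the end of the string

print(repr(with_old))
print(repr(with_new))
```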
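
Another approach is to split the file into fixed-size chunks and drop the chunks that contain the keyword:

with open('nasdaq.txt','r') as f:
    text = f.read().splitlines()

text = text[7:] # get rid of include headers (6 lines plus a blank line in the file shown)
n = 20 # your chunks are about this size (19 lines plus a blank separator)

# sort chunks into list of lists
groups = []
for i in range(0, len(text), n):
    chunk = text[i:i+n]
    groups.append(chunk)

# get rid of unwanted lists by keyword
# (build a new list: popping from a list while iterating over it skips items)
groups = [g for g in groups if not any('nasdaq' in x for x in g)]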
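
The snippet above stops once the chunks are filtered. One possible way to finish the job is sketched below as a function that keeps the header and flattens the surviving chunks back into lines. The name drop_chunks and its parameters are illustrative, and the fixed sizes are assumptions that must match the actual file:

```python
def drop_chunks(lines, keyword, header_len, chunk_len):
    """Keep the header, drop every fixed-size chunk that mentions keyword."""
    header, body = lines[:header_len], lines[header_len:]
    # Cut the body into consecutive chunks of chunk_len lines.
    chunks = [body[i:i + chunk_len] for i in range(0, len(body), chunk_len)]
    kept = [c for c in chunks if not any(keyword in line for line in c)]
    # Flatten the surviving chunks back into a single list of lines.
    return header + [line for c in kept for line in c]

lines = ['h1', '', 'a1', 'a2', '', 'b1 nasdaq', 'b2', '', 'c1', 'c2', '']
remaining = drop_chunks(lines, 'nasdaq', header_len=2, chunk_len=3)
print('\n'.join(remaining))
```

The fixed chunk size makes this fragile: a single extra or missing line shifts every later chunk, so splitting on blank lines is usually the safer choice.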
