Parsing and removing a matched string up to and including 3 escaped characters

I have a file that is a 10,000-line Perl variable. This variable defines apps and their dependencies. Here is what that file looks like:

'im-an-app' =>
{    
    do-this =>
    {    
        needs => [ 'ruby', 'jboss', 'jquery' ],
        process =>
        [    
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'qucikly' },
        ],
    },
},

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

'im-an-app3' =>
{
    needs =>
    {
        requires => [ 'ruby', 'jboss', 'jquery', 'sass' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'quickly' },
        ],
    },
},

I have a list of the apps I'd like to remove from the file in a separate list.txt file that looks like:

im-an-app1
im-an-app3
im-an-app16
im-an-app17
im-an-app29

These apps all have different names and I'm using placeholders here. I have about 500 that I need to iterate over, match, and remove from my app file.

I've loaded the file into IRB, and when I read the file I get output in a format like this:

instances\\n\\t##\\n\\n\\t'im-an-app' =>\\n\\t{\\n\\t\\tdo-this =>\\n\\t\\t{\\n\\t\\t\\tneeds => [ 'ham-and-cheese-sandwich' ],\\n\\t\\t},\\n\\t},\\n\\n\\t'im-the-next-app' =>\\n\\t{\\n\\t\\tneeds =>\\n\\t\\t{\\n\\t\\t\\t# im a comment about this app\\n\\t\\t\\t# im another comment\\n\\t\\t\\tneeds => [ 'backlava', 'cand-corns', 'popscicles', 'yum-yum-bars', 'the-bomb-sauce', 'corndogs' ],\\n\\t\\t\\tdo-this =>\\n\\t\\t\\t[\\n\\t\\t\\t\\t{ say => 'hi' },\\n\\t\\t\\t\\t{ say => 'bye' },\\n\\t\\t\\t\\t{ yell => 'i-love-gold' },\\n\\t\\t\\t],\\n\\t\\t},\\n\\t},\\n\\n\\t'im-the-third-app' =>\\n\\t{\\n\\t\\tdothis =>\\n\\t\\t{\\n\\t\\t\\tneeds => [ 'junk', 'jazz', 'json', 'jiffylube ],\\n\\t\\t\\tprocess =>\\n\\t\\t\\t[\\n\\t\\t\\t\\t{ say => 'hi' },\\n\\t\\t\\t\\t{ say => 'bye' },\\n\\t\\t\\t\\t{ say => 'goonies' },\\n\\t\\t\\t],\\n\\t\\t},\\n\\t},\\n\\n\\t'im-yet-anotherapp'

I have noticed that the only constant delimiter is a \\n\\n\\t that exists only before the definition of the new app. I'd like to search through the read file, delete the reference to each application in my list and all of its subsequent information up to and including the \\n\\n\\t.

I'm using Ruby and IRB to do this but I'm open to using other forms of manipulation.

Thanks guys!

If you want Python, this may be a start (untested, so it may have bugs):

import re

with open('yourfilename', 'r') as f:
    data = f.read().split('\n\n\t')

# Then you can use some regex to find what you want.
# Build a new list of the entries to keep: `del entry` inside a for loop
# only unbinds the loop variable and removes nothing from the list.
kept = [entry for entry in data if not re.search('yourpattern', entry)]

# Save the results to another file.
with open('outputfile', 'wt') as fout:
    fout.write('\n\n\t'.join(kept))
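The "use some regex" step above could be filled in along these lines. This is only a minimal sketch, assuming every entry begins with the quoted app name followed by `=>`; `NAME_RE` and `filter_entries` are hypothetical names that do not appear in the original answer.

```python
import re

# Assumption: each entry starts (after optional whitespace) with 'app-name' =>
NAME_RE = re.compile(r"\s*'([^']+)'\s*=>")

def filter_entries(text, names_to_remove, sep="\n\n\t"):
    """Return text with every entry whose app name is in names_to_remove dropped."""
    kept = []
    for entry in text.split(sep):
        match = NAME_RE.match(entry)
        if match and match.group(1) in names_to_remove:
            continue  # skip the whole block belonging to this app
        kept.append(entry)
    return sep.join(kept)

# Tiny in-memory sample standing in for the real file contents.
sample = "header\n\n\t'app-a' =>\n\t{\n\t},\n\n\t'app-b' =>\n\t{\n\t},"
result = filter_entries(sample, {"app-a"})
print(result)
```

Matching the name exactly against a set, rather than searching the entry for a substring, avoids removing an app such as im-an-app16 when only im-an-app1 is on the list.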

(EDIT: Updated based on new sample data)

This awk one-liner, demonstrated below, loads the list of apps into an array (adding surrounding single quotes so as to match the app file). Then, for the app file, it changes the record separator to be one or more blank lines (RS=""). For each record, it prints only the ones that weren't in the list:

$ awk -v ORS="\n\n" -v q="'" 'NR==FNR{a[q $1 q];next} !($1 in a)' app-list.txt RS="" apps.txt

Explanation

-v ORS="\n\n"

Set the Output Record Separator to keep an extra newline between app records when writing them out.

-v q="'"

This is just a convenient way to use a literal single quote in the one-liner. Because the awk program itself is wrapped in single quotes, embedding one directly can be a pain.

NR==FNR{a[q $1 q];next}

When NR==FNR we are reading the first file, the list of apps (go to http://backreference.org/2010/02/10/idiomatic-awk/ and search for "Two-file Processing"). For each app in the list, surround it with single quotes and enter it into the array a.

!($1 in a)

Once we get here we know we are reading the apps file (again, see above link). In this file each app block is considered a single record (see RS="", below). $1 is the name of the app in quotes. We check to see if the name is in the array a and, if not, we perform the default action, which is to simply print the record.

app-list.txt RS="" apps.txt

These are the files to be processed. Awk allows you to change RS, the record separator, between files. For the app list, the defaults are fine, but for the apps themselves we set the record separator to the empty string. As the docs say, "By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines", which is quite convenient for this application.
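For readers less familiar with awk, the record and field handling described above can be mimicked roughly in Python. This is a sketch of the idea only, not a reimplementation of awk; `paragraph_records` and `first_field` are hypothetical helper names, and the sample text is made up.

```python
import re

def paragraph_records(text):
    """Split text into records separated by one or more blank lines,
    roughly what awk does when RS is the empty string."""
    # Strip leading/trailing newlines first so no empty records appear.
    return [rec for rec in re.split(r"\n\n+", text.strip("\n")) if rec]

def first_field(record):
    """Return the first whitespace-delimited field, similar to awk's $1."""
    fields = record.split()
    return fields[0] if fields else ""

apps = "'app-a' =>\n{\n},\n\n'app-b' =>\n{\n},\n\n\n'app-c' =>\n{\n},\n"
unwanted = {"'app-a'", "'app-c'"}  # names already wrapped in single quotes
kept = [rec for rec in paragraph_records(apps) if first_field(rec) not in unwanted]
print("\n\n".join(kept))
```

As in the awk version, the names in the lookup set carry their single quotes so they compare equal to the first field of each record.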

Demonstration:

$ cat app-list.txt
im-an-app1
im-an-app3
im-an-app16
im-an-app17
im-an-app29


$ cat apps.txt
'im-an-app1' =>
{
    do-this =>
    {
        needs => [ 'ruby', 'jboss', 'jquery' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'qucikly' },
        ],
    },
},

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

'im-an-app3' =>
{
    needs =>
    {
        requires => [ 'ruby', 'jboss', 'jquery', 'sass' ],
        process =>
        [
            { say => 'hi' },
            { speak => 'loudly' },
            { read => 'quickly' },
        ],
    },
},


$ awk -v ORS="\n\n" -v q="'" 'NR==FNR{a[q $1 q];next} !($1 in a)' app-list.txt RS="" apps.txt

'im-an-app2' =>
{
    do-this =>
    {
        needs => [ 'ruby' ], # there is a comment here
    },
},

This could be done in Python as follows:

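import re

remove = {'im-an-ap', 'im-an-ap-5', 'im-an-ap-10'}

def replace(re_app):
    # group(2) is the app name without quotes; group(1) is the whole entry.
    if re_app.group(2) in remove:
        return ""
    else:
        return re_app.group(1)

# [ \t]* lets the entry's first line be indented, as in the question's file.
pattern = r"(^[ \t]*'(.*?)' =.*?(?=\n\n\t|\Z))"

with open('input.txt') as f_input, open('output.txt', 'w') as f_output:
    f_output.write(re.sub(pattern, replace, f_input.read(), flags=re.S | re.M))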

This will load the file input.txt, remove all of the unwanted entries and create a new file called output.txt.
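With about 500 names to remove, the hard-coded `remove` set would probably be built from the list file mentioned in the question instead. A minimal sketch, assuming one app name per line; `load_names` is a hypothetical helper, and the temporary file only stands in for a real list.txt so the snippet can run on its own.

```python
import os
import tempfile

def load_names(path):
    """Read one app name per line, ignoring blank lines and surrounding whitespace."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Self-contained check using a temporary file in place of a real list.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("im-an-app1\nim-an-app3\n\n")
names = load_names(tmp.name)
os.remove(tmp.name)
print(sorted(names))
```

In practice one would write something like `remove = load_names('list.txt')` before the call to `re.sub`.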
