Parsing, reformatting log file using sed or perhaps a script?

Question

I have a file that I need to post-process. A sample of the format is given below:

bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F

Expected output:

1.1.1.1 abcd

The parsing rules are:

Delete any line that doesn't start with an IP address.
If the line starts with an IP address:
- delete Intel::DOMAIN
- replace everything between "from" and "F" according to which of the following strings occurs in it
- e.g. malc0de or abcd

I want to use sed, but I don't know whether sed can match multiple strings, e.g. malc0de or abc. Perhaps I need a more complete script than just a one-liner, storing the string values in an array. Any ideas? By the way, examples using sed would be most welcome.

So far

I know that the d command in sed deletes a line, and that I can redirect the output to a file.
I know how to write a regex for lines that are not an IP address: [^a-zA-Z]
I'm stuck on replacing based on multiple alternative strings.

 #!/bin/bash
 sed -is/\[a-zA-Z]\/d test ./infile > testme.txt
 sed -is/\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}/s+\Intel::DOMAIN\\s*from(.*?)\s+F\1malc0de

Or I'm thinking of saving the strings in an array, like ARRAY=(malc0de abcd)

Then, in place of the capturing group, I could use ${ARRAY[2]}. Will that work? Or I could do something like in .NET: take the substring between "from" and "F", copy it into a string variable, then search it for my strings, e.g. malc0de, and if one is found, replace the searched pattern with the matched result. But I don't know bash...

Update: With the awk script, my output is this clean:

 1.1.1.1 www.abc.com
 1.1.2.2 def.com
 2.2.2.2 mnx.dbc.net

However, I want the second column, after the IP address, to be shortened to a string of my own choice. For example, in the second column I only accept

abc def mnx

Once one of them is found, just replace the entire string, giving

1.1.1.1 abc
1.1.2.2 def
2.2.2.2 mnx

Thanks.

Answer 1

You mentioned that sed solutions are most welcome, but I believe awk is the easiest tool for this particular task. Here's my solution:

awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n"  }' inputFile.txt

The idea is simple: by default awk splits each line into fields on whitespace and lets you print specific fields. First we match lines that start with an IP address (four single digits separated by dots) and print the first field. Then we replace the http:// and .com parts with spaces, which leaves the domain name standing alone as field 4, and we print that next. Nothing else is printed, so the rest of the line is dropped.
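
To see why the domain ends up in field 4, it can help to print $4 before and after the two substitutions. This is only a small sketch built on the sample line from the question; the function name show_field4 is made up for illustration.

```shell
#!/bin/sh
# show_field4 (hypothetical name): print field 4 before and after the
# two gsub calls, to show that awk re-splits the record once $0 changes.
show_field4() {
    awk '{
        print "before: " $4
        gsub(/http:\/\//, " ")   # modifies $0, so the fields are re-split
        gsub(/\.com/, " ")
        print "after:  " $4
    }'
}

printf '%s\n' '1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F' | show_field4
```

For the sample line, field 4 is the whole URL before the substitutions and just abcd afterwards.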

If you want the original file to be edited, note that awk has a quirk: it cannot do in-place editing unless it is gawk (GNU awk) 4.1 or later, which offers -i inplace. Otherwise, use a temp file for that purpose.
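
The temp-file approach can be wrapped in a small helper so the original is only overwritten when awk succeeds. This is a sketch, not the one true way: rewrite_in_place is a hypothetical name, and it assumes mktemp is available on your system.

```shell
#!/bin/sh
# rewrite_in_place (hypothetical helper): run an awk program on a file and
# replace the file's contents with the result only if awk succeeded.
rewrite_in_place() {
    prog=$1
    file=$2
    tmp=$(mktemp) || return 1          # assumes mktemp exists on this system
    if awk "$prog" "$file" > "$tmp"; then
        cat "$tmp" > "$file"           # keeps the original file's owner and permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (keeps only the first field of every line in inputFile.txt):
# rewrite_in_place '{ print $1 }' inputFile.txt
```

It is still wise to keep a backup copy of the file before trying this on real data.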

Demo:

My input file:

xieerqi:$ cat inputFile.txt                                               
bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F

whatever.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
2.2.2.2 Intel::DOMAIN   from http://asdf.com/bl/BOOT via intel.criticalstack.com     F

Command with temp file transfer (notice that my inputFile.txt is in my home directory; adjust that part accordingly). NOTE: always, always, always keep a backup of the original file, just in case! Alternatively, run only the part of the command before &&, check the temp file, and if you like the result, cat it into the original file.

awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n"  }' inputFile.txt > /tmp/temp.txt && cat /tmp/temp.txt > $HOME/inputFile.txt

Output after the command ran:

xieerqi:$ awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n"  }' inputFile.txt > /tmp/temp.txt && cat /tmp/temp.txt > $HOME/inputFile.txt


xieerqi:$ cat inputFile.txt                                                                                                                           
1.1.1.1 abcd
2.2.2.2 asdf

Simplification through scripting

The above command can be placed into a script with the following contents:

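#!/usr/bin/awk -f

# Act only on lines that start with an IPv4-like address (one or more digits per part).
/^[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+/ {
    # Blank out the URL scheme and ".com" so the domain becomes field 4.
    gsub(/https?:\/\//, " ");
    gsub(/\.com/, " ");
    # Explicit format string, so a "%" in the data cannot be misread as a format.
    printf "%s %s\n", $1, $4;
}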

Notice that in the script I've allowed for multiple digits in each part of the IP address, as well as for https in the URL.

Remember to make the script executable with chmod 755 /path/to/script

Here's the demo:

xieerqi:$ chmod 755 ipanddomain.awk                                                                                                                   

xieerqi:$ cat inputFile.txt                                                                                                                           
bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F

whatever.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
192.168.0.2 Intel::DOMAIN   from https://asdf.foobar.whatever.com/bl/BOOT via intel.criticalstack.com     F

xieerqi:$ ./ipanddomain.awk inputFile.txt                                                                                                             
1.1.1.1 abcd
192.168.0.2 asdf.foobar.whatever

To edit the original file, use the same trick of redirecting to a temp file and back to the original, as shown earlier.

Edit #2

So you've asked whether, when the domain contains a string you already know, just that string can be printed. I've edited my script a little. Basically, this version looks for a pattern in the $4 field, and if it finds it, it goes "OK, that string has abcd in it, so I'll just print that".

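#!/usr/bin/gawk -f

# For lines starting with an IPv4-like address, print the IP and a short label.
/^[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+/ {
    printf "%s ", $1;
    matchDomain($4);
}

# Print the first known keyword found in str, then end the line.
# If no keyword matches, only the newline is printed.
function matchDomain(str) {
    if (str ~ /foobar/)
        print "foobar";
    else if (str ~ /abcd/)
        print "abcd";
    else
        print "";
}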
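
If the list of accepted strings grows, the chain of if tests could be replaced by a loop over a keyword list. The following is a sketch under assumptions: the keywords abc, def and mnx and the two-column input come from the question's update, the function name shorten is hypothetical, and lines without a known keyword are skipped, which is a choice the question does not specify.

```shell
#!/bin/sh
# shorten (hypothetical name): for lines starting with an IPv4-like address,
# print the IP and the first keyword from the list that occurs in field 2.
# Assumes input like "1.1.1.1 www.abc.com", as in the question's update.
shorten() {
    awk -v keys="${1:-abc def mnx}" '
        BEGIN { n = split(keys, k, " ") }
        /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[ \t]/ {
            for (i = 1; i <= n; i++)
                if (index($2, k[i])) {      # plain substring test, no regex
                    print $1, k[i]
                    break
                }
        }'
}

printf '%s\n' '1.1.1.1 www.abc.com' '1.1.2.2 def.com' '2.2.2.2 mnx.dbc.net' | shorten
```

A different keyword list can be passed as the first argument, for example shorten 'malc0de abcd'.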

Answer 2

Try out this small guy:

sed -nE 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) .* [htpsw:\/.]{4,8}([0-9a-z.]+)\.com.*$/\1 \2/p' inputfile > newfile

The idea is to use grouping with (): define the proper groups, then replace each matched line with just the groups, using \1 \2 and so on. The -n option combined with the p flag prints only the lines where a substitution was made, and a substitution is made only if the line matches the pattern. If you want to keep the non-matching lines as well, remove both -n and p.
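
The same grouping idea also covers the question's "match one of several strings" case, since a group may contain alternatives separated by |. A minimal sketch, assuming the keywords malc0de and abcd from the question and a sed that supports -E; the function name ip_and_keyword is hypothetical.

```shell
#!/bin/sh
# ip_and_keyword (hypothetical name): keep lines that start with an
# IPv4-like address and contain one of the keywords; print "IP keyword".
# -n suppresses the default output; the p flag prints only substituted lines.
# Group 1 is the whole IP, group 2 the repeated "digits." part, group 3 the keyword.
ip_and_keyword() {
    sed -nE 's/^(([0-9]{1,3}\.){3}[0-9]{1,3})[[:space:]].*(malc0de|abcd).*$/\1 \3/p'
}

printf '%s\n' \
    'bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F' \
    '1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F' |
    ip_and_keyword
```

Because .* is greedy, a line containing more than one keyword yields the last one on the line.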

Input file:

bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
1.1.1.1 Intel::DOMAIN   from http://abcd.com/bl/BOOT via intel.criticalstack.com     F
bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
123.1.1.1 Intel::DOMAIN   from http://abcd12.bcd.com/bl/BOOT via intel.criticalstack.com     F
bigspeedpro.com Intel::DOMAIN   from https://malc0de.com/bl/BOOT via intel.criticalstack.com     F
87.1.4.1 Intel::DOMAIN   from http://abcdtdd.com/bl/BOOT via intel.criticalstack.com     F
bigspeedpro.com Intel::DOMAIN   from http://malc0de.com/bl/BOOT via intel.criticalstack.com     F
192.168.1.1 Intel::DOMAIN   from www.abcdbc12a.bdf12.com/bl/BOOT via intel.criticalstack.com     F

Output newfile:

1.1.1.1 abcd
123.1.1.1 abcd12.bcd
87.1.4.1 abcdtdd
192.168.1.1 abcdbc12a.bdf12

Update: I updated my answer and changed the sed command a little. It can now handle http/https/www and returns whatever is between http/https/www and .com. And it is still a relatively short one-liner.

2 answers

solution1
5 ACCEPTED 2015-10-17 06:51:14

solution2
2 2015-10-17 04:02:12