I have a file that I need to post-process. The sample format is given below:
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
Expected output is:
1.1.1.1 abcd
The parsing rule is: if a line starts with an IP address, take everything from "from" to "F" and replace it according to which of my strings occurs in it. I want to use sed, but I don't know whether sed can match multiple strings, e.g. malc0de or abc. Perhaps I need a more complete script than just a one-liner, storing the string values in an array. Any ideas? By the way, examples using sed would be most welcome.
So far, with d in sed, I can delete the line and redirect the output to a file:
#!/bin/bash
sed -is/\[a-zA-Z]\/d test ./infile > testme.txt
sed -is/\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}/s+\Intel::DOMAIN\\s*from(.*?)\s+F\1malc0de
Or I'm thinking of saving the strings in an array, like ARRAY=(malc0de abcd), and then using ${ARRAY[2]} in place of the capturing group. Will that work? Or can I do something like in .NET: take the substring between "from" and "F", copy the result into a string variable, then search it for my strings (e.g. malc0de), and if one is found, replace the searched pattern with the matched result? But I don't know bash...
Update: with the awk script I now get this clean output:
1.1.1.1 www.abc.com
1.1.2.2 def.com
2.2.2.2 mnx.dbc.net
However, I want the second column, after the IP address, to be shortened to a string of my own choice. For example, in the second column I only accept:
abc def mnx
Once one of them is found, the entire string should be replaced by it:
1.1.1.1 abc
1.1.2.2 def
2.2.2.2 mnx
Thanks.
You mentioned that sed solutions are most welcome, but I believe awk would be the easiest tool for your particular task. Here's my solution:
awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n" }' inputFile.txt
The idea is simple: by default awk uses blank space as the field separator and allows printing specific fields. So first we match lines that start with an IP address (four digit-dot alternating patterns) and print the first field. Then we replace the http:// and .com parts with spaces, so the domain name is the only thing left between them and becomes field 4, which we print next. The rest is not told to be printed, hence it is ignored.
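To see why the domain ends up in field 4, here is a minimal sketch that prints field 4 of one sample line before and after the two substitutions. The variable names line, before and after are only for illustration.

```shell
#!/bin/sh
# One sample line from the question.
line='1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F'

# Before any substitution, field 4 is the whole URL.
before=$(printf '%s\n' "$line" | awk '{ print $4 }')

# gsub() edits $0, and awk re-splits $0 into fields afterwards,
# so the text between "http://" and ".com" becomes field 4 on its own.
after=$(printf '%s\n' "$line" |
    awk '{ gsub(/http:\/\//, " "); gsub(/\.com/, " "); print $4 }')

echo "before: $before"
echo "after:  $after"
```

The first echo shows the full URL and the second shows only abcd, which is what the one-liner prints next to the IP address.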
If you want the original file itself to be edited, note that awk cannot edit files in place, unless it is gawk (GNU awk) 4.1 or later, which provides the -i inplace option. With other versions, use a temp file for that purpose.
Demo, starting with my input file:
xieerqi:$ cat inputFile.txt
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
whatever.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
2.2.2.2 Intel::DOMAIN from http://asdf.com/bl/BOOT via intel.criticalstack.com F
Command with temp file transfer (notice that my inputFile.txt is in my home directory; adjust that part accordingly). NOTE: always keep a backup of the original file, just in case! Alternatively, run only the part of the command before &&, check the temp file, and if you like the result, cat it into the original file.
awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n" }' inputFile.txt > /tmp/temp.txt && cat /tmp/temp.txt > $HOME/inputFile.txt
output after the command ran:
xieerqi:$ awk '/^[[:digit:]]\.[[:digit:]]\.[[:digit:]]\.[[:digit:]]/ { printf $1; gsub (/http\:\/\//," "); gsub(/\.com/," ");printf " "$4"\n" }' inputFile.txt > /tmp/temp.txt && cat /tmp/temp.txt > $HOME/inputFile.txt
xieerqi:$ cat inputFile.txt
1.1.1.1 abcd
2.2.2.2 asdf
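The same temp-file approach can be made a little safer. The sketch below is only an illustration: it builds a throwaway demo file (the names target and tmp are made up here), keeps a backup copy, and uses mktemp instead of a fixed name under /tmp. It also uses [0-9]+ rather than [[:digit:]] so that each part of the address may have several digits.

```shell
#!/bin/sh
# Hypothetical demo file; in practice point "target" at your own file.
target=$(mktemp)
cat > "$target" <<'EOF'
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
2.2.2.2 Intel::DOMAIN from http://asdf.com/bl/BOOT via intel.criticalstack.com F
EOF

cp "$target" "$target.bak"      # keep a backup of the original
tmp=$(mktemp)                   # unique temp file instead of a fixed /tmp name

# Overwrite the original only if awk finished successfully.
awk '/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ {
    ip = $1                     # save the address before $0 is modified
    gsub(/https?:\/\//, " ")    # replace the URL scheme with a space
    gsub(/\.com/, " ")          # replace ".com" with a space
    print ip, $4                # the domain name is now field 4
}' "$target" > "$tmp" && cat "$tmp" > "$target"
rm -f "$tmp"

cat "$target"
```

If something goes wrong, the untouched original is still available as the .bak copy next to the target file.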
Simplification through scripting
The above command can be placed into a script with the following contents:
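#!/usr/bin/awk -f
# Handle only lines that begin with a dotted IP address.
/^[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+/ {
    ip = $1                      # save the address before $0 is modified
    gsub(/https?:\/\//, " ")     # replace http:// or https:// with a space
    gsub(/\.com/, " ")           # replace .com with a space
    print ip, $4                 # the domain name is now field 4
}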
Notice that in the script I've allowed for multiple digits in each part of the IP address, as well as for https in the URL.
Remember to make the script executable with chmod 755 /path/to/script
Here's the demo:
xieerqi:$ chmod 755 ipanddomain.awk
xieerqi:$ cat inputFile.txt
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
whatever.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
192.168.0.2 Intel::DOMAIN from https://asdf.foobar.whatever.com/bl/BOOT via intel.criticalstack.com F
xieerqi:$ ./ipanddomain.awk inputFile.txt
1.1.1.1 abcd
192.168.0.2 asdf.foobar.whatever
To edit the original file, use the trick of redirecting to a temp file and back to the original, as I showed before.
Edit #2
So you've asked whether a part of the domain name that you already know can simply be matched and printed. I've edited my script a little. Basically, this version looks for a pattern in the $4 field, and if it finds it, it goes "OK, that string has abcd in it, so I'll just print that".
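#!/usr/bin/gawk -f
/^[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+\.[[:digit:]]+/ {
    printf "%s ", $1
    matchDomain($4)
}
# Print the first known keyword found in str.
function matchDomain(str) {
    if (str ~ /foobar/)
        print "foobar"
    else if (str ~ /abcd/)
        print "abcd"
    else
        print ""    # no keyword found: just end the line
}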
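If the list of accepted strings grows, the if chain can be replaced by a loop over a keyword list, which is close to the array idea from the question. The following is a rough sketch under a few assumptions: the keyword list, the variable names and the sample lines are made up for illustration, and lines without any known keyword are simply dropped.

```shell
#!/bin/sh
# Keywords to accept, in order of preference (hypothetical list).
keywords='malc0de abcd foobar'

# Sample lines in the format from the question.
input='bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
192.168.0.2 Intel::DOMAIN from https://asdf.foobar.whatever.com/bl/BOOT via intel.criticalstack.com F
2.2.2.2 Intel::DOMAIN from http://asdf.com/bl/BOOT via intel.criticalstack.com F'

result=$(printf '%s\n' "$input" | awk -v list="$keywords" '
BEGIN { n = split(list, kw, " ") }        # kw[1..n] holds the keywords
/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+ / {
    for (i = 1; i <= n; i++) {
        if (index($4, kw[i]) > 0) {       # plain substring test, no regex
            print $1, kw[i]
            next                          # first matching keyword wins
        }
    }
}')

printf '%s\n' "$result"
```

Using index() instead of a regex match means a keyword containing a dot or other special character is still compared literally.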
Try out this small guy:
sed -nE 's/^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) .* (https?:\/\/|www\.)([0-9a-z.]+)\.com.*$/\1 \3/p' inputfile > newfile
The idea is to use grouping with (): define the proper groups and then replace each matched line with only the groups you want, using \1, \2 etc. The -n option combined with the p flag displays only the replaced lines, and lines are replaced only if they match the pattern. If you want to keep the non-matching lines as well, remove both -n and p.
Input file:
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://abcd.com/bl/BOOT via intel.criticalstack.com F
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
123.1.1.1 Intel::DOMAIN from http://abcd12.bcd.com/bl/BOOT via intel.criticalstack.com F
bigspeedpro.com Intel::DOMAIN from https://malc0de.com/bl/BOOT via intel.criticalstack.com F
87.1.4.1 Intel::DOMAIN from http://abcdtdd.com/bl/BOOT via intel.criticalstack.com F
bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
192.168.1.1 Intel::DOMAIN from www.abcdbc12a.bdf12.com/bl/BOOT via intel.criticalstack.com F
Output newfile:
1.1.1.1 abcd
123.1.1.1 abcd12.bcd
87.1.4.1 abcdtdd
192.168.1.1 abcdbc12a.bdf12
Update: I updated my answer and changed the sed command a little. Now it can handle http/https/www and returns whatever is between http/https/www and .com. And it is still a relatively short one-liner.
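As for the question of whether sed can match several strings at once: with -E it can, through alternation inside a group. Below is a small sketch, assuming the three keywords from the question's update (abc, def, mnx) and some made-up sample lines in the same format. The variable names keywords, input and result are only for illustration.

```shell
#!/bin/sh
# Accepted keywords, joined with | so they form a regex alternation.
keywords='abc|def|mnx'

# Made-up sample lines in the format from the question.
input='bigspeedpro.com Intel::DOMAIN from http://malc0de.com/bl/BOOT via intel.criticalstack.com F
1.1.1.1 Intel::DOMAIN from http://www.abc.com/bl/BOOT via intel.criticalstack.com F
1.1.2.2 Intel::DOMAIN from http://def.com/bl/BOOT via intel.criticalstack.com F
2.2.2.2 Intel::DOMAIN from https://mnx.dbc.net/bl/BOOT via intel.criticalstack.com F
3.3.3.3 Intel::DOMAIN from http://other.org/bl/BOOT via intel.criticalstack.com F'

# Group 1 is the IP address; group 2 is whichever keyword occurs in the
# field right after "from". Lines without a keyword are not printed.
result=$(printf '%s\n' "$input" |
    sed -nE "s/^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) .* from [^ ]*($keywords).*\$/\1 \2/p")

printf '%s\n' "$result"
```

Because the keyword is searched only in the field after "from", a keyword that happens to appear elsewhere on the line is not picked up.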