I have a set of 3 files, encoded with a date:
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
The last three numbers represent the date:
2815
11816
112116
I need to extract just the digits corresponding to the date in the filename, using a single regex filter that also transforms the result into MMDDYY format:
020815
110816
112116
Thanks for your help!
awk -F'[_.]' '{printf "%02d%02d%02d\n",$(NF-3),$(NF-2),$(NF-1)}'
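For anyone puzzled by the field arithmetic: with -F'[_.]' the line is split on every underscore or dot, so the last field is "csv" and the three fields before it are month, day and year. Below is a minimal sketch that feeds the sample names through the same awk program. The wrapper name mmddyy is made up for illustration and is not part of the answer above.

```shell
# mmddyy is a hypothetical wrapper name used only for this sketch.
# It reads filenames on stdin and prints the date as MMDDYY.
mmddyy() {
    # Fields split on "_" or "."; $NF is "csv",
    # $(NF-3), $(NF-2), $(NF-1) are month, day, year.
    awk -F'[_.]' '{ printf "%02d%02d%02d\n", $(NF-3), $(NF-2), $(NF-1) }'
}

printf '%s\n' \
    abc1_bbb_yyy_2_8_15.csv \
    abd1_bba_yzy_11_8_16.csv \
    aby1_qba_yay_11_21_16.csv | mmddyy
# prints:
# 020815
# 110816
# 112116
```

Counting fields from the end (NF) rather than from the start means the prefix may contain any number of underscores.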
This seemed like a fun question to try and solve with sed.
I kind of like TessellatingHeckler's perl approach better. :-)
edit: Having slept on it, I like jthill's awk approach even better.
Technically interesting to try solving with sed but not something I'd want to live with long term.
Sample data file...
$ cat foo.dat
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
$
Note that sed -r enables extended regular expressions.
$ sed -rf foo.sed < foo.dat
020815
110816
112116
$
Here is foo.sed. Normally I'm not nearly this verbose. :-)
But I thought the comments would make the purpose clearer.
# Put a wedge between "prefix" and "date.CSV" part.
# We don't salvage the .csv extension, that drops off here.
# Note the space padding before/after \1, we'll use that shortly.
s/([0-9_]+)\.csv/ \1 /g
# in: "abc1_bbb_yyy_2_8_15.csv"
# out: "abc1_bbb_yyy _2_8_15 "
# (If I knew how to do non-greedy matching in sed we could
# strip the prefix e.g. "abc1_bbb_yyy" part here as well,
# but if we try that we end up with just "_15 ", e.g. our
# other month & day get eaten).
# Hence sacrificial space character that our
# next substitution will use to cut the prefix.
# Cut the prefix.
# strip up to, but not including, the first non-space char.
# (I don't think you can do non-greedy matching in sed).
s/^.* ([^ ])/\1/
# in: "abc1_bbb_yyy _2_8_15 "
# out: "_2_8_15 "
# change our underscores to two space chars.
# (turns out we need two intermediate spaces for
# the next substitution to work as a single "global" substitution)
s/_/  /g
# in:  "_2_8_15 "
# out: "  2  8  15 "
# At this point all of our month/day/year parts
# have *two* spaces between them.
# Next we do zero-padding if necessary.
s/ ([0-9]) / 0\1 /g
# Important: we're looking for a single space before
# and after any single digit.
# in:  "  2  8  15 "
# out: "  02  08  15 "
# input broken out by single chars with "spc" = 1 space char.
#          +---+---+---+---+---+---+---+---+---+---+---+
#  input:  |spc|spc| 2 |spc|spc| 8 |spc|spc| 1 | 5 |spc|
#          +---+---+---+---+---+---+---+---+---+---+---+
#               \_________/ \_________/   no match, not
#                 match 1     match 2     a single digit.
#          +---+---+---+---+---+---+---+---+---+---+---+---+---+
#  result: |spc|spc| 0 | 2 |spc|spc| 0 | 8 |spc|spc| 1 | 5 |spc|
#          +---+---+---+---+---+---+---+---+---+---+---+---+---+
# Without two spaces between the numbers, the matches for
# adjacent single digits would have to share one space, so a
# single global s/ ([0-9]) / 0\1 /g would skip every second
# one and we would need separate substitutions instead.
# Pretty much done, just strip the spaces.
s/ //g
# in:  "  02  08  15 "
# out: "020815"
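If you want to watch the pattern space change, here is a rough sketch that runs the same five substitutions and prints the result after each one. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do; -r is the GNU-only spelling). The function name trace_wedge is invented for this illustration, and the second sed only wraps each line in brackets so the leading and trailing spaces are visible.

```shell
# trace_wedge is a hypothetical helper name, not part of foo.sed.
# -n suppresses automatic printing; "p" prints after each step.
trace_wedge() {
    printf '%s\n' "$1" | sed -E -n \
        -e 's/([0-9_]+)\.csv/ \1 /; p' \
        -e 's/^.* ([^ ])/\1/; p' \
        -e 's/_/  /g; p' \
        -e 's/ ([0-9]) / 0\1 /g; p' \
        -e 's/ //g; p' | sed 's/.*/[&]/'
}

trace_wedge abc1_bbb_yyy_2_8_15.csv
# prints:
# [abc1_bbb_yyy _2_8_15 ]
# [_2_8_15 ]
# [  2  8  15 ]
# [  02  08  15 ]
# [020815]
```

The third line of output shows why each underscore becomes two spaces: each single digit gets a space of its own on both sides, so the global zero-padding substitution can match the 2 and the 8 in one pass.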
As others have pointed out, sed is not the most elegant tool for this job. With perl:
fn='abc1_bbb_yyy_2_8_15.csv abd1_bba_yzy_11_8_16.csv aby1_qba_yay_11_21_16.csv'
for x in $fn; do
echo $x | perl -n -e 'printf("%02d%02d%02d\n",/(\d+)_(\d+)_(\d+)\./)'
done
If you really are constrained to use sed, then here's a way. The first regex prepends a zero to every digit that directly follows an underscore. The second looks for strings of digits followed by an underscore or dot and keeps only the last 2 digits of each, dropping the delimiter as well. The last extracts the final string of 6 digits, discarding everything before it and any non-digits after it.
for x in $fn; do
echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
-e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
-e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
done
Result:
$ for x in $fn; do
> echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
> -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
> -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
> done
020815
110816
112116
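To check the three stages one by one, the same expressions can be run with -n and an explicit p after each substitution. This is only a sketch for inspection. The function name trace_three is made up here and is not part of the answer.

```shell
# trace_three is a hypothetical helper name used for this sketch.
# It prints the line after each of the three substitutions.
trace_three() {
    printf '%s\n' "$1" | sed -n \
        -e 's/_\([0-9]\)/_0\1/g; p' \
        -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g; p' \
        -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/; p'
}

trace_three abc1_bbb_yyy_2_8_15.csv
# prints:
# abc1_bbb_yyy_02_08_015.csv
# abc1_bbb_yyy_020815csv
# 020815
```

Note that the first stage also pads the two-digit year (15 becomes 015). The second stage undoes that by keeping only the last two digits of each run.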
Try this:
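REST=cat # Whatever the rest of your pipeline is...
# The two cuts keep fields 4-6 (month_day_year) and drop ".csv"; this
# assumes exactly three underscore-separated fields precede the date.
# sed then pads every number with a leading 0, trims padded
# three-digit numbers back to two digits, and deletes the underscores.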
( cat <<EOF
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
EOF
)\
| cut -d_ -f4-6 \
| cut -d. -f1 \
| sed -e 's/\([0-9][0-9]*\)/0\1/g' \
-e 's/0\([0-9][0-9]\)/\1/g' \
-e 's/_//g' \
| $REST
Put the filenames into t.txt
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
Then
$ cat t.txt | perl -p -e 's/(?<=_)(\d)(?=_)/0\1/g' | perl -p -e 's/.*(\d\d)_(\d\d)_(\d\d)\.csv/\1\2\3/'
020815
110816
112116
This isn't exactly sed/awk/grep, because sed can't do lookarounds and I don't want to AWK right now, but it is regex, and *nixy.
[Edit: Okay, downvoters who don't like Perl: my approach was to prefix single digits with 0 first, then extract the double-digit pairs. sed makes that hard with no lookarounds or non-capturing groups, but here is a sed answer using @jgreve's idea of putting a wedge in first. This also includes output in YYYYMMDD format, assuming all years are 20xx:
# 1) wedge: double every underscore
# 2) zero-pad single digits: n to 0n
# 3) extract __mm__dd__yy and emit 20yymmdd
cat t.txt | sed -e 's/_/__/g' -e 's/_\([0-9]\)_/_0\1_/g' -e 's/.*__\([0-9][0-9]\)__\([0-9][0-9]\)__\([0-9][0-9]\)\.csv/20\3\1\2/'
]