
sed/regex pattern to search and replace numbers in a filename

I have a set of 3 files, encoded with a date:

abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv

The last three numbers represent the date:

2815
11816
112116

I need to extract just the digits corresponding to the date in the filename, using a single regex filter that also transforms the result into MMDDYY format:

020815
110816
112116

Thanks for your help!

awk -F'[_.]' '{printf "%02d%02d%02d\n",$(NF-3),$(NF-2),$(NF-1)}'
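The one-liner above comes with no explanation, so here is a hedged sketch of how it can be run over the sample names. It assumes the date is always the last three `_`-separated pieces before the extension; `to_mmddyy` is a hypothetical helper name, not something from the answer.

```shell
#!/bin/sh
# to_mmddyy is a hypothetical helper name used only for illustration.
# It reads file names on stdin, one per line, and prints MMDDYY for each.
to_mmddyy() {
  # -F'[_.]' splits on underscore or dot, so the last field ($NF) is "csv"
  # and $(NF-3), $(NF-2), $(NF-1) are month, day and year.
  # %02d zero-pads each of them to two digits.
  awk -F'[_.]' '{ printf "%02d%02d%02d\n", $(NF-3), $(NF-2), $(NF-1) }'
}

printf '%s\n' \
  abc1_bbb_yyy_2_8_15.csv \
  abd1_bba_yzy_11_8_16.csv \
  aby1_qba_yay_11_21_16.csv | to_mmddyy
# prints:
# 020815
# 110816
# 112116
```

Because the fields are counted from the end of the line, the number of underscores in the prefix does not matter.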

This seemed like a fun question to try and solve with sed.

I kind of like TessellatingHeckler's perl approach better. :-)

edit: Having slept on it, I like jthill's awk approach even better.
Technically interesting to try solving with sed but not something I'd want to live with long term.

foo.dat

Sample data file...

$ cat foo.dat
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
$

Sample results

Note that sed -r enables extended regular expressions.

$ sed -rf foo.sed < foo.dat
020815
110816
112116
$ 

foo.sed

Normally I'm not nearly this verbose. :-)

But I thought the comments would make the purpose more clear.

# Put a wedge between "prefix" and "date.CSV" part.
# We don't salvage the .csv extension, that drops off here.
# Note the space padding before/after \1, we'll use that shortly.
s/([0-9_]+)\.csv/ \1 /g
#    in:  "abc1_bbb_yyy_2_8_15.csv"
#    out: "abc1_bbb_yyy _2_8_15 "
# (If I knew how to do non-greedy matching in sed we could
# strip the prefix e.g. "abc1_bbb_yyy" part here as well,
# but if we try that we end up with just "_15 ", e.g. our
# other month & day get eaten).
# Hence sacrificial space character that our
# next substitution will use to cut the prefix.

# Cut the prefix.
# strip up to, but not including, the first non-space char.
# (I don't think you can do non-greedy matching in sed).

s/^.* ([^ ])/\1/
#    in:  "abc1_bbb_yyy _2_8_15 "
#    out:              "_2_8_15 "

# change our underscores to two space chars.
# (turns out we need two intermediate spaces for
# the next substitution to work as a single "global" substitution)
s/_/  /g
#    in:   "_2_8_15 "
#    out:  "  2  8  15 "
# At this point all of our month/day/year parts 
# have *two* spaces between them.

# Next we do zero-padding if necessary.
s/ ([0-9]) / 0\1 /g
# Important: we're looking for a single space before
# and after any single digit.
#    in:  "  2  8  15 "
#   out:  "  02  08  15 "
# input broken out by single chars with "spc"= 1 space char.
#         +---+---+---+---+---+---+---+---+---+---+---+
# input:  |spc|spc| 2 |spc|spc| 8 |spc|spc| 1 | 5 |spc|
#         +---+---+---+---+---+---+---+---+---+---+---+
#              \         / \         /     no match, not
#               \       /   \       /      a single digit.
#                \     /     \     /
#                match 1     match 2
#                " 02 "      " 08 "
#         +---+---+---+---+---+---+---+---+---+---+---+---+---+
# result: |spc|spc| 0 | 2 |spc|spc| 0 | 8 |spc|spc| 1 | 5 |spc|
#         +---+---+---+---+---+---+---+---+---+---+---+---+---+
# Each match grows by one character (the inserted "0"); the
# leading space and the " 15 " tail are left untouched.
# With only one space between fields, neighboring matches would
# have to share that space, and matches under /g never overlap,
# so a single global pass would skip every second single digit.
# The doubled spaces let one s/ ([0-9]) / 0\1 /g handle them all.

# Pretty much done, just strip the spaces.
s/ //g
#   in:   "  02  08  15 "
#   out:  "020815"
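For anyone who would rather not keep a separate script file, the same substitutions can be passed inline. This is only a sketch: `wedge_date` is a hypothetical name, and it assumes a sed that accepts `-E` for extended regular expressions (GNU sed treats it like `-r`).

```shell
#!/bin/sh
# wedge_date is a hypothetical helper name; it applies the substitutions
# from foo.sed as inline -e expressions instead of reading a script file.
wedge_date() {
  # 1. wedge a space between the prefix and the date part, drop ".csv"
  # 2. cut everything up to the wedge
  # 3. turn each underscore into two spaces
  # 4. zero-pad single digits that have a space on both sides
  # 5. strip all spaces
  sed -E \
    -e 's/([0-9_]+)\.csv/ \1 /g' \
    -e 's/^.* ([^ ])/\1/' \
    -e 's/_/  /g' \
    -e 's/ ([0-9]) / 0\1 /g' \
    -e 's/ //g'
}

printf '%s\n' abc1_bbb_yyy_2_8_15.csv abd1_bba_yzy_11_8_16.csv | wedge_date
# prints:
# 020815
# 110816
```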

As others have pointed out, sed is not the most elegant tool for this job. With perl,

fn='abc1_bbb_yyy_2_8_15.csv abd1_bba_yzy_11_8_16.csv aby1_qba_yay_11_21_16.csv'
for x in $fn; do
  echo $x | perl -n -e 'printf("%02d%02d%02d\n",/(\d+)_(\d+)_(\d+)\./)'
done

If you really are constrained to use sed, then here's a way. The first regex prepends a zero to every digit that directly follows an underscore. The second looks for strings of digits followed by an underscore or dot, keeps only the last 2 digits of each, and drops the separator. The last extracts the final string of 6 digits, preceded by anything and followed only by non-digits.

for x in $fn; do
  echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
    -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
    -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
done

Result:

$ for x in $fn; do
>       echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
>         -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
>         -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
>     done
020815
110816
112116
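To see what each of the three expressions contributes, it can help to run them one at a time and print the intermediate strings. A small sketch, using a hypothetical helper `trace_steps` that is not part of the answer:

```shell
#!/bin/sh
# trace_steps is a hypothetical helper: it applies the three sed
# expressions one by one and prints the string after each step.
trace_steps() {
  # step 1: put a 0 in front of every digit that follows an underscore
  s1=$(printf '%s\n' "$1" | sed -e 's/_\([0-9]\)/_0\1/g')
  # step 2: keep the last two digits of each run ending in _ or ., drop the separator
  s2=$(printf '%s\n' "$s1" | sed -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g')
  # step 3: keep only the final run of six digits
  s3=$(printf '%s\n' "$s2" | sed -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/')
  printf '%s\n%s\n%s\n' "$s1" "$s2" "$s3"
}

trace_steps abc1_bbb_yyy_2_8_15.csv
# prints:
# abc1_bbb_yyy_02_08_015.csv
# abc1_bbb_yyy_020815csv
# 020815
```

Note that step 1 over-pads the two-digit year to `015`; step 2 is what trims it back to `15`.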

Try this:

REST=cat # Whatever the rest of your pipeline is...

( cat <<EOF
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
EOF
)\
| cut -d_ -f4-6 \
| cut -d. -f1 \
| sed -e 's/\([0-9][0-9]*\)/0\1/g' \
    -e 's/0\([0-9][0-9]\)/\1/g' \
    -e 's/_//g' \
| $REST
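The pipeline above is posted without commentary, so here is an annotated sketch of the same stages. It assumes, as the original does, that the date always sits in `_`-separated fields 4 to 6; `cut_date` is a hypothetical name.

```shell
#!/bin/sh
# cut_date is a hypothetical helper wrapping the cut + sed stages above.
# Reads file names on stdin and prints MMDDYY for each.
cut_date() {
  # cut -d_ -f4-6 : keep fields 4-6,       "2_8_15.csv"
  # cut -d. -f1   : drop the extension,    "2_8_15"
  # sed expr 1    : pad every number,      "02_08_015"
  # sed expr 2    : trim 0 + two digits,   "02_08_15"
  # sed expr 3    : remove the underscores, "020815"
  cut -d_ -f4-6 |
    cut -d. -f1 |
    sed -e 's/\([0-9][0-9]*\)/0\1/g' \
        -e 's/0\([0-9][0-9]\)/\1/g' \
        -e 's/_//g'
}

printf '%s\n' abc1_bbb_yyy_2_8_15.csv aby1_qba_yay_11_21_16.csv | cut_date
# prints:
# 020815
# 112116
```

Unlike the approaches that count from the end of the name, this one breaks if the prefix has more or fewer than three underscore-separated parts.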

Put the filenames into t.txt

abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv

Then

$ cat t.txt | perl -p -e 's/(?<=_)(\d)(?=_)/0$1/g' | perl -p -e 's/.*(\d\d)_(\d\d)_(\d\d)\.csv/$1$2$3/'
020815
110816
112116

This isn't exactly sed/awk/grep, because sed can't do lookarounds and I don't want to AWK right now, but it is regex, and *nixy.

[Edit: Okay, downvoters who don't like Perl: my approach was to prefix single digits with 0 first, then extract the double-digit pairs. sed makes that hard with no lookarounds or non-capturing groups, but here is a sed answer using @jgreve's idea of putting a wedge in first. This also includes output in YYYYMMDD format, assuming all years are 20xx:

#                  #wedge        #single n to 0n            #extract __mm__dd__yy                                   to 20yymmdd
cat t.txt | sed -e 's/_/__/g' -e 's/_\([0-9]\)_/_0\1_/g' -e 's/.*__\([0-9][0-9]\)__\([0-9][0-9]\)__\([0-9][0-9]\)\.csv/20\3\1\2/'

]
