
Extract filename before date in a Bash shell script

I am trying to extract part of a filename: everything before the date and suffix. I am not sure of the best way to do it in a Bash script. Regex?

The names are part of the filename. I am trying to store the extracted part in a shell script variable. The prefixes will not contain unusual characters, and the suffix will always be the same. The files are stored in a directory, and I will use a loop to extract the portion of the filename for each file.

Expected input files:

EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out

Expected Extract:

EXAMPLE_FILE
EXAMPLE_FILE_2

Attempt:

  filename=$(basename "$file")
  folder=sed '^s/_[^_]*$//)' $filename
  echo 'Filename:' $filename
  echo 'Foldername:' $folder

$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$


$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$

There is no need for a useless use of cat, or for expensive forks and pipes. The shell can trim strings just fine:

$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo "${file%%_????-??-??.out}"
EXAMPLE_FILE_2

Read all about how to use the %%, %, ## and # operators (parameter expansion) in your friendly shell manual.
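
For quick reference, here is a small sketch of what the four operators do to the second example file name. The variable `file` is the one from the answer above; the other variable names are made up for illustration. With a fixed-length pattern such as `_????-??-??.out`, `%` and `%%` give the same result, so the difference only shows up with patterns like `_*`.

```shell
#!/usr/bin/env bash
file=EXAMPLE_FILE_2_2017-10-12.out

# %  removes the shortest match of the pattern from the end
short_suffix=${file%_*}      # EXAMPLE_FILE_2
# %% removes the longest match of the pattern from the end
long_suffix=${file%%_*}      # EXAMPLE
# #  removes the shortest match of the pattern from the start
short_prefix=${file#*_}      # FILE_2_2017-10-12.out
# ## removes the longest match of the pattern from the start
long_prefix=${file##*_}      # 2017-10-12.out

printf '%s\n' "$short_suffix" "$long_suffix" "$short_prefix" "$long_prefix"
```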

Code

^\w+(?=_)

Results

Input

EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out

Output

EXAMPLE_FILE
EXAMPLE_FILE_2

Explanation

  • ^ Assert position at start of line
  • \w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
  • (?=_) Positive lookahead ensuring what follows is an underscore _ character
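
Note that the lookahead (?=_) is a PCRE feature, while Bash's =~ operator uses POSIX extended regular expressions, which have no lookahead. A rough Bash equivalent, as a sketch only, is to capture the prefix in a group and match the underscore outside the group. The function name `prefix_before_underscore` is hypothetical, and `[[:alnum:]_]` stands in for \w.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the longest leading run of word characters
# that is followed by an underscore, similar to what ^\w+(?=_) matches.
prefix_before_underscore() {
    local name=$1
    if [[ $name =~ ^([[:alnum:]_]+)_ ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1   # no underscore-terminated prefix found
    fi
}

prefix_before_underscore EXAMPLE_FILE_2017-09-12.out
prefix_before_underscore EXAMPLE_FILE_2_2017-10-12.out
```

The match stops at the first `-` because it is not a word character, so the last underscore before the date is the one that ends the captured prefix.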

Simply with sed:

sed 's/_[^_]*$//' file

The output:

EXAMPLE_FILE
EXAMPLE_FILE_2
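
To store the result in a shell variable, as the question asks, the sed call can be wrapped in command substitution. A minimal sketch, with `file` set by hand here rather than by a loop; the path is purely illustrative and the variable names follow the question's attempt:

```shell
#!/usr/bin/env bash
file=/some/dir/EXAMPLE_FILE_2_2017-10-12.out   # illustrative path only

filename=$(basename "$file")
# Feed the name to sed on stdin and capture what it prints
folder=$(printf '%s\n' "$filename" | sed 's/_[^_]*$//')

echo "Filename: $filename"
echo "Foldername: $folder"
```

This still forks sed once per file, which is fine for a handful of files.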

----------

To iterate over the files with the .out extension, a Bash solution:

for f in *.out; do echo "${f%_*}"; done

Bash itself has regex capability so you do not need to run a utility. Example:

for fn in *.out; do
    [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
    cap="${BASH_REMATCH[1]}"
    printf "%s => %s\n" "$fn" "$cap"
done

With the example files, output is:

EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2

Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.

Of course, in real use you would want to test for a successful match:

for fn in *.out; do
    if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
        cap="${BASH_REMATCH[1]}"
        printf "%s => %s\n" "$fn" "$cap"
    else
        echo "$fn no match"
    fi
done

As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:

for fn in *.out; do
    cap="${fn%_*}"
    printf "%s => %s\n" "$fn" "$cap"
done

And then test $cap against $fn. If they are equal, the parameter expansion did not trim anything, because the file name contains no _.
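
That comparison could look something like the sketch below. The helper name `trim_after_last_underscore` and the sample name README.out are made up for illustration; the loop uses literal names instead of a glob so it can be run anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the part before the last underscore,
# or fails if the name contains no underscore at all.
trim_after_last_underscore() {
    local fn=$1
    local cap=${fn%_*}
    if [[ $cap == "$fn" ]]; then
        return 1          # nothing was trimmed
    fi
    printf '%s\n' "$cap"
}

for fn in EXAMPLE_FILE_2017-09-12.out README.out; do
    if cap=$(trim_after_last_underscore "$fn"); then
        printf '%s => %s\n' "$fn" "$cap"
    else
        printf '%s no match\n' "$fn"
    fi
done
```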

The regex additionally checks that a date-like string (\d\d\d\d-\d\d-\d\d, written with [[:digit:]] in the Bash code above) follows the _. Which one you need is up to you.

awk -F_ 'NF-=1' OFS=_ file

EXAMPLE_FILE
EXAMPLE_FILE_2

Could you please try this awk solution too, which handles all the .out files. Note that it was written and tested with GNU awk.

awk  --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out

My awk version is old, so I am using --re-interval; if you have a recent version of awk, you may not need it.

Explanation, with the solution in non-one-liner form:

awk --re-interval '  ##--re-interval enables {n} interval expressions in my old GNU awk; a newer version may not need it.
FNR==1{              ##Run this block only for the first line of each input file.
  if(val){           ##If val holds the name of the previous input file,
      close(val)     ##close that file so that too many files are not left open.
};
  split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}"); ##Split the current file name on the _YYYY-MM-DD part, so that array[1] holds everything before the date.
  print array[1];    ##Print the part of the file name before the date.
  val=FILENAME;      ##Remember the current file name so that it can be closed later.
  nextfile           ##Skip the remaining lines of the current file and move on to the next file.
}
' *.out              ##Process all the *.out files.
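
Since only the names are needed and not the file contents, one possible variant is to feed the names to awk on standard input, so no file has to be opened. This is a sketch rather than the answer's method: the function name `strip_date_suffix` is hypothetical, the digit classes are spelled out so that no interval-expression support is required, and it removes the date together with the .out suffix instead of splitting on the date.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads file names on stdin, one per line, and
# prints each with the _YYYY-MM-DD.out tail removed.
strip_date_suffix() {
    awk '{
        # Spelled-out digit classes instead of {4} and {2} intervals
        sub(/_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\.out$/, "")
        print
    }'
}

printf '%s\n' EXAMPLE_FILE_2017-09-12.out EXAMPLE_FILE_2_2017-10-12.out |
    strip_date_suffix
```

In a directory of real files, the first command would be printf '%s\n' *.out instead of the literal names.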
