
AWK column matching pattern

I'm trying to write a command that finds the lines of a csv file in which a specific column matches a pattern. I'm struggling with the pattern matching for that column.

Task:

Sample csv file:

col1 col2 col3 col4 col5date col6
abcd asdd 2022 asdd 7/4/22 something
abcd asdd 2022 asdd 10/9/22 something
abcd asdd 2022 asdd 12/12/20 something
abcd asdd 2020 asdd 9/1/19 something
abcd asdd 2020 asdd 9/1/22 something
abcd asdd 2021 asdd 9/22/19 something
abcd asdd 2021 asdd 2/16/22 something
abcd asdd 2021 asdd 6/16/22 something

Expected output after the command: the first and last lines, since their dates are in June and July.

My awk command:

cat file | awk -F'|' '$5 ~ /(6|7|8)\/*\/22$/'

In the pattern "/(6|7|8)\/*\/22$/" I'm trying to say
m/d/Y - m is either 6, 7 or 8
* - for day
22$ - for year and column ends

Like this, assuming the file is csv (commas) and not tsv (tabs) or even | (pipe) separated file:

awk -F, '$5 ~ /^(6|7|8)\/.*\/22$/' file

But there's no matching line in your sample input

I would use something like

awk -F, '$5 ~ "^[6-8]/[^/]+/22$"'

so you don't have to escape '/' and can also reject malformed dates.
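To see what "reject malformed dates" means in practice, here is a minimal sketch. The three rows are made up for illustration and assume a comma-delimited file; the second row carries a deliberately malformed date with an extra slash.

```shell
# Hypothetical comma-delimited rows; the 2nd one has a malformed date (7/4/5/22).
rows='a,b,2022,c,7/4/22,x
a,b,2022,c,7/4/5/22,x
a,b,2022,c,10/9/22,x'

# String regex with [^/]+ for the day: the day part cannot contain a slash.
strict=$(printf '%s\n' "$rows" | awk -F, '$5 ~ "^[6-8]/[^/]+/22$"')

# Same idea with .* for the day: .* also swallows extra slashes.
loose=$(printf '%s\n' "$rows" | awk -F, '$5 ~ /^(6|7|8)\/.*\/22$/')

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
```

The strict version should print only the 7/4/22 row, while the loose one also lets the 7/4/5/22 row through.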

The \/* part of the pattern matches a forward slash repeated zero or more times; it does not match the day.
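A quick way to check this is to feed a few bare date strings to the pattern from the question. The inputs below are made up for illustration; no field separator is involved, so each whole line is tested.

```shell
# Test the question's pattern against three hypothetical date strings.
matched=$(printf '%s\n' '7/4/22' '7/22' '16//22' |
  awk '/(6|7|8)\/*\/22$/')

printf '%s\n' "$matched"
```

This should print 7/22 and 16//22 but not 7/4/22: after the month digit the pattern only allows slashes before the final /22, so a date with a day in the middle cannot match. The missing ^ anchor is also why the 6 inside 16 is accepted as a month.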

You might narrow the day part to digits (which still does not validate the date), and you can omit the cat command.

awk -F'|' '$5 ~ /^[678]\/([1-9]|[12][0-9]|3[01])\/22$/' file

If you are using a comma as a separator, then use awk -F,

Output

abcd|asdd|2022|asdd|7/4/22|something
abcd|asdd|2021|asdd|6/16/22|something

Assuming the data format is correct, you can drop the format checks from the pattern.

$ awk -F, '$5~/^[6-8].*22$/' file

With your shown samples and attempts please try following awk code.

awk '$5~/^6\/([1-9]|1[0-9]|2[0-9]|30)\/[0-9]{2}$|^[7-8]\/([1-9]|1[0-9]|2[0-9]|3[0-1])\/[0-9]{2}$/'  Input_file

It's not clear (to me) how the input file is delimited so for demonstration purposes I'll assume comma delimited:

$ cat file
col1,col2,col3,col4,col5date,col6
abcd,asdd,2022,asdd,7/4/22,something
abcd,asdd,2022,asdd,10/9/22,something
abcd,asdd,2022,asdd,12/12/20,something
abcd,asdd,2020,asdd,9/1/19,something
abcd,asdd,2020,asdd,9/1/22,something
abcd,asdd,2021,asdd,9/22/19,something
abcd,asdd,2021,asdd,2/16/22,something
abcd,asdd,2021,asdd,6/16/22,something

One awk idea:

awk -F',' -v mon='6,7,8' -v year="22" '        # define input delimiter as comma; OP provides comma-delimited list of numerical months along with desired year

BEGIN  { split(mon,a,",")                      # split input variable "mon" on commas and place results into array a[]
         for (i in a)                          # loop through indices of array a[] and ...
             months[a[i]]                      # build new array where numerical months are the indices of the array months[]
       }
FNR==1 { print; next }                         # print header row then skip to next line of input
       { split($5,a,"/")                       # split 5th field on "/" delimiter and place results into array a[]
         if (a[1] in months && a[3] == year)   # if a[1] is an index in the months[] array and a[3] matches the desired year then ...
            print                              # print current line to stdout
       }
' file

This generates:

$ . ./dates.awk
col1,col2,col3,col4,col5date,col6
abcd,asdd,2022,asdd,7/4/22,something
abcd,asdd,2021,asdd,6/16/22,something

If you don't have any other dates in each row:

 gawk -b '1 < NF' FS='[,|][6-8][/][^/]+[/]22[,|]'

If you're VERY certain there are no erroneously formatted dates in the input, then even simpler:

 mawk '/[,|][6-8][^|]+22[,|]/'

I usually prefer square brackets for escaping regex special characters, since they mark boundaries far more clearly than the \\\\\\whiskey\\\\\\tango\\\\\\foxtrot\\\\ chaos seen in many regexes, with the added bonus of not having to worry about how many backslashes get "eaten" by each extra layer the pattern passes through before awk actually executes it.

For example, a forward slash "/" in a command-line assignment:

  • mawk is okay with typing \/ , while

  • gawk/nawk need a pair, \\/ ;

  • "[/]" works in all awk variants when it is in double-quoted string form.

For a regex in the open /…/ form, mawk and gawk are okay with /…[/]…/ , while nawk insists on an extra backslash, /…[\/]…/ , which somewhat defeats the readability rationale for opting for […] over \/\/\\+\\/\\\]\\[\\*\\……
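As a small sketch of the bracket style applied to the question's date column, the pattern can be passed as a string through -v. The variable name re and the two sample rows are made up for illustration, and a comma-delimited file is assumed. Because the string contains no backslashes, there is nothing for the assignment's escape processing to consume.

```shell
# Pass the regex as a string variable; every "/" is written as [/] so no backslashes are needed.
result=$(printf '%s\n' 'a,b,2021,c,6/16/22,x' 'a,b,2020,c,9/1/22,x' |
  awk -F, -v re='^[6-8][/][^/]+[/]22$' '$5 ~ re')

printf '%s\n' "$result"
```

Only the 6/16/22 row should be printed, since 9 is outside the [6-8] month range.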
