I'm trying to write a command that finds the lines where a specific column in a csv file matches a pattern. I'm struggling with the pattern matching for that column.
Task:
Sample csv file:
col1 | col2 | col3 | col4 | col5date | col6 |
---|---|---|---|---|---|
abcd | asdd | 2022 | asdd | 7/4/22 | something |
abcd | asdd | 2022 | asdd | 10/9/22 | something |
abcd | asdd | 2022 | asdd | 12/12/20 | something |
abcd | asdd | 2020 | asdd | 9/1/19 | something |
abcd | asdd | 2020 | asdd | 9/1/22 | something |
abcd | asdd | 2021 | asdd | 9/22/19 | something |
abcd | asdd | 2021 | asdd | 2/16/22 | something |
abcd | asdd | 2021 | asdd | 6/16/22 | something |
Expected output after command: the first and last lines, since their dates are in June and July.
My awk command:
cat file | awk -F'|' '$5 ~ /(6|7|8)\/*\/22$/'
In the pattern "/(6|7|8)\/*\/22$/"
I'm trying to say
m/d/Y - m is either 6, 7 or 8
* - for day
22$ - for year and column ends
Like this, assuming the file is csv (commas) and not tsv (tabs) or even | (pipe) separated:
awk -F, '$5 ~ /^(6|7|8)\/.*\/22$/' file
But there's no matching line in your sample input
I would use something like
awk -F, '$5 ~ "^[6-8]/[^/]+/22$"'
so you don't have to escape '/' and can also reject malformed dates
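To illustrate the difference between the two patterns above, here is a minimal sketch. The four rows are made-up test data (two well-formed dates and two malformed ones), not part of the question's sample.

```shell
# Hypothetical test rows: two well-formed dates and two malformed ones.
rows='a,b,c,d,6/16/22,x
a,b,c,d,7/4/22,x
a,b,c,d,6//22,x
a,b,c,d,6/1/3/22,x'

# Loose day part: ".*" also accepts an empty day and extra slashes.
loose=$(printf '%s\n' "$rows" | awk -F, '$5 ~ /^(6|7|8)\/.*\/22$/ { print $5 }')

# Strict day part: "[^/]+" needs at least one character that is not a slash.
strict=$(printf '%s\n' "$rows" | awk -F, '$5 ~ "^[6-8]/[^/]+/22$" { print $5 }')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern lets all four rows through, while the strict one keeps only 6/16/22 and 7/4/22.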
This part of the pattern, \/*, repeats a forward slash 0+ times instead of matching the day.
You might narrow the match for the day part to digits (which still does not validate that the date is real), and you can omit the cat command.
awk -F'|' '$5 ~ /^[678]\/([1-9]|[12][0-9]|3[01])\/22$/' file
If you are using a comma as a separator, then use awk -F,
Output
abcd|asdd|2022|asdd|7/4/22|something
abcd|asdd|2021|asdd|6/16/22|something
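As a rough check of why the question's pattern misbehaves, the sketch below runs both patterns over a pipe-delimited copy of some sample rows. It assumes the fields have no padding spaces around the pipes, as in the output shown above.

```shell
# Pipe-delimited copy of some sample rows, without padding spaces (an assumption).
data='abcd|asdd|2022|asdd|7/4/22|something
abcd|asdd|2022|asdd|10/9/22|something
abcd|asdd|2021|asdd|2/16/22|something
abcd|asdd|2021|asdd|6/16/22|something'

# Question's pattern: unanchored, and "\/*" means "zero or more slashes",
# so it matches any field ending in 6/22, 7/22 or 8/22 (e.g. the tail of 2/16/22).
orig=$(printf '%s\n' "$data" | awk -F'|' '$5 ~ /(6|7|8)\/*\/22$/ { print $5 }')

# Anchored pattern with an explicit day part.
fixed=$(printf '%s\n' "$data" | awk -F'|' '$5 ~ /^[678]\/([1-9]|[12][0-9]|3[01])\/22$/ { print $5 }')

printf 'original pattern:\n%s\nanchored pattern:\n%s\n' "$orig" "$fixed"
```

The unanchored pattern picks up 2/16/22 and 6/16/22 (the day ends in 6) and misses 7/4/22, while the anchored one returns 7/4/22 and 6/16/22.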
Assuming the data format is correct, you can drop the format checks from the pattern.
$ awk -F, '$5~/^[6-8].*22$/' file
With your shown samples and attempts, please try the following awk code.
awk '$5~/^6\/([1-9]|1[0-9]|2[0-9]|30)\/[0-9]{2}$|^[7-8]\/([1-9]|1[0-9]|2[0-9]|3[0-1])\/[0-9]{2}$/' Input_file
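The command above uses awk's default whitespace splitting, so $5 is the date only if the file is whitespace-delimited. A sketch of the same idea for a comma-delimited file follows. The -F, delimiter and the 6/31/22 row are assumptions added for illustration, and {2} is written out as [0-9][0-9] because not every awk build supports interval expressions.

```shell
# Sample rows plus one hypothetical invalid date (June 31) to show the per-month day limit.
out=$(printf '%s\n' \
  'abcd,asdd,2022,asdd,7/4/22,something' \
  'abcd,asdd,2020,asdd,9/1/22,something' \
  'abcd,asdd,2021,asdd,6/31/22,something' \
  'abcd,asdd,2021,asdd,6/16/22,something' |
awk -F, '$5 ~ /^6\/([1-9]|[12][0-9]|30)\/[0-9][0-9]$|^[78]\/([1-9]|[12][0-9]|3[01])\/[0-9][0-9]$/')

printf '%s\n' "$out"
```

Only the 7/4/22 and 6/16/22 rows are printed. Note that this pattern accepts any two-digit year, so replace the trailing [0-9][0-9] with 22 if only 2022 is wanted.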
It's not clear (to me) how the input file is delimited so for demonstration purposes I'll assume comma delimited:
$ cat file
col1,col2,col3,col4,col5date,col6
abcd,asdd,2022,asdd,7/4/22,something
abcd,asdd,2022,asdd,10/9/22,something
abcd,asdd,2022,asdd,12/12/20,something
abcd,asdd,2020,asdd,9/1/19,something
abcd,asdd,2020,asdd,9/1/22,something
abcd,asdd,2021,asdd,9/22/19,something
abcd,asdd,2021,asdd,2/16/22,something
abcd,asdd,2021,asdd,6/16/22,something
One awk idea:
awk -F',' -v mon='6,7,8' -v year="22" '       # define input delimiter as comma; OP provides comma-delimited list of numerical months along with desired year
BEGIN  { split(mon,a,",")                     # split input variable "mon" on commas and place results into array a[]
         for (i in a)                         # loop through indices of array a[] and ...
             months[a[i]]                     # build new array where numerical months are the indices of the array months[]
       }
FNR==1 { print; next }                        # print header row then skip to next line of input
       { split($5,a,"/")                      # split 5th field on "/" delimiter and place results into array a[]
         if (a[1] in months && a[3] == year)  # if a[1] is an index in the months[] array and a[3] matches the desired year then ...
            print                             # print current line to stdout
       }
' file
This generates:
$ . ./dates.awk
col1,col2,col3,col4,col5date,col6
abcd,asdd,2022,asdd,7/4/22,something
abcd,asdd,2021,asdd,6/16/22,something
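Because the months and year are passed in as variables, the same script can be reused for other filters. The sketch below wraps a condensed version of it in a shell function. The name filter_dates and its argument order are hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper (not from the original answer): filter_dates MONTHS YEAR FILE
filter_dates() {
  awk -F',' -v mon="$1" -v year="$2" '
    BEGIN  { n = split(mon, a, ",")            # a[1..n] holds the wanted months
             for (i = 1; i <= n; i++) months[a[i]] }
    FNR==1 { print; next }                     # keep the header row
           { split($5, d, "/")                 # d[1]=month, d[2]=day, d[3]=year
             if (d[1] in months && d[3] == year) print }
  ' "$3"
}

# Write the comma-delimited sample to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
col1,col2,col3,col4,col5date,col6
abcd,asdd,2022,asdd,7/4/22,something
abcd,asdd,2022,asdd,10/9/22,something
abcd,asdd,2022,asdd,12/12/20,something
abcd,asdd,2020,asdd,9/1/19,something
abcd,asdd,2020,asdd,9/1/22,something
abcd,asdd,2021,asdd,9/22/19,something
abcd,asdd,2021,asdd,2/16/22,something
abcd,asdd,2021,asdd,6/16/22,something
EOF

summer=$(filter_dates '6,7,8' 22 "$file")   # June to August 2022
sept19=$(filter_dates '9' 19 "$file")       # September 2019
rm -f "$file"

printf '%s\n\n%s\n' "$summer" "$sept19"
```

The first call reproduces the output above. The second prints the header followed by the 9/1/19 and 9/22/19 rows.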
If you don't have any other dates in each row:
gawk -b '1 < NF' FS='[,|][6-8][/][^/]+[/]22[,|]'
If you're VERY certain there are no erroneously formatted dates in the input, then even simpler:
mawk '/[,|][6-8][^|]+22[,|]/'
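The first command works by using the date pattern itself as the field separator: a row containing a matching date is split into two fields, so 1 < NF is true only for those rows. A small sketch that prints NF in front of each row makes this visible. Plain awk is used here instead of gawk -b, on the assumption that any POSIX awk accepts a regex FS.

```shell
# Print the field count in front of each row; the date pattern is the field separator.
counts=$(printf '%s\n' \
  'col1,col2,col3,col4,col5date,col6' \
  'abcd,asdd,2022,asdd,7/4/22,something' \
  'abcd,asdd,2020,asdd,9/1/22,something' \
  'abcd,asdd,2021,asdd,6/16/22,something' |
awk '{ print NF ": " $0 }' FS='[,|][6-8][/][^/]+[/]22[,|]')

printf '%s\n' "$counts"
```

Rows with a June to August 2022 date report 2 fields and all others report 1. Since the separator needs a delimiter on both sides of the date, this trick does not apply when the date is the first or last column.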
I usually prefer using square brackets to escape regex special characters, as it delineates boundaries a lot more clearly visually than the \\\\\\whiskey\\\\\\tango\\\\\\foxtrot\\\\ chaos in many regexes, with the added bonus of not having to worry about how many backslashes get "eaten" by each extra layer the pattern passes through before it is actually executed by awk.
E.g. for the forward slash "/":
For command line assignments, mawk is okay with typing \/, while gawk/nawk need a pair, \\/. "[/]" is friendly to all awk variants when it's in double-quoted string form.
For regex in the open /…/ form, mawk/gawk are okay with /…[/]…/ while nawk insists on an extra backslash, /…[\/]…/, which somewhat defeats the readability rationale for opting for […] over \/\/\\+\\/\\\]\\[\\*\\……
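As a sketch of the bracket style applied to this question, the pattern below is passed as a command line assignment and contains no backslashes at all, so there is nothing for any layer to strip. The variable name re is arbitrary, and the comma delimiter is an assumption.

```shell
# Bracket-escaped pattern: no backslashes, so -v escape processing leaves it untouched.
re='^[6-8][/][^/]+[/]22$'

out=$(printf '%s\n' \
  'abcd,asdd,2022,asdd,7/4/22,something' \
  'abcd,asdd,2022,asdd,10/9/22,something' \
  'abcd,asdd,2021,asdd,2/16/22,something' \
  'abcd,asdd,2021,asdd,6/16/22,something' |
awk -F, -v re="$re" '$5 ~ re')

printf '%s\n' "$out"
```

This prints the 7/4/22 and 6/16/22 rows.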