I want to access and manipulate the fourth column of a CSV file. In particular, I want to exclude the lines that don't meet a specific requirement (exclude the lines that do not have a 3-character country code).
My data set:
Luxembourg,LUX,2017,9294689.12
Aruba,ABW,2017,927865.82
Nepal,NPL,2017,9028196.37
Bangladesh,BGD,2017,88057460.51
Costa Rica,CRI,2017,8695008.05
Chile,CHL,2017,84603249.72
Cook Islands,COK,2017,82045.41
World,OWIDWRL,1755,9361520
India,INDIA,1763,0
Asia and Pacific (other),,2017,5071156099
World,OWID_WRL,1752,9354192
Middle East,,1751,0
International transport,,1751,0
India,IND,1751,0
Europe (other),,1751,0
China,CHN,1751,0
Asia and Pacific (other),,1751,0
Americas (other),,1751,0
Africa,,1751,0
Thanks in advance.
I have already sorted my data file by year, but I don't know how to access the 4th column or how to use awk or sed for this.
Expected Data set:
Luxembourg,LUX,2017,9294689.12
Aruba,ABW,2017,927865.82
Nepal,NPL,2017,9028196.37
Bangladesh,BGD,2017,88057460.51
Costa Rica,CRI,2017,8695008.05
Chile,CHL,2017,84603249.72
Cook Islands,COK,2017,82045.41
If I understood your question correctly, could you please try the following. The code checks the 2nd field of each line and prints the line only if that field consists of exactly 3 letters; every other line is NOT printed.
awk 'BEGIN{FS=","} $2~/^[a-zA-Z]{3}$/' Input_file
In case you have an OLD awk where the interval expression {3} doesn't work, then try:
awk 'BEGIN{FS=","} $2~/^[a-zA-Z][a-zA-Z][a-zA-Z]$/' Input_file
Explanation: Adding a detailed explanation of the above code here.
awk ' ##Start of the awk program.
BEGIN{ ##Start of the BEGIN section, which is executed before Input_file is read.
FS="," ##Set the field separator to a comma.
} ##End of the BEGIN section.
$2~/^[a-zA-Z]{3}$/ ##Condition: the 2nd field consists of exactly 3 letters from its start to its end.
##awk works on the method of condition then action; if the condition is TRUE, the action is performed.
##In this case no action is given, so by default the line is printed.
' Input_file ##Name of the input file.
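To make the behaviour concrete, here is a small sketch that runs the filter on a few rows taken from the question and then prints only the 4th column of the rows that survive. The temporary file and the variable names (tmp, filtered, fourth) are just assumptions for illustration; it uses the version without {3} so it should also work with older awk.

```shell
# Hypothetical sample file holding a few rows from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Luxembourg,LUX,2017,9294689.12
World,OWIDWRL,1755,9361520
India,INDIA,1763,0
Asia and Pacific (other),,2017,5071156099
India,IND,1751,0
China,CHN,1751,0
EOF

# Keep rows whose 2nd field is exactly three letters.
filtered=$(awk -F, '$2 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z]$/' "$tmp")
printf '%s\n' "$filtered"

# Same filter, but print only the 4th column of the surviving rows.
fourth=$(awk -F, '$2 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z]$/ { print $4 }' "$tmp")
printf '%s\n' "$fourth"

rm -f "$tmp"
```

Note that the India,IND and China,CHN rows from 1751 pass the filter too, because only the country code is checked. If only the 2017 rows are wanted, as the expected data set might suggest, a condition such as $3 == 2017 could be added with &&, but that is an assumption and not something the question states.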
The below would output only lines with a 3 letter value in the second field:
awk --re-interval -F, 'tolower($2) ~ /^[a-z]{3}$/' country.txt
Checking the length of the field is also possible, but the regex ensures that the field contains only letters, and exactly 3 of them.
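--re-interval allows you to use interval expressions such as {3} in regular expressions. It is a gawk option: older versions of gawk treat the braces as literal characters without it, while recent versions enable interval expressions by default.
-F, tells awk that the input delimiter is a comma.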
print is the default action in awk, so tolower($2) ~ /^[a-z]{3}$/ is a shorthand way of saying tolower($2) ~ /^[a-z]{3}$/ {print}
tolower($2) lowercases the value of the second field, and ~ is the regex comparison operator, which we use to check for the beginning of the string ^, then [a-z] repeated {3} times, and then the end of the string $.
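As a rough illustration of the difference between checking the length and checking for letters, the sketch below runs both on three rows. The row with the code A1B is made up for this example and is not part of the original data; the variable names are assumptions as well.

```shell
# Hypothetical rows: the second one has a 3-character code that is not all letters.
rows='Chile,CHL,2017,84603249.72
Example,A1B,2017,1
World,OWID_WRL,1752,9354192'

# Length check: accepts any 3-character value in field 2.
by_length=$(printf '%s\n' "$rows" | awk -F, 'length($2) == 3')

# Letter check: accepts only three alphabetic characters in field 2.
by_letters=$(printf '%s\n' "$rows" | awk -F, 'tolower($2) ~ /^[a-z][a-z][a-z]$/')

printf 'length check:\n%s\n' "$by_length"
printf 'letter check:\n%s\n' "$by_letters"
```

The length check lets the A1B row through, while the letter check keeps only the CHL row, so which one to use depends on how clean the country code column is expected to be.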