I want to parse CSV records like the one below with awk or gawk.
The fields are separated by commas, but the last field ($6) is special because it really consists of subfields. These subfields are separated by # (or, to be precise, by ". # "). This in itself is not a problem: I can use awk -F'(,)|(. # )' to set alternative field separators.
However, there are stray commas in this last field as well that need to be ignored.
Is there a way to solve this with awk, perhaps using FPAT?
Sample record:
"http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."
Using the FPAT feature of GNU awk (gawk), you may be able to do this. We set FPAT to match either a double-quoted field or a run of non-comma characters. Finally, we split the last field using the regex pattern /\. # /.
s='"http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."'
awk -v FPAT='"[^"]*"|[^,]+' '{
    # print every field except the last one
    for (i = 1; i < NF; ++i)
        print i, $i
    # split the last field on /\. # / once, then print every token
    n = split($NF, a, /\. # /)
    for (j = 1; j <= n; ++j)
        print i + j - 1, a[j]
}' <<< "$s"
1 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab"
2 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002"
3 "EU:C:1985:443"
4 "61984CJ0239"
5 "Gerlach"
6 "Judgment of the Court (Third Chamber) of 24 October 1985
7 Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken
8 Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands
9 Article 41 ECSC - Anti-dumping duties
10 Case 239/84."
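If gawk is not available, a portable variant without FPAT may work as well. The following is only a sketch: it assumes that every field is wrapped in double quotes and that the sequence "," never occurs inside a field. The function name parse_record and the abbreviated record are made up for illustration.

```shell
# parse_record is a hypothetical helper name; it reads records on stdin.
parse_record() {
    awk -F'","' '{
        sub(/^"/, "", $1)     # drop the opening quote of the first field
        sub(/"$/, "", $NF)    # drop the closing quote of the last field
        # print every field except the last one
        for (i = 1; i < NF; ++i)
            print i, $i
        # split the last field into its subfields and print each one
        n = split($NF, a, /\. # /)
        for (j = 1; j <= n; ++j)
            print i + j - 1, a[j]
    }'
}

# abbreviated version of the sample record (assumption: all fields quoted)
printf '%s\n' '"EU:C:1985:443","Gerlach","Judgment of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie. # Case 239/84."' | parse_record
```

Because the separator here is the three-character sequence ",", a lone comma inside the last field is left alone, and the surrounding quotes are stripped from the output as a side effect. The last subfield keeps its final period, since that period is not followed by " # ".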