How do you parse comma-separated values (CSV) with awk?

Question

I am trying to write an awk script to convert a CSV formatted spreadsheet into XML for Bugzilla bugs. The format of the input CSV is as follows (created from an XLS spreadsheet and saved as CSV):

tag_1,tag_2,...,tag_N
value1_1,value1_2,...,value1_N
value2_1,value2_2,...,value2_N
valueM_1,valueM_2,...,valueM_N

The header row gives the names of the XML tags. The above file converted to XML should look as follows:

<element>
    <tag_1>value1_1</tag_1>
    <tag_2>value1_2</tag_2>
    ...
    <tag_N>value1_N</tag_N>
</element>
<element>
    <tag_1>value2_1</tag_1>
    <tag_2>value2_2</tag_2>
    ...
    <tag_N>value2_N</tag_N>
</element>
...

The awk script I have to accomplish this follows:

BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"somebody@mozilla.com\" exporter=\"somebody.else@mozilla.com\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

The actual CSV file is:

cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,assigned_to,bug_status,cf_word,cf_caslte
ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,

The actual output is:

$ awk -f csvtobugs.awk bugs.csv

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="somebody@mozilla.com" exporter="somebody.else@mozilla.com">
   <bug>
      <cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>ABCD,A-BAR-0032,A</cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>
      <,assigned_to,bug_status,cf_word,cf_caslte>NICE</,assigned_to,bug_status,cf_word,cf_caslte>
      <>DESCRIPTION</>
      <>-</>
      <>help</>
      <>me,pretty,Pepperoni,,,NEW,,</>
   </bug>
   <bug>
   </bug>
</bugzilla>

Clearly, not the intended result (I admit, I copy-pasted this script from this forum: http://www.unix.com/shell-programming-scripting/21404-csv-xml.html). The problem is that it's been SOOOOO long since I've looked at awk scripts and I have NO IDEA what the syntax means.

Answer 1

You need to set FS = "," in the BEGIN rule to use comma as your field separator; the code as you show it should work if the field separator were a tab (and no value contained a space, since awk's default FS splits on any run of blanks), which is a different (also popular) convention in files that are often still called "CSV" even when commas aren't used ;-).

Answer 2

Use a tool that you do know :)

That awk script does not look like it deals with " and other CSV oddities. (I think it just splits on tabs; as the other answers note, it needs to be changed to split on ",".) Python, Perl, .NET, etc. have objects that fully deal with CSV and XML, and you could probably write the solution in as few characters as the awk script and, MORE importantly, understand it.
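
As a rough illustration of that suggestion, here is a minimal Python sketch using the standard csv and xml.etree.ElementTree modules. The function name csv_to_bugzilla_xml is made up for this example, the <bugzilla> attributes from the question are left out, and stripping spaces from header names is an assumption about what the asker wants:

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_bugzilla_xml(csv_text):
    """Convert CSV text (header row = tag names) to a <bugzilla> XML string.

    Hypothetical helper for illustration; not part of any library.
    """
    reader = csv.reader(io.StringIO(csv_text))
    # Header cells become tag names. Stripping is an assumption: it turns
    # a header like "cf_dumpling " into a usable tag name.
    tags = [name.strip() for name in next(reader)]
    root = ET.Element("bugzilla")  # version/urlbase attributes omitted here
    for row in reader:
        if not row:
            continue  # skip blank lines
        bug = ET.SubElement(root, "bug")
        for tag, value in zip(tags, row):
            ET.SubElement(bug, tag).text = value
    return ET.tostring(root, encoding="unicode")


# Sample data taken from the question.
sample = (
    "cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,"
    "assigned_to,bug_status,cf_word,cf_caslte\n"
    "ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,\n"
)
xml_text = csv_to_bugzilla_xml(sample)
print(xml_text)
```

Because the csv module does the field splitting, quoted fields and embedded commas should not need any special handling here, and ElementTree escapes characters such as & and < in the values.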

Answer 3

Remember that splitting by comma in a CSV is fine until you get the following scenario:

1997,Ford,E350,"Super, luxurious truck"

In which case it will split "Super, luxurious truck" into two items, which is incorrect. I would recommend using the CSV libs in another language, as 'Mark' states in the above post.
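
To make the difference concrete, here is a small Python sketch comparing a plain comma split with the standard csv module on that exact line. The variable names are just for illustration:

```python
import csv

line = '1997,Ford,E350,"Super, luxurious truck"'

# Naive splitting treats the comma inside the quotes as a separator,
# so the last field is broken in two and keeps its quote characters.
naive_fields = line.split(",")

# csv.reader understands the quoting and keeps the field intact.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split yields five pieces, while csv.reader yields the four fields the line actually contains.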

Answer 4

I was able to fix it by changing the FS (field separator):

BEGIN {
    FS=",";
    OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"somebody@mozilla.com\" exporter=\"somebody.else@mozilla.com\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

Output:

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="somebody@mozilla.com" exporter="somebody.else@mozilla.com">
   <bug>
      <cf_foo>ABCD</cf_foo>
      <cf_bar>A-BAR-0032</cf_bar>
      <short_desc>A NICE DESCRIPTION - help me</short_desc>
      <cf_zebra>pretty</cf_zebra>
      <cf_pizza>Pepperoni</cf_pizza>
      <cf_dumpling ></cf_dumpling >
      <assigned_to></assigned_to>
      <bug_status>NEW</bug_status>
      <cf_word></cf_word>
      <cf_caslte></cf_caslte>
   </bug>
</bugzilla>

Answer 5

You can use various tricks like setting FS. More tricks can be found at the Awk news group. There are also parsers like mine: http://lorance.freeshell.org/csv/

Answer 6

You might try my csvprintf instead. It can convert CSV to XML, which you can then style with XSLT as desired.

6 answers

Answer 1: score 4, accepted, 2009-09-18 17:02:15
Answer 2: score 1, 2009-09-18 17:02:26
Answer 3: score 1, 2009-10-04 17:53:56
Answer 4: score 0, 2009-09-18 17:02:17
Answer 5: score 0, 2009-09-26 18:03:57
Answer 6: score 0, 2011-07-20 19:58:24
