Print columns of a CSV file enclosed in double quotes using awk or sed

Question

I'm working on a CSV file like the one below. It is comma delimited and each cell is enclosed in double quotes, but some cells contain double quotes and/or commas inside the enclosing quotes. The actual file contains around 300 columns and 200,000 rows.

"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"

I need to remove some useless columns and merge the last few columns, separating their values with </br> instead of ",". I also need to move the second column to the end. Everything within the cells should stay the same, with double quotes and commas as in the original file. Below is an example of the output that I need.

"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"

In this example I want to remove column 3 and merge columns 5, 6, and 7.

Below is the code that I tried to use, but it treats a double quote and/or comma inside a cell as the end of that cell, which is different from what I expected.

awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv

sed -i 's@"</br>"@</br>@g' inputfile.csv

sed is used to remove the closing and opening double quotes between the merged cells, so that they are joined by a bare </br>.

In the output file that I'm getting right now, if the previous field contains a double quote, awk treats it as the beginning of a cell, so the following values are often shifted by one column.

The other code that I have used treats every comma as the beginning of a cell, so it doesn't work either.

awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7",$2}' inputfile.csv

sed -i 's@"</br>"@</br>@g' inputfile.csv

Any help is greatly appreciated. Thanks!

Answer 1

CSV is a loose format. There may be subtle variations in formatting. Your particular format may or may not be expressible with a regular grammar/regular expression. (See this question for a discussion about this.) Even if your particular formatting can be expressed with regular expressions, it may be easier to just whip out a parser from an existing library.
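As a rough illustration of why a parser is easier than splitting by hand, the sketch below compares a naive split on commas with Python's csv.reader. The sample line is made up for this example (it is not from the question's file) and is well-formed: every cell is quoted and the inner quote is doubled.

```python
import csv
import io

# Hypothetical well-formed line: an embedded comma in cell 2 and an
# escaped (doubled) quote in cell 3.
line = '"abc","this, that","18"" inch TV"\n'

# Naive splitting treats the embedded comma as a separator.
naive = line.strip().split(",")

# csv.reader applies the quoting rules and unescapes the doubled quote.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split reports four pieces with the quotes still attached, while the reader returns the three intended cells.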

It is not a bash/awk/sed solution as you may have wanted or needed, but Python has a csv module for parsing CSV files. There are a number of options to tweak the formatting. Try something like this:

#!/usr/bin/env python3

import csv

# newline='' lets the csv module handle line endings itself
with open('infile.csv', 'r', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]

        # Copy second field to the end
        row.append(row[1])

        # Remove second and third fields
        del row[1:3]

        # Write manipulated row
        outwriter.writerow(row)

Note that in Python, indexes start at 0 (e.g. row[1] is the second field). The first index of a slice is inclusive, the last is exclusive (row[1:3] is row[1] and row[2] only). Your formatting seems to require quotes around every field, hence the quoting=csv.QUOTE_ALL. There are more options under "Dialects and Formatting Parameters" in the csv module documentation.
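A small sketch of those two points, using made-up values: the slice bounds, and the difference between the writer's default quoting (csv.QUOTE_MINIMAL) and csv.QUOTE_ALL. The render helper is a hypothetical name introduced only for this illustration.

```python
import csv
import io

row = ["a", "b", "c", "d", "e", "f", "g"]

# Indexes start at 0, so row[1] is the second field.
second = row[1]
# The end of a slice is exclusive: row[1:3] holds row[1] and row[2] only.
middle = row[1:3]

def render(fields, quoting):
    """Write one row to a string using the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(fields)
    return buf.getvalue()

sample = ["abc", '18" inch TV', "x, y"]
minimal = render(sample, csv.QUOTE_MINIMAL)  # quotes only where needed
quoted = render(sample, csv.QUOTE_ALL)       # quotes every field

print(second, middle)
print(minimal, end="")
print(quoted, end="")
```

With the default mode the plain cell abc is written without quotes, which is why the scripts here pass QUOTE_ALL explicitly.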

The above code produces the following output:

"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"

There are two issues with this:

It doesn't treat the first row any differently, so the headers of columns 5, 6, and 7 are merged like the other rows.
Your input CSV contains "some other, "cde" here" (third row, fourth column) with unescaped quotes around the cde. There is another case of this on line two, but it was removed since it is in column 3. The parser cannot tell these inner quotes from the closing quote of the cell, so the result contains incorrect quotes.
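The second issue can be reproduced in isolation. The following sketch feeds a shortened version of the problematic row to csv.reader twice: once with the default lenient settings, and once with strict=True, which makes the reader raise csv.Error on such input instead of guessing.

```python
import csv
import io

# Shortened copy of the problematic row: unescaped quotes around cde.
bad = '"cde","some other, "cde" here","cde"\n'

# Lenient (default) parsing: the quote before cde ends the quoted part,
# and the rest of the cell is read as plain text, later quotes included.
lenient = next(csv.reader(io.StringIO(bad)))

# Strict parsing refuses the row.
try:
    next(csv.reader(io.StringIO(bad), strict=True))
    strict_failed = False
except csv.Error:
    strict_failed = True

print(lenient)
print(strict_failed)
```

Running a large file through a strict reader first may be a cheap way to find out how many rows are affected before deciding how to repair them.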

If these quotes are properly escaped, your sample input CSV file becomes

infile.csv (escaped quotes):

"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"
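For a file with 200,000 rows the escaping cannot be done by hand. The sketch below is one possible heuristic, not a general CSV repair tool. It assumes that every line starts and ends with a double quote, that the three-character sequence "," occurs only between cells, that no cell spans several lines, and that a doubled quote inside a cell is an already escaped single quote. repair_line is a hypothetical helper name.

```python
import csv
import io

def repair_line(line):
    """Re-escape the quotes inside each cell of one fully quoted CSV line."""
    body = line.rstrip("\r\n")
    # Assumption: outer quotes are present and '","' only separates cells.
    cells = body[1:-1].split('","')
    fixed = []
    for cell in cells:
        # Collapse already doubled quotes, then double every quote.
        fixed.append(cell.replace('""', '"').replace('"', '""'))
    return '"' + '","'.join(fixed) + '"'

raw = '"cde","cde","cde","some other, "cde" here","cde","cde","cde"'
repaired = repair_line(raw)
fields = next(csv.reader(io.StringIO(repaired)))

print(repaired)
print(len(fields), fields[3])
```

If any of the stated assumptions does not hold for the real data, the heuristic will silently produce wrong cells, so it is worth checking that every repaired line parses to the expected number of columns.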

Now consider this modified Python script that doesn't merge columns on the first row:

#!/usr/bin/env python3

import csv

# newline='' lets the csv module handle line endings itself
with open('infile.csv', 'r', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        # Drop fields 6 and 7 on every row, including the header
        del row[5:7]

        # Copy second field (index 1) to the end
        row.append(row[1])

        # Remove second and third fields
        del row[1:3]

        # Write manipulated row
        outwriter.writerow(row)

The output outfile.csv is

"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"

This is your sample output, but with properly escaped "some other, ""cde"" here".
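For experimenting without touching files, the same transformation can be sketched as a generator over in-memory text. This is only a restatement of the script's logic under the assumption that the input quotes are already escaped; reshape is a hypothetical name, and it builds each output row by picking fields rather than deleting them in place.

```python
import csv
import io

def reshape(rows):
    """Yield Column1, Column4, merged columns 5-7, Column2 for each row."""
    for i, row in enumerate(rows):
        # Keep the plain Column5 header on the first row; merge elsewhere.
        merged = row[4] if i == 0 else "</br>".join(row[4:7])
        yield [row[0], row[3], merged, row[1]]

source = (
    '"Column1","Column2","Column3","Column4","Column5","Column6","Column7"\n'
    '"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"\n'
    '"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"\n'
)

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(reshape(csv.reader(io.StringIO(source))))
result = out.getvalue()

print(result, end="")
```

Picking fields by index may scale more comfortably to the real 300-column file, where listing the columns to keep is simpler than tracking how each deletion shifts the remaining indexes.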

This may not be precisely what you wanted, not being a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes a few opportunities to make mistakes.

Answer 2

This might be an oversimplification of the problem, but this has worked for me with your test data:

sed 's@","@|@g' /tmp/inputfile.csv | awk 'BEGIN {FS="|"; s="\",\""} {sub(/"$/, "", $7); print $1 s $4 s $5 "</br>" $6 "</br>" $7 s $2 "\""}'

Please note that I am on a Mac, which is probably why I had to wrap the commas in the AWK script in quotation marks.
