Converting a TXT file with double quotes to a pipe-delimited format using sed

I'm trying to convert TXT files into pipe-delimited text files.

Let's say I have a file called sample.csv:

aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z

I'd like to convert this into an output that looks like this:

aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z

After a lot of searching, the closest I have come is this sed command:

sed -r 's/""/\v/g;s/("([^"]+)")?,/\2\|/g;s/"([^"]+)"$/\1/;s/\v/"/g'

However, the output that I received was:

aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|pppqqq|rrr" sss|ttt,"uuu|Z

The 9th column should have been ppp"qqq", but the double quotes were removed and I got pppqqq instead.

I have been playing around with this for a while, but to no avail. Any help with this would be highly appreciated.

As suggested in the comments, sed (or a similar line-oriented Unix text tool) is not recommended for this kind of complex CSV string. It is much better to use a dedicated CSV parser, for example in PHP:

<?php
$s = 'aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z';
echo implode('|', str_getcsv($s));
// Output (note the 8th field: nnnooo, not "nnn"ooo):
// aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|nnnooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
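
For comparison, here is a rough Python equivalent using the standard csv module. It is only a sketch and not part of the original answer; the helper name csv_line_to_pipes is made up for illustration. With the default dialect, a quote inside an unquoted field is kept literally, while the quoted prefix of "nnn"ooo is unwrapped, so on this input the result should match the PHP output above, including nnnooo in the 8th field.

```python
import csv


def csv_line_to_pipes(line: str) -> str:
    """Parse one CSV line with the default dialect and join its fields with '|'."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    fields = next(csv.reader([line]))
    return "|".join(fields)


raw = 'aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z'
print(csv_line_to_pipes(raw))
```

The 9th field keeps its quotes here because it does not start with a quote, so the parser treats the whole field as unquoted text.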

The problem with sample.csv is that it mixes non-quoted fields (containing quotes) with fully quoted fields (that should be treated as such).

You can't have both at the same time. Either all fields are (treated as) unquoted and quotes are preserved, or all fields containing a quote (or separator) are fully quoted and the quotes inside are escaped with another quote.

So, sample.csv should become:

"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z

to give you the desired result (using a CSV parser):

aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
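
As an illustration of that rule, the consistently quoted line can be parsed with Python's standard csv module, and the parsed fields can be written back out to see which ones the writer decides to quote. This is a sketch under the assumption that the default dialect (comma delimiter, doubled quotes as the escape) is what is wanted; the variable names are hypothetical.

```python
import csv
import io

fixed = '"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z'

# Parse the consistently quoted line; each doubled quote collapses to one literal quote.
fields = next(csv.reader([fixed]))
print("|".join(fields))

# Going the other way: let csv.writer decide which fields need quoting.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(fields)
requoted = buf.getvalue().rstrip("\n")
print(requoted)
```

The writer only quotes fields that contain a quote or a comma, so a field such as jjj kkk comes back unquoted, but parsing the rewritten line yields the same fields.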

I have the same problem. I got the right result with https://www.papaparse.com/demo. It is FOSS and available on GitHub, so maybe you can check how it works. With this source:

"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z

the result appears in the browser console: https://i.stack.imgur.com/OB5OM.png
