
Regular expression for parsing CSV in PHP

I already managed to split the CSV file using this regex: "/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/"

But I ended up with an array of strings that still contain the opening and closing double quotes. Now I need a regex that strips those delimiter double quotes from the strings.

As far as I know, the CSV format can enclose strings in double quotes, and any double quotes that are already part of the string are doubled. For example:

My "other" cat

becomes

"My ""other"" cat"

What I basically need is a regex that replaces every sequence of N double quotes with a sequence of N/2 (rounded down) double quotes.

Or is there a better way? Thanks in advance.
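The N to N/2 rule described in the question can be sketched with preg_replace_callback. This is only an illustration under stated assumptions: the function name collapse_quote_runs is hypothetical, the input is a single field that has already been split out, and intdiv requires PHP 7 or later.

```php
<?php
// Hypothetical helper: replace each run of N double quotes
// with floor(N / 2) double quotes.
function collapse_quote_runs(string $field): string
{
    return preg_replace_callback(
        '/"+/',
        function (array $m): string {
            // strlen($m[0]) is N, the length of the matched run.
            return str_repeat('"', intdiv(strlen($m[0]), 2));
        },
        $field
    );
}

echo collapse_quote_runs('"My ""other"" cat"'), "\n"; // My "other" cat
```

One caveat with this rule: an empty quoted field (`""`) is a run of two quotes, so it comes out as a single quote rather than an empty string.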

There is a function for reading CSV files: fgetcsv

Why do you bother splitting the file with a regex when there's the fgetcsv function that does all the hard work for you?

You can pass in the field separator and the enclosure (quote) character, and it will handle the splitting and unquoting for you.
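A minimal sketch of that approach follows. The sample data and the php://memory stream are assumptions made so the snippet runs without a file on disk; with a real file you would pass the handle returned by fopen on that file instead.

```php
<?php
// Build an in-memory stream holding sample CSV data (illustrative only).
$handle = fopen('php://memory', 'r+');
fwrite($handle, "id,name\n1,\"My \"\"other\"\" cat\"\n2,\"Smith, John\"\n");
rewind($handle);

$rows = [];
// Arguments: stream, max line length (0 = no limit),
// separator, enclosure, escape character.
while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

Each row comes back as an array of fields with the enclosing quotes already removed and doubled quotes collapsed, so no post-processing regex is needed.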

For those of you who want to use a regex instead of fgetcsv, here is a complete example of how to create an HTML table from a CSV file using a regex.

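    $data = file_get_contents('test.csv');
    $pieces = explode("\n", $data);

    // Initialise the buffer before appending to it.
    $html = "<table border='1'>\n";
    // array_filter() drops empty lines, such as the one after a trailing newline.
    foreach (array_filter($pieces) as $line) {
        $html .= "<tr>\n";
        // Split on commas followed by an even number of double quotes,
        // that is, commas outside a quoted field.
        $keywords = preg_split('/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/', $line, -1, PREG_SPLIT_DELIM_CAPTURE);

        foreach ($keywords as $col) {
            // trim() removes the enclosing quotes; doubled quotes inside
            // the field are left as-is. Escape the value for HTML output.
            $html .= "<td>" . htmlspecialchars(trim($col, '"')) . "</td>\n";
        }
        $html .= "</tr>\n";
    }
    $html .= "</table>\n";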

I agree with the others who said you should use the fgetcsv function instead of regexes. A regex may work okay on well-formed CSV data, but if the CSV is malformed or corrupt, the regex will silently fail, probably returning bogus results in the process.

However, the question was specifically about stripping unwanted quotation marks after the initial split. The one proposed solution (so far) is too naive, and it only deals with the escaped quotes inside a field, not the actual delimiters. (I know the OP didn't ask about those, but they do need to be removed, so why not do them at the same time as the others?) Here's my solution:

$csv_field = preg_replace('/"(.|$)/', '\1', $csv_field);

This regex matches a quotation mark followed by any character or by the end of the string, and replaces the matched character(s) with the second character, or with the empty string if it was the $ that matched. According to the spec, CSV fields can contain line separators; that doesn't seem to happen much, but you can add the 's' modifier to the regex if you need to.
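As a quick check of the replacement described above, here is a small sketch that applies it to a few sample fields. The wrapper name strip_csv_quotes and the sample inputs are assumptions for illustration, and the 's' modifier is included as mentioned.

```php
<?php
// Hypothetical wrapper around the replacement described above.
function strip_csv_quotes(string $csv_field): string
{
    // A quote followed by any character keeps only that character;
    // a quote at the end of the string is dropped.
    // The 's' modifier lets '.' match newlines inside a field.
    return preg_replace('/"(.|$)/s', '\1', $csv_field);
}

echo strip_csv_quotes('"My ""other"" cat"'), "\n"; // My "other" cat
echo strip_csv_quotes('plain'), "\n";              // plain
```

Note that an empty quoted field (`""`) is turned into a single quote by this pattern, so that case would need separate handling.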

preg_split('/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/', $line,-1,PREG_SPLIT_DELIM_CAPTURE);

This has problems with a " inside strings, such as "Toys"R"Us".

So you should use this instead:

preg_split('/'.$seperator.'(?=(?:[^\"])*(?![^\"]))/', $line,-1, PREG_SPLIT_DELIM_CAPTURE);

Here's my quick attempt, although it only works at word boundaries.

preg_replace('/([\W]){2}\b/', '\1', $csv)
