
Is the CSV format a regular grammar or a context-free grammar?

I am currently writing a CSV parser. The CSV format is defined by RFC 4180, which specifies it in ABNF, so the definition of CSV is certainly a context-free grammar. However, I would like to know whether CSV is also a regular grammar, so that I could parse it with just a finite state machine. Furthermore, if it is a regular grammar and can be parsed by a finite state machine, does that mean it can also be parsed by a regular expression?

I don't have any formal theory available to verify this, but I'm pretty sure CSV files can reliably be parsed with regular expressions. It's probably best to use two regexes, though:

  • One regex to match an entire CSV row (including line breaks in quoted fields)
  • Another regex (to be used on the match result of the first one) to match single fields

(unless you're using the .NET regex engine, which provides access to the individual captures of a repeating capturing group, or unless you know the number of columns in your CSV file beforehand and hard-code that into your regex).

A PCRE regex to match an entire CSV row could be:

/^(?:(?:[^",\r\n]*|"(?:""|[^"]*)*+")(?:,|$))*+(?=$)/m

You need to use the /m modifier here so that ^ and $ match at the start and end of each line, not only at the start and end of the whole string. If you're processing the file line by line, the regex will fail on a line that is not a complete CSV row (i.e. where a quoted field hasn't been closed yet), so you need to read the next line, append it to your test string and reapply the regex (you can drop the /m modifier in this scenario). Repeat until it matches.
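The line-by-line loop described above could be sketched in Python roughly as follows. This is an illustration, not part of the original answer: the names `ROW_RE` and `complete_rows` are made up here, and the pattern is an adaptation of the PCRE regex that drops the possessive quantifiers (Python's `re` only accepts them from 3.11) and matches quoted content one character at a time to keep backtracking cheap.

```python
import re

# Adapted from the PCRE row regex; used with fullmatch(), so no ^ or /m needed.
ROW_RE = re.compile(r'(?:(?:[^",\r\n]*|"(?:[^"]|"")*")(?:,|$))*')


def complete_rows(lines):
    """Yield complete CSV rows, joining physical lines until the row regex matches."""
    pending = ""
    for line in lines:
        pending += line
        # Test the accumulated text without its final line ending.
        candidate = pending.rstrip("\r\n")
        if ROW_RE.fullmatch(candidate):
            yield candidate
            pending = ""
    if pending:
        # Either a quoted field was never closed or the row is malformed.
        raise ValueError("input ended before a complete CSV row was found")


if __name__ == "__main__":
    physical_lines = ['a,"b\n', 'c",d\n', 'e,f\n']
    for row in complete_rows(physical_lines):
        print(repr(row))
```

Note that a malformed row (for example a stray quote inside an unquoted field) never matches, so this sketch keeps accumulating until the input ends and only then reports an error.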

Once you have that row, you can use this regex to match each successive field:

/([^",\r\n]*|"(?:""|[^"]*)*+")(?:,|$)/

The overall match here also includes the trailing comma when there is one, so the actual field's contents must be extracted from group 1. You will also need to process the surrounding and embedded quotes after the match.

Explanation:

^             # Start of line (/m modifier!)
(?:           # Start of non-capturing group (to contain the entire line):
 (?:          # Start of non-capturing group (to contain a single field):
  [^",\r\n]*  # Either match a run of characters except quotes, commas or newlines
 |            # or
  "           # Match a quoted field, starting with a quote, followed by
  (?:         # either...
   ""         # an escaped quote
  |           # or
   [^"]*      # anything that's not a quote
  )*+         # repeated as often as possible, no backtracking allowed
  "           # Then match a closing quote
 )            # End of group (=field)
 (?:,|$)      # Match a delimiter or the end of the line
)*+           # repeated as often as possible, no backtracking allowed
(?=$)         # Assert that we're now at the end of a line
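To make the second step concrete, here is a minimal Python sketch of splitting one complete row into fields and then stripping and undoubling the quotes. The names `FIELD_RE` and `split_row` are hypothetical, the row is assumed to have no trailing newline, and the pattern is an adaptation of the field regex without possessive quantifiers (which Python's `re` only supports from 3.11).

```python
import re

# Group 1 is the raw field, group 2 the delimiter (a comma, or empty at end of row).
FIELD_RE = re.compile(r'([^",\r\n]*|"(?:[^"]|"")*")(,|$)')


def split_row(row):
    """Split one complete CSV row (without trailing newline) into field values."""
    fields = []
    pos = 0
    while True:
        match = FIELD_RE.match(row, pos)
        if match is None:
            raise ValueError(f"malformed CSV row at offset {pos}")
        raw, delimiter = match.group(1), match.group(2)
        if raw.startswith('"'):
            # Strip the surrounding quotes, then undouble the embedded ones.
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
        pos = match.end()
        if delimiter == "":
            if pos != len(row):
                raise ValueError(f"unexpected text at offset {pos}")
            return fields


if __name__ == "__main__":
    print(split_row('a,"b ""c"" d",'))
```

A trailing comma yields a final empty field, which matches how the row regex treats the delimiter.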

There is no definite answer to this question because CSV is a very loose format. Among the CSV readers that I have looked at, some accept a regular language and others a context-free one. For example, some readers throw an exception if anything but a comma follows the closing quote of a quoted value.

You should be able to parse CSV files with a simple finite-state machine. Or, to be more precise, with one of a large number of simple FSMs, depending on the precise CSV format. (That doesn't mean it's a good idea. There are CSV parsing libraries which are much better at dealing with all the weird variants and unwritten rules of CSV files you might find in the wild.)

Here are some (untested) flex rules, without good error handling, for the simplest CSV variant:

  • fields are separated with ,

  • whitespace is not in any way special, except for unquoted newlines, which separate records

  • fields which include " , , or newline characters must be quoted; any field may be quoted.

  • a " in a quoted field is represented as two " characters.


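%%
    /* Indented so that flex copies these into yylex() as local variables. */
    int record = 1;
    int field = 1;

    /* + rather than *: a rule that can match the empty string never advances,
       so empty unquoted fields print nothing and are only counted by [,]. */
[^",\n]+/[^"]   { printf("Record %d Field %d: |%s|\n", record, field, yytext); }
[,]             { ++field; }
[\n]            { ++record; field = 1; }
["]([^"]|["]["])*["]/[,\n] {
                  printf("Record %d Field %d: |%s|\n", record, field, yytext); }
.               { printf("Something bad happened in record %d field %d\n",
                          record, field); }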

That doesn't handle quoted strings properly (i.e., it doesn't strip the quotes or undouble doubled quotes).

The simplest way to handle quoted fields is with a start condition (which is still implemented as part of an FSM):

%x QUOTED

%%
    /* Indented so that flex copies these into yylex() as local variables. */
    int record = 1;
    int field = 1;

    /* + rather than *: a rule that can match the empty string never advances. */
[^",\n]+/[^"]     { printf("Record %d Field %d: |%s|\n", record, field, yytext); }
[,]               { ++field; }
[\n]              { ++record; field = 1; }

["]               { printf("Record %d Field %d: |", record, field); BEGIN(QUOTED); }
<QUOTED>[^"]+     { printf("%s", yytext); }
<QUOTED>["]["]    { putchar('"'); }
<QUOTED>["]/[,\n] { putchar('|'); putchar('\n'); BEGIN(INITIAL); }

<*>.              { printf("Something bad happened in record %d field %d\n",
                           record, field); }
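For readers who would rather see the state machine spelled out than generated by flex, the same CSV variant can be sketched as a hand-written FSM in Python. This is an illustrative sketch only: the state names and the `parse_csv` function are invented here, `\r` is treated as an ordinary character (as in the flex rules), and error handling is minimal.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # nothing read yet for the current field
    UNQUOTED = auto()         # inside an unquoted field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a '"' inside a quoted field


def parse_csv(text):
    """Return a list of records, each a list of field strings."""
    state = State.FIELD_START
    records, record, field = [], [], []

    def end_field():
        record.append("".join(field))
        field.clear()

    def end_record():
        end_field()
        records.append(list(record))
        record.clear()

    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)  # commas and newlines are literal here
        elif state is State.QUOTE_IN_QUOTED and ch == '"':
            field.append('"')     # doubled quote becomes one literal quote
            state = State.QUOTED
        elif ch == ",":
            end_field()
            state = State.FIELD_START
        elif ch == "\n":
            end_record()
            state = State.FIELD_START
        elif ch == '"' and state is State.FIELD_START:
            state = State.QUOTED
        elif ch == '"' or state is State.QUOTE_IN_QUOTED:
            raise ValueError("misplaced quote or text after a closing quote")
        else:
            field.append(ch)
            state = State.UNQUOTED

    if state is State.QUOTED:
        raise ValueError("unterminated quoted field")
    if state is not State.FIELD_START or record:
        end_record()  # last record had no trailing newline
    return records


if __name__ == "__main__":
    print(parse_csv('a,"b ""q"", c"\n,x\n'))
```

The four states and a handful of transitions are all the memory the parser needs, which is the practical sense in which this CSV variant can be handled by a finite state machine.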

So the theory-based answer is no: the CSV file format is not a regular language (based on that RFC).

The main reason is this line from the specification:

Each line should contain the same number of fields throughout the file.

To formally prove that the file format is not a regular language, you would use the pumping lemma for regular languages.

Consider the string consisting of 2 lines, each containing p commas and therefore p + 1 empty cells, where p is the pumping length from the pumping lemma (so if p = 3, it would be ",,,\n,,,\n"). To satisfy the conditions |xy| <= p and |y| >= 1, "y" must consist of one or more commas in the first line of the file. If you then "pump" y, you will have more cells on the first line than on the second. Therefore, it is not a regular language.
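Written out symbolically, the argument can be sketched as below. The notation is introduced here for illustration and is not from the RFC: L is taken to be the set of files in which every line has the same number of fields, c stands for a comma and n for a newline, and the test string uses p commas per line so that the first p characters are all commas.

```latex
% c = comma, n = newline, p = pumping length, L = files with equal field counts
\begin{align*}
w &= c^{p}\, n\, c^{p}\, n \in L, \qquad |w| = 2p + 2 \ge p \\
w &= xyz,\ |xy| \le p,\ |y| \ge 1
   \;\Longrightarrow\; x = c^{a},\ y = c^{b},\ b \ge 1,\ a + b \le p \\
xy^{2}z &= c^{p+b}\, n\, c^{p}\, n \notin L
   \quad \text{since the first line has } p + b + 1 \text{ fields and the second } p + 1
\end{align*}
```

Because every admissible split puts y entirely inside the first run of commas, no choice of x, y, z keeps the pumped string in L, which contradicts the pumping lemma.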

However, as is often the case, the theoretical answer is likely not what you really need. For one, many regular expression syntaxes (and finite state machine syntaxes) in many programming languages actually support more than true regular languages.

Also, just because you can't verify that a string truly conforms to the CSV spec with a true regular expression does not mean that you can't still parse it with one. You may simply end up accepting slightly malformed CSV files (such as ones that have uneven row lengths).
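In practice the equal-width requirement is easy to enforce outside the grammar, with a counter kept alongside the parser. The sketch below uses Python's standard `csv` module for the parsing step; the function name `field_count` is made up for illustration, and the `csv` module's default dialect is more lenient than RFC 4180, so this is not a strict conformance check.

```python
import csv
import io


def field_count(text):
    """Parse CSV text and return the common field count of its rows.

    Raises ValueError if two rows have different numbers of fields.
    Returns None for empty input.
    """
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""))
    expected = None
    for number, row in enumerate(reader, start=1):
        if expected is None:
            expected = len(row)
        elif len(row) != expected:
            raise ValueError(
                f"row {number} has {len(row)} fields, expected {expected}"
            )
    return expected


if __name__ == "__main__":
    print(field_count("a,b\nc,d\n"))
```

The parser itself stays finite-state; only the comparison against `expected` needs memory that grows with the input, and that is the part the pumping-lemma argument is about.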
