Parsing an iCalendar file in C

I am looking to parse iCalendar files using C. I already have a data structure set up and the file reading in place, and I want to parse the file line by line into its components.

For example, I would need to parse something like the following:

UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party

Here are some of the rules:

  • The first word on each line is the property name.
  • The property name is followed by a colon (:) or a semicolon (;).
  • If it is a colon, the property value runs from just after the colon to the end of the line.
  • A further layer of complexity: a comma-separated list of values is allowed, which would then be stored in an array. The CATEGORIES line above, for example, would produce an array of 3 values.
  • If a semicolon follows the property name, then optional parameters follow.
  • The optional parameter format is ParamName=ParamValue. Again, a comma-separated list of values is supported here.
  • There can be more than one optional parameter, as seen on the ORGANIZER line. Each additional parameter is introduced by another semicolon, followed by its name and value.
  • And to throw in yet another wrench, quotation marks are allowed in the values. Anything inside quotes must be treated as part of the value rather than as syntax. So a semicolon inside quotes does not start another parameter; it is part of the value.

I was going about this using strchr() and strtok() and have extracted some basic elements that way. However, it is getting very messy and unorganized and does not seem to be the right way to do this.

How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (I am not looking for a whole solution, just a starting point.)

This answer assumes that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser, because its authors have already thought of and handled all the weird things that can come up.

My high-level approach would be:

  • Read a line
  • Pass a pointer to the start of this line to a function parse_line:
    • Use strcspn on the pointer to find the location of the first : or ; (aborting if no marker is found)
    • Save the text up to that point as the property name
    • While the parsing pointer points to ; :
      • Call a function extract_name_value_pair, passing the address of your parsing pointer.
      • That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value, and the fact that there might be a ; or : inside the value.
    • (At this point the parsing pointer is always on : )
    • Pass the rest of the string to a function parse_csv, which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
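
The outline above can be sketched roughly as follows. This is only an illustration under some assumptions: the helper skip_param is a hypothetical stand-in for extract_name_value_pair that steps over a parameter without saving it, and the parse_line signature (caller-supplied name buffer, pointer to the value returned) is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for extract_name_value_pair: on entry *pp is on a
   ';'. Advances *pp to the next ';' or ':' that lies outside double quotes.
   Returns 0 on success, -1 if the line ends before such a marker. */
static int skip_param(const char **pp)
{
    const char *p = *pp + 1;            /* step over the leading ';' */
    int in_quotes = 0;

    for (; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (!in_quotes && (*p == ';' || *p == ':'))
            break;
    }
    if (*p == '\0')
        return -1;
    *pp = p;
    return 0;
}

/* Copies the property name into name_buf and returns a pointer to the
   start of the value text, or NULL if the line is malformed. */
static const char *parse_line(const char *line, char *name_buf, size_t name_cap)
{
    assert(line != NULL && name_buf != NULL);

    size_t n = strcspn(line, ":;");     /* length of the property name */
    if (n == 0 || line[n] == '\0' || n >= name_cap)
        return NULL;                    /* no name, no marker, or no room */
    memcpy(name_buf, line, n);
    name_buf[n] = '\0';

    const char *p = line + n;
    while (*p == ';')                   /* one iteration per parameter */
        if (skip_param(&p) != 0)
            return NULL;

    return p + 1;                       /* p is on ':'; the value follows */
}
```

With the ORGANIZER line from the question, this sketch stores "ORGANIZER" as the name and returns a pointer to "mailto:john.doe@example.com", because the colon inside the quoted SENT-BY value is not treated as the end of the parameters.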

The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function, which calls those functions as needed.
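
As an illustration of testing a helper in isolation, here is a minimal sketch of the scanning core that a parse_csv might be built on, together with a few assert-based checks. The names next_field, count_fields and test_count_fields are hypothetical. The sketch only honours double quotes as described in the question; it does not handle backslash-escaped commas and does not trim the space after a comma.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scans one comma-separated field starting at s.
   Stores the field length in *len and returns a pointer to the start of
   the next field, or NULL if this was the last field on the line. */
static const char *next_field(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);

    const char *p = s;
    int in_quotes = 0;

    for (; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (*p == ',' && !in_quotes)
            break;                      /* unquoted comma ends the field */
    }
    *len = (size_t)(p - s);
    return *p == ',' ? p + 1 : NULL;
}

/* Counts the fields in a value string by walking it with next_field. */
static size_t count_fields(const char *s)
{
    size_t n = 0, len;

    while (s != NULL) {
        s = next_field(s, &len);
        n++;
    }
    return n;
}

/* A tiny test suite that can run before any real parser exists. */
static void test_count_fields(void)
{
    assert(count_fields("Project Report, XYZ, Weekly Meeting") == 3);
    assert(count_fields("\"a,b\",c") == 2);  /* quoted comma is not a separator */
    assert(count_fields("") == 1);           /* empty value is one empty field */
}
```

Checks like these are cheap to extend as new edge cases turn up, and they keep failures localized to the helper rather than to the whole parser.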


Also, write all the memory allocation code as separate functions. Think about what data structure you want to store your parsed result in. Then code up that data structure and test it, entirely independently of the parsing code. Only then write the parsing code, and have it call functions to insert the resulting data into the data structure.

You really don't want to have memory management code mixed up with parsing code. That makes it far harder to debug.
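
A minimal sketch of what such a separately testable data structure could look like is shown below. The struct layout and the function names (property_new, property_add_value, property_free, dup_range) are assumptions made for illustration; parameters are left out to keep it short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one parsed property and its values. */
struct property {
    char  *name;
    char **values;
    size_t nvalues;
};

/* Copies len bytes of s into a fresh null-terminated string. */
static char *dup_range(const char *s, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, s, len);
        copy[len] = '\0';
    }
    return copy;
}

/* Creates a property with the given name; returns NULL on failure. */
static struct property *property_new(const char *name, size_t len)
{
    struct property *prop = calloc(1, sizeof *prop);
    if (prop == NULL)
        return NULL;
    prop->name = dup_range(name, len);
    if (prop->name == NULL) {
        free(prop);
        return NULL;
    }
    return prop;
}

/* Appends a copy of one value; returns 0 on success, -1 on failure. */
static int property_add_value(struct property *prop, const char *s, size_t len)
{
    assert(prop != NULL && s != NULL);

    char **grown = realloc(prop->values, (prop->nvalues + 1) * sizeof *grown);
    if (grown == NULL)
        return -1;
    prop->values = grown;
    grown[prop->nvalues] = dup_range(s, len);
    if (grown[prop->nvalues] == NULL)
        return -1;
    prop->nvalues++;
    return 0;
}

/* Releases the property and everything it owns. */
static void property_free(struct property *prop)
{
    if (prop == NULL)
        return;
    for (size_t i = 0; i < prop->nvalues; i++)
        free(prop->values[i]);
    free(prop->values);
    free(prop->name);
    free(prop);
}
```

Because every function takes a pointer and a length, the parsing code can hand over slices of the input line without writing null terminators into it, and it never has to call malloc itself.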


When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need), you have a few options as to its interface:

  • Accept a pointer to a null-terminated string
  • Accept a pointer to the start and a pointer to one-past-the-end
  • Accept a pointer to the start, and an integer length

Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.

Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
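
To make the trade-off concrete, here is the same small job (finding the length of the property name) written against each of the three interfaces. The function names are hypothetical. Note that only the null-terminated version can use strcspn directly; the other two need an explicit loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option 1: null-terminated string, so strcspn can be used as-is. */
static size_t name_len_z(const char *s)
{
    assert(s != NULL);
    return strcspn(s, ":;");
}

/* Option 2: start pointer and one-past-the-end pointer. strcspn cannot be
   used because the range need not end in '\0', hence the manual loop. */
static size_t name_len_range(const char *begin, const char *end)
{
    assert(begin != NULL && end != NULL);

    const char *p = begin;
    while (p != end && *p != ':' && *p != ';')
        p++;
    return (size_t)(p - begin);
}

/* Option 3: start pointer and length, forwarded to option 2. */
static size_t name_len_n(const char *s, size_t len)
{
    return name_len_range(s, s + len);
}
```

All three return the length of the whole input when no marker is present, so the caller still has to check what character, if any, sits at that position.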

  • Accept a pointer to character and return the number of characters consumed; the calling function adds the two together to find where parsing stopped
  • Accept a pointer to pointer to character, and update the pointer to character. The return value can then be used for an error code.

There's no one right answer; with experience you will get better at deciding which option leads to the cleanest code.
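
The two conventions can be compared side by side in a small sketch. The functions skip_name and skip_name_p are hypothetical, and the rule that a name consists of letters, digits and hyphens is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Option A: take a pointer to character and return how many characters
   were consumed. Here a name is assumed to be letters, digits and '-'. */
static size_t skip_name(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '-')
        n++;
    return n;
}

/* Option B: take a pointer to the caller's pointer and advance it. The
   return value is then free to act as an error code (0 ok, -1 no name). */
static int skip_name_p(const char **pp)
{
    assert(pp != NULL && *pp != NULL);

    size_t n = skip_name(*pp);
    if (n == 0)
        return -1;                      /* nothing consumed, *pp unchanged */
    *pp += n;
    return 0;
}
```

With option A the caller writes p += skip_name(p) and must encode failure in the count somehow; with option B the caller writes if (skip_name_p(&p) != 0) and the pointer is already in the right place on success.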


 