
Regex parsing of iCalendar (Ruby regex)

I'm trying to parse iCalendar (RFC2445) input using a regex.

Here's a [simplified] example of what the input looks like:

BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT

I'd like to get an array of matches: the "outer" match is each VEVENT block and the inner matches are each of the field:value pairs.

I've tried variants of this:

BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT

But given the input above, the result seems to have only ONE field for each matching VEVENT, despite the +? on the capture group:

**Match 1**
field   def:456
name    def
value   456

**Match 2**
field   ghi:789
name    ghi
value   789

In the first match, I would have expected TWO fields: the abc:123 and the def:456 matches...

I'm sure this is a newbie mistake (since I seem to perpetually be a newbie when it comes to regexes...) - but maybe you can point me in the right direction?

Thanks!

Use the icalendar gem. See the Parsing iCalendars section for more info.

You need a nested scan.

string.scan(/^BEGIN:VEVENT\n(.*?)\nEND:VEVENT$/m).each.with_index do |item, i|
  puts
  puts "**Match #{i+1}**"
  item.first.scan(/^(.*?):(.*)$/) do |k, v|
    puts "field".ljust(7)+"#{k}:#{v}"
    puts "name".ljust(7)+"#{k}"
    puts "value".ljust(7)+"#{v}"
  end
end

will give:

**Match 1**
field   abc:123
name    abc
value   123
field   def:456
name    def
value   456

**Match 2**
field   ghi:789
name    ghi
value   789
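If you want the nested array the question asks for rather than printed output, the same two-level scan can collect the pairs. This is a minimal sketch, assuming the sample input is held in a variable called input (a name chosen here for illustration):

```ruby
# Sample input from the question; `input` is an illustrative name.
input = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\n" \
        "BEGIN:VEVENT\nghi:789\nEND:VEVENT\n"

# Outer scan: one capture per VEVENT body (/m lets . span newlines).
# Inner scan: one [name, value] pair per line of that body.
events = input.scan(/^BEGIN:VEVENT\n(.*?)\nEND:VEVENT$/m).map do |(body)|
  body.scan(/^([^:\n]+):(.*)$/)
end

p events
# => [[["abc", "123"], ["def", "456"]], [["ghi", "789"]]]
```

Each outer element is one VEVENT and each inner element is a [name, value] pair.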

You need to split your regex up into one matching a VEVENT and one matching the name/value pairs. You can then use a nested scan to find all occurrences, e.g.

str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # The value uses a greedy \S+; a lazy \S+? would stop after one character.
  $~[:vevent].scan(/(?<field>(?<name>\S+?):\s*(?<value>\S+))/) do
    p $~[:field], $~[:name], $~[:value]
  end
end

where str is your input. This outputs:

"abc:123"
"abc"
"123"
"def:456"
"def"
"456"
"ghi:789"
"ghi"
"789"

If you want to make the code more readable, I suggest you require 'English' (note the capital E) and replace $~ with $LAST_MATCH_INFO.
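For illustration, here is roughly how that could look. It is a sketch rather than a definitive version: the pairs array and the sample str are assumptions added here, and the value group is written as a greedy \S+ so the whole value is captured.

```ruby
require 'English' # stdlib; defines $LAST_MATCH_INFO as an alias for $~

# Sample input from the question.
str = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\n" \
      "BEGIN:VEVENT\nghi:789\nEND:VEVENT\n"

pairs = [] # illustrative accumulator, not part of the original answer
str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # Inside the block, $LAST_MATCH_INFO is the MatchData of the current match.
  $LAST_MATCH_INFO[:vevent].scan(/(?<name>\S+?):\s*(?<value>\S+)/) do
    pairs << [$LAST_MATCH_INFO[:name], $LAST_MATCH_INFO[:value]]
  end
end

p pairs
# => [["abc", "123"], ["def", "456"], ["ghi", "789"]]
```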

I think the problem is that Ruby's MatchData object, which is what the regexp returns its results in, has no provision for more than one value under the same name. Each repetition of the group overwrites the previous one, so only the last field:value pair survives.
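A quick way to see this is to run the pattern from the question against a single event. This is a small sketch; the variable names are illustrative:

```ruby
# One VEVENT with two fields.
input = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\n"

# The pattern from the question: the `field` group repeats, but MatchData
# stores a single value per group name.
pattern = /BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT/

m = pattern.match(input)
p m[:field] # => "def:456\n"
p m[:name]  # => "def"
p m[:value] # => "456"
```

The abc:123 line takes part in the match, but its captures are replaced when the group matches again on def:456.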

Ruby has a seldom-used method called slice_before that fits this need well:

'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).to_a

Results in:

[["BEGIN:VEVENT", "abc:123", "def:456", "END:VEVENT"],
 ["BEGIN:VEVENT", "ghi:789", "END:VEVENT"]]    

From there it's simple to grab just the inner array elements:

'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).map{ |a| a[1 .. -2] }

Which is:

[["abc:123", "def:456"], ["ghi:789"]]

And from there it's trivial to break up each resulting string using map and split(':').
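That last step might look like the sketch below. Two choices here are assumptions rather than part of the answer above: split(':', 2) is used so that a value containing a colon stays in one piece, and each event is converted to a Hash.

```ruby
# Sample input from the question, as a heredoc.
ical = <<~ICS
  BEGIN:VEVENT
  abc:123
  def:456
  END:VEVENT
  BEGIN:VEVENT
  ghi:789
  END:VEVENT
ICS

events = ical.split("\n")
             .slice_before(/^BEGIN:VEVENT/)   # one chunk per VEVENT
             .map do |lines|
  # Drop the BEGIN/END lines, then split each field at the first colon only.
  lines[1..-2].map { |line| line.split(':', 2) }.to_h
end

p events
# => [{"abc"=>"123", "def"=>"456"}, {"ghi"=>"789"}]
```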

Don't be seduced by the siren call of regular expressions trying to do everything. They're very powerful and convenient in their particular place, but often there are simpler, easier-to-maintain solutions.
