Ruby Regex to capture everything between two strings (inclusive)

I'm trying to sanitize some HTML by removing a single tag, and I'd really like to avoid using Nokogiri or a similar library. The following string appears in my input, and I want to get rid of it:

<div class="the_class">Some junk here that's different every time</div>

This appears exactly once in my string, and I'd like to find a way to remove it. I've tried coming up with a regex to capture all of it, but I can't find one that works.

I've tried /<div class="the_class">(.*)<\/div>/m and that works, but it also matches up to and including any later </div> tags in the document, which I don't want.

Any ideas on how to approach this?

I believe you're looking for a non-greedy regex, like this:

/<div class="the_class">(.*?)<\/div>/m

Note the added ?. Now the capturing group will capture as little as possible (non-greedy) instead of as much as possible (greedy).
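To see the difference when actually removing the tag, here is a minimal sketch using String#sub. The surrounding header and footer divs and the method names are made up for illustration; only the target div comes from the question.

```ruby
# Hypothetical sample document; only the target div comes from the question.
def sample_html
  <<~HTML
    <div id="header">Header</div>
    <div class="the_class">Some junk here
    that's different every time</div>
    <div id="footer">Footer</div>
  HTML
end

# Greedy: .* runs to the LAST </div> in the document.
def remove_greedy(html)
  html.sub(/<div class="the_class">.*<\/div>/m, '')
end

# Non-greedy: .*? stops at the FIRST </div> after the opening tag.
def remove_lazy(html)
  html.sub(/<div class="the_class">.*?<\/div>/m, '')
end

puts remove_greedy(sample_html)  # the footer div is removed as well
puts remove_lazy(sample_html)    # header and footer both survive
```

The /m flag matters here because the junk may span several lines, and in Ruby the dot only matches newlines when /m is set.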

Because it adds another dependency and slows my work down. It makes things more complicated. Plus, this solution applies to more than just HTML tags: my start and end strings can be anything.
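If the start and end strings really can be anything, one possible approach is to build the pattern from them at runtime. The sketch below assumes a helper named remove_between, which is hypothetical and not from the original thread; it uses Regexp.escape so that characters like "[" or "." in the delimiters are matched literally.

```ruby
# Hypothetical helper (not from the original thread): removes the first span
# that starts with start_str and ends with the nearest following end_str,
# including both delimiters.
def remove_between(text, start_str, end_str)
  # Regexp.escape neutralizes regex metacharacters in the delimiters.
  pattern = /#{Regexp.escape(start_str)}.*?#{Regexp.escape(end_str)}/m
  text.sub(pattern, '')
end

puts remove_between('keep [[drop this]] keep', '[[', ']]')
```

Swapping sub for gsub would remove every such span rather than only the first one.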

I used to think the same way until I got a job writing spiders and web-site analytics, and then a big RSS-aggregation system. A parser was the only way out of that madness; without it the work would never have been finished.

Yes, regexes are good and useful, but there are dragons waiting for you. For instance, this common kind of string will cause problems:

'<div class="the_class"><div class="inner_div">foo</div></div>'

The regex /<div class="the_class">(.*?)<\/div>/m stops at the inner </div> and will return:

"<div class=\"the_class\"><div class=\"inner_div\">foo</div>"

This malformed but renderable HTML:

<div class="the_class"><div class="inner_div">foo

is even worse:

'<div class="the_class"><div class="inner_div">foo'[/<div class="the_class">(.*?)<\/div>/m]
=> nil
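Applied to the actual goal of removing the tag, the same two inputs behave as sketched below. The method name strip_the_class and the surrounding <p> elements are assumptions added for illustration.

```ruby
# Hypothetical helper applying the non-greedy pattern from the answer above.
def strip_the_class(html)
  html.sub(/<div class="the_class">.*?<\/div>/m, '')
end

nested   = '<p>before</p><div class="the_class"><div class="inner_div">foo</div></div><p>after</p>'
unclosed = '<p>before</p><div class="the_class"><div class="inner_div">foo'

# Nested case: the match ends at the inner </div>, so the outer closing tag
# is left behind as a stray </div>.
p strip_the_class(nested)    # => "<p>before</p></div><p>after</p>"

# Unclosed case: there is no </div> at all, so nothing matches and the
# string comes back unchanged.
p strip_the_class(unclosed)
```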

Whereas a parser can deal with both:

require 'nokogiri'

# Parse both the well-formed and the malformed sample, then print the text
# inside the target div.
[
  '<div class="the_class"><div class="inner_div">foo</div></div>',
  '<div class="the_class"><div class="inner_div">foo'
].each do |html|
  doc = Nokogiri::HTML(html)
  puts doc.at('div.the_class').text
end

Outputs:

foo
foo

Yes, your start and end strings could be anything, but there are well-recognized tools for parsing HTML/XML, and as your task grows, the weaknesses of using regex will become more apparent.

And yes, it's possible for a parser to fail. I've had to process RSS feeds that were so badly malformed that the parser blew up, but a bit of pre-processing fixed the problem.
