
Regular expression to remove only beginning and end html tags from string?

I would like to remove, for example, <div><p> and </p></div> from the string below. The regex should be able to remove an arbitrary number of tags from the beginning and end of the string.

<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>

I have been tinkering with rubular.com without success. Thanks!

 def remove_html_end_tags(html_str)
   html_str.match(/\<(.+)\>(?!\W*\<)(.+)\<\/\1\>/m)[2]
 end

I'm not seeing the problem of \<(.+)> consuming multiple opening tags that Alan Moore pointed out below, which is odd because I agree it's incorrect. It should be changed to \<([^>\<]+)> or something similar, so that the capture cannot run past the first closing angle bracket.

 def remove_html_end_tags(html_str)
    html_str.match(/\<([^\>\<]+)\>(?!\W*?\<)(.+)\<\/\1\>/m)[2]
 end

The idea is that you want to capture everything between the open and close of the first tag encountered that is not followed immediately by another tag, even with spaces between.

Since I wasn't sure how (with positive lookahead) to say "give me the first tag whose closing angle bracket is followed by at least one word character before the next opening angle bracket", I said

\>(?!\W*\<)

find the closing angle bracket that does not have only non-word characters before the next opening angle bracket.

Once you've identified the tag with that property, find its closing mate and return the stuff between.
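To see the negative lookahead on its own, here is a minimal sketch run against the sample string from the question. The variable names (html, first_text_tag, tag_name) are made up for illustration, and the pattern is only the opening-tag half of the regex discussed above.

```ruby
# Sample input from the question.
html = "<div><p>text to <span class=\"test\">test</span> the selection on.\n" \
       "Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"

# An opening tag whose '>' is NOT followed by only non-word characters
# and then another '<' (i.e. the tag is not immediately followed by a tag).
first_text_tag = /<([^<>]+)>(?!\W*<)/

m = html.match(first_text_tag)
tag_name = m[1]
puts tag_name   # prints "p": <div> is skipped because <p> follows it directly
```

Whitespace between the tags does not change the result, because \W* also consumes spaces and newlines before the lookahead checks for the next <.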

Here's another approach. Find tags scanning forward and remove the first n. It would blow up with nested tags of the same type, but I wouldn't take this approach for any real work.

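def remove_first_n_html_tags(html_str, skip_count=0)
  matches = []
  # Collect every opening tag (name plus attributes), in document order.
  tags = html_str.scan(/\<([\w\s\_\-\d\"\'\=]+)\>/).flatten
  tags.each do |tag|
    # The closing tag uses only the tag name, not the attributes.
    close_tag = "\/%s" % tag.split(/\s+/).first
    match_str = "<#{tag}>(.+)<#{close_tag}>"
    match = html_str.match(/#{match_str}/m)
    matches << match if match
  end
  # MatchData for the tag at index skip_count (0-based); [1] is the text inside it.
  matches[skip_count]
end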

Still involves some programming:

str = '<div><p>text to <span class="test">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>'

while (m = /\A<.+?>/.match(str)) && str.end_with?('</' + m[0][1..-1])
  str = str[m[0].size..-(m[0].size + 2)]
end
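The loop above compares the whole opening tag against the end of the string, so an opening tag with attributes never finds its closing tag. A possible variation is sketched below as a method that compares only the tag name. The name strip_outer_tags is hypothetical and not part of the original answer. Like the loop, it only checks that the string ends with a closing tag of the same name, not that the two tags are really a pair.

```ruby
# Hypothetical helper: peel matching outer tags off both ends of a string.
def strip_outer_tags(str)
  loop do
    # Opening tag at the very start; group 1 is the tag name without attributes.
    m = /\A<([A-Za-z][A-Za-z0-9]*)[^>]*>/.match(str)
    break unless m

    close_tag = "</#{m[1]}>"
    break unless str.end_with?(close_tag)

    # Keep everything between the opening tag and the closing tag.
    str = str[m[0].size...(str.size - close_tag.size)]
  end
  str
end

puts strip_outer_tags('<div class="a"><p>hi</p></div>')   # prints "hi"
```

A string such as <b>a</b> and <b>c</b> would still be stripped incorrectly, which is one reason a real HTML parser is the safer choice for arbitrary input.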

Cthulhu you out there?

(?:\<div.*?\>\<p.*?\>)|(?:\<\/p\>\<\/div\>) is the expression you need. But this doesn't check for every scenario... if you are trying to parse any possible combination of tags, you may want to look at other ways to parse.

For example, this expression doesn't allow for any whitespace between the div and p tags. If you wanted to allow for that, you would add \s* between the \>\< sections of the tags, like so: (?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>).

The div tag and the p tag are expected to be lowercase, as the expression is written. So you may want to figure out a way to check for upper- or lowercase letters for each, so that Div or dIV would be found too.
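One way to handle mixed case in Ruby is the i (case-insensitive) flag on the regex literal. The sketch below uses the same expression as above with the redundant backslashes dropped. The test string and variable names are made up for illustration.

```ruby
# Tags written in mixed case; made-up input for illustration.
str = "<DIV><P>Hello</P></Div>"

# Same pattern as above, with the /i flag so div, Div, dIV etc. all match.
pattern = /(?:<div.*?>\s*<p.*?>)|(?:<\/p>\s*<\/div>)/i

result = str.gsub(pattern, "")
puts result   # prints "Hello"
```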

Use gskinner's RegEx tool for testing and learning Regular Expressions.

So your end Ruby code should look something like this:

# Ruby sample for showing the use of regular expressions

str = "<div><p>text to <span class=\"test\">test</span> the selection on.
Kibology for <b>all</b><br>. All <i>for</i> Kibology.</p></div>"

puts 'Before Regular Expression: "', str, '"'

str.gsub!(/(?:\<div.*?\>\s*\<p.*?\>)|(?:\<\/p\>\s*\<\/div\>)/, "")

puts 'After Regular Expression', str

system("pause")

EDIT: Replaced div*? with div.*? and p*? with p.*? per suggestions in the comments.
EDIT: This answer doesn't allow for any set of tags, just the two listed in the first line of the question.

I am going to go ahead and answer my own question. Below is the programmatic route:

The input string is split into an array of characters and goes into the first loop, which removes the front tags. The resulting string is then looped through in reverse order to remove the end tags. That pass collects the characters backwards, so the string is reversed at the end to put it in the correct order.

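def remove_html_end_tags(html_str)
  str_no_start_tag = ''
  str_no_start_and_end_tag = ''

  # Forward pass: skip every tag that appears before the first text character.
  a = html_str.split("")
  i = 0
  is_text = false
  while i <= (a.length - 1)
    if (a[i] == '<') && !is_text
      # Skip to the closing '>' (stop at the end if the tag is never closed).
      while i < a.length && a[i] != '>'
        i += 1
      end
      i += 1
    else
      is_text = true
      str_no_start_tag << a[i]
      i += 1
    end
  end

  # Backward pass: skip every tag that appears after the last text character.
  a = str_no_start_tag.split("")
  i = a.length - 1
  is_text = false
  while i >= 0
    if (a[i] == '>') && !is_text
      # Walk back to the opening '<' (stop at the start if there is none).
      while i >= 0 && a[i] != '<'
        i -= 1
      end
      i -= 1
    else
      is_text = true
      str_no_start_and_end_tag << a[i]
      i -= 1
    end
  end

  # The backward pass collected the characters in reverse, so flip them back.
  str_no_start_and_end_tag.reverse!
end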
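For comparison, the same two passes can probably be written as two anchored substitutions. This is only a sketch, not part of the original answer, and the method name strip_edge_tags is hypothetical. On well-formed input it should behave like the loops: it strips every consecutive tag at each edge without checking that the tags pair up, so text that ends in an inline element also loses that element's closing tag.

```ruby
# Hypothetical compact variant: strip the run of tags at each end of the string.
def strip_edge_tags(html_str)
  html_str
    .sub(/\A(?:<[^<>]*>)+/, '')   # consecutive tags at the very start
    .sub(/(?:<[^<>]*>)+\z/, '')   # consecutive tags at the very end
end

puts strip_edge_tags("<div><p>hi <b>there</b>!</p></div>")   # prints "hi <b>there</b>!"
```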
