
Conditionally Strip HTML Node - Regexp/gsub

I want to generate a search preview of an article by removing certain HTML nodes together with their child nodes (particularly headers and images), and by stripping all other tags (e.g. paragraph tags) while leaving their child nodes in place.

For example:

"<h2>Subject</h2><p>Subject is the who, what, where, why and when.</p>".gsub(/<\/?[^>]*>/, '')

results in

SubjectSubject is the who, what, where, why and when.

However, I require:

Subject is the who, what, where, why and when.

I'm using the Rails plugin Loofah to sanitize user input, and this works great. In fact, I could define a scrubber to do this, but it seems that a regexp should be sufficient for such a simple operation.

Thanks in advance for any advice.

Use several regexps:

"<h2>Subject</h2><p>Subject is the who, what, where, why and when.</p>".
    gsub(/<h([1-6])\b[^>]*>.*?<\/h\1>/mi, '').
    gsub(/<img[^>]*>/,'').
    gsub(/<\/?[^>]*>/, '')

Note, however, that this is close to the limit of what regexps can handle when processing HTML. If you need to do anything more complicated (such as removing nodes based on class name), you should really be using an HTML parser.
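As a rough sketch, the chained substitutions can be wrapped in a small helper that also tolerates header attributes and nested inline tags. The method name `search_preview` and the whitespace clean-up are assumptions made for illustration, not part of the original answer, and the approach still only suits simple, well-formed markup.

```ruby
# Hypothetical helper: builds a plain-text preview from an HTML fragment.
# Regexp-based, so it is only a sketch for simple, well-formed markup.
def search_preview(html)
  # Drop headers (h1..h6) together with everything inside them.
  text = html.gsub(/<h([1-6])\b[^>]*>.*?<\/h\1>/mi, '')
  # Drop image tags entirely.
  text = text.gsub(/<img\b[^>]*>/i, '')
  # Replace every remaining tag with a space, keeping the enclosed text.
  text = text.gsub(/<\/?[^>]*>/, ' ')
  # Collapse runs of spaces and trim the ends.
  text.squeeze(' ').strip
end

puts search_preview('<h2 class="t"><b>Subject</b></h2><p>Subject is the who, what, where, why and when.</p>')
# prints: Subject is the who, what, where, why and when.
```

Replacing the remaining tags with a space rather than an empty string keeps words from adjacent elements from running together.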

Try:

myline = line.gsub(/<[^>]*>|\n|\t/, ' ')
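When assigning the result of a substitution like this, it may help to remember that `String#gsub` returns a new string, whereas `String#gsub!` modifies the receiver in place and returns `nil` if nothing matched. A small sketch of the difference (the helper name `flatten_markup` is hypothetical):

```ruby
# Hypothetical helper: replaces tags, newlines and tabs with single spaces.
def flatten_markup(line)
  line.gsub(/<[^>]*>|\n|\t/, ' ')
end

sample = String.new("plain text")   # explicitly mutable string
p sample.gsub!(/<[^>]*>/, ' ')      # nil, because no tag was found
p flatten_markup(sample)            # "plain text"
p flatten_markup("<p>a\tb</p>")     # " a b "
```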
