
Java Regex or XML parser?

I want to remove any tags such as

<p>hello <namespace:tag : a>hello</namespace:tag></p>

to become

 <p> hello hello </p>

What is the best way to do this? I tried the regex below, but for some reason it is not working. Can anyone help?

(<|</)[:]{1,2}[^</>]>

edit: added

Definitely use an XML parser. Regex should not be used to parse *ML.
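As a small illustration of why the posted pattern does nothing, here is a sketch that runs it against the sample from the question. The class and method names (`QuestionRegexCheck`, `matchesAnywhere`) are made up for this example. The pattern requires one or two colons directly after `<` or `</`, followed by exactly one more character and then `>`, which never occurs in the sample.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class QuestionRegexCheck {
    // The pattern exactly as posted in the question.
    static final Pattern POSTED = Pattern.compile("(<|</)[:]{1,2}[^</>]>");

    // Returns true if the posted pattern matches anywhere in the input.
    static boolean matchesAnywhere(String input) {
        return POSTED.matcher(input).find();
    }

    public static void main(String[] args) {
        String sample = "<p>hello <namespace:tag : a>hello</namespace:tag></p>";
        // No tag in the sample has ':' right after '<' or '</'.
        System.out.println(matchesAnywhere(sample)); // false
        // The pattern only matches odd inputs such as this one.
        System.out.println(matchesAnywhere("<:x>")); // true
    }
}
```

Patching the pattern until it handles this one sample is possible, but it stays fragile for the reasons the answers here give.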

You should not use regex for this purpose; use a parser such as lxml or BeautifulSoup:

>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'

Here is a reason why you should not parse HTML/XML with regex.
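One concrete failure of this kind can be sketched in Java, the language of the question. This is my own illustration, not taken from the original answer: a commonly suggested tag-stripping regex works on trivial input but mangles a tag whose attribute value contains `>`. The class and method names are hypothetical.

```java
// Hypothetical demo class, not part of the original post.
class NaiveRegexPitfall {
    // A fragile approach: delete everything from '<' up to the next '>'.
    static String stripWithRegex(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Works on trivial markup.
        System.out.println(stripWithRegex("<p>hello</p>"));
        // Breaks when an attribute value contains '>': the match ends
        // inside the attribute, so part of the tag leaks into the text.
        System.out.println(stripWithRegex("<a title=\"a > b\">link</a>"));
    }
}
```

The first call prints `hello`. The second prints ` b">link` instead of `link`, because the regex has no notion of quoted attribute values.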

If you're just trying to pull the plain text out of some simple XML, the best approach (fastest, smallest memory footprint) would be to just run a for loop over the data:

PSEUDOCODE BELOW

bool inMarkup = false;
string text = "";
for each character in data // (dunno what you're reading from)
{
    char c = current;
    if( c == '<' ) inMarkup = true;
    else if( c == '>') inMarkup = false;
    else if( !inMarkup ) text += c;
}

Note: This will break if you encounter things like CDATA, JavaScript, or CSS in your input.
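The pseudocode above translates almost line for line into Java. The following is a minimal sketch under the same assumptions (simple markup, no CDATA, scripts, styles, comments, or `>` inside attribute values). The names `TagStripper` and `stripTags` are made up for this example.

```java
// Hypothetical class name; a direct Java rendering of the pseudocode loop.
class TagStripper {
    // Returns every character of data that lies outside <...> markup.
    static String stripTags(String data) {
        boolean inMarkup = false;
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '<') {
                inMarkup = true;       // entering a tag
            } else if (c == '>') {
                inMarkup = false;      // leaving a tag
            } else if (!inMarkup) {
                text.append(c);        // plain text between tags
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String input = "<p>hello <namespace:tag : a>hello</namespace:tag></p>";
        System.out.println(stripTags(input)); // prints "hello hello"
    }
}
```

Note that this removes every tag, including the outer `<p>`, so it yields plain text rather than the exact output asked for in the question. A `StringBuilder` is used instead of repeated string concatenation to keep the loop linear in the input length.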

So, to sum up: if it's simple, do something like the above rather than a regular expression. If it isn't that simple, listen to the other guys and use an advanced parser.

This is a solution I personally used for a similar problem in Java. The library used for this is Jsoup: http://jsoup.org/.

In my particular case I had to unwrap tags that had an attribute with a particular value. You can see that reflected in this code; it's not the exact solution to this problem, but it could put you on your way.

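  // Imports this snippet appears to rely on. They are an assumption: the
  // original post did not list them. StringUtils and Validate are assumed
  // to come from Apache Commons Lang 3.
  import org.apache.commons.lang3.StringUtils;
  import org.apache.commons.lang3.Validate;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Document.OutputSettings;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  // Unwraps matching <tagName> elements: the tag is removed, its children stay.
  // With a blank attribute, every <tagName> is unwrapped. Otherwise only elements
  // whose attribute value is non-blank and becomes blank once everything matching
  // matchRegEx is removed are unwrapped.
  public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
      Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    // Disable pretty printing so Jsoup does not re-indent the output.
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
      if (StringUtils.isBlank(attribute)) {
        element.unwrap();
      } else {
        String attr = element.attr(attribute);
        if (!StringUtils.isBlank(attr)) {
          String newData = attr.replaceAll(matchRegEx, "");
          if (StringUtils.isBlank(newData)) {
            element.unwrap();
          }
        }
      }
    }
    // Returns the whole document, including the html/head/body wrapper Jsoup adds.
    return doc.html();
  }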

