
Regex to Indent an XML File

Is it possible to write a regex (search and replace) that, when run on an XML string, will output that XML string nicely indented?

If so, what's the regex? :)

Doing this would be far, far simpler if you didn't use a regex. In fact I'm not even sure it's possible with regex.

Most languages have XML libraries that would make this task very simple. What language are you using?

Is it possible to write a REGEX (search replace) that when run on an XML string [...anything]

No.

Use an XML parser to read the string, then an XML serialiser to write it back out in 'pretty' mode.

Each XML processor has its own options so it depends on platform, but here is the somewhat long-winded way that works on DOM Level 3 LS-compliant implementations:

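// Assumes `implementation` is a DOMImplementationLS and `unprettyxml` holds the source string.
var input = implementation.createLSInput();
input.stringData = unprettyxml;
var parser = implementation.createLSParser(implementation.MODE_SYNCHRONOUS, null);
var doc = parser.parse(input);  // named `doc` to avoid clashing with a global `document`
var serializer = implementation.createLSSerializer();
// "format-pretty-print" is optional for implementations; canSetParameter() reports whether it is supported.
serializer.domConfig.setParameter("format-pretty-print", true);
var prettyxml = serializer.writeToString(doc);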
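
For comparison, here is a minimal Python sketch of the same parse-then-serialise idea, using the standard library's xml.dom.minidom rather than DOM Level 3 LS. The function name pretty_print and the two-space default indent are illustrative choices, not something from the answer above.

```python
from xml.dom import minidom


def pretty_print(xml_text: str, indent: str = "  ") -> str:
    """Parse an XML string and serialise it back with indentation.

    `pretty_print` is a hypothetical helper name used for illustration.
    """
    document = minidom.parseString(xml_text)   # raises ExpatError on malformed input
    return document.toprettyxml(indent=indent)


if __name__ == "__main__":
    print(pretty_print("<root><a>1</a><b/></root>"))
```

Note that toprettyxml() also prepends an XML declaration line, so the output is not a pure re-indentation of the input.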

I don't know if a regex, in isolation, could do a pretty-print format of an arbitrary XML input. You would need a regex being applied by a program to find a tag, locate the matching closing tags (if the tag is not self-closed), and so on. Using regex to solve this problem is really using the wrong tool for the job. The simplest possible way to pretty print XML is to use an XML parser, read it in, set appropriate serialization options, and then serialize the XML back out.

Why do you want to use regex to solve this problem?

Using a regex for this will be a nightmare. Keeping track of the indentation level based on the hierarchy of the nodes will be almost impossible. Perhaps Perl 5.10's regular expression engine might help, since it's now reentrant. But let's not go down that road... Besides, you will need to take into account CDATA sections, which can embed XML markup that the indentation must ignore and preserve intact.
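
To make the CDATA point concrete, here is a small Python sketch, assuming a deliberately naive tag-matching regex of my own (it is not the regex from any answer here). The regex reports markup inside the CDATA section as if it were a real element, while a parser treats it as plain character data. The sample string and both helper names are made up for illustration.

```python
import re
from xml.dom import minidom

# Hypothetical sample: <fake> lives inside CDATA, so it is text, not an element.
SAMPLE = "<root><![CDATA[<fake>not markup</fake>]]><real/></root>"


def element_names_by_regex(xml_text):
    """Naive approach: anything that looks like an opening tag counts."""
    return re.findall(r"<([A-Za-z][\w\-]*)", xml_text)


def element_names_by_parser(xml_text):
    """Parser approach: only real element nodes are reported."""
    document = minidom.parseString(xml_text)
    return [element.tagName for element in document.getElementsByTagName("*")]


if __name__ == "__main__":
    print(element_names_by_regex(SAMPLE))
    print(element_names_by_parser(SAMPLE))
```

An indenter built on the regex view would start indenting at `<fake>` and corrupt the CDATA content; one built on the parser view never sees it as a tag.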

Stick with DOM. As was suggested in the other answer, some libraries already provide a function that will indent a DOM tree for you. If not, building one will be much simpler than creating and maintaining the regexes that would do the same task.
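
As a rough illustration of how little code a hand-built tree indenter needs, here is a Python sketch using xml.etree.ElementTree instead of a W3C DOM. The function name indent_tree and its parameters are hypothetical; it only inserts whitespace where an element has child elements and leaves existing non-blank text alone.

```python
import xml.etree.ElementTree as ET


def indent_tree(element, level=0, step="  "):
    """Add indentation whitespace to `element` and its descendants, in place."""
    children = list(element)
    if not children:
        return  # leaf element: keep its text exactly as it is
    pad = "\n" + step * (level + 1)
    if not (element.text and element.text.strip()):
        element.text = pad                      # whitespace before the first child
    for child in children:
        indent_tree(child, level + 1, step)
        if not (child.tail and child.tail.strip()):
            child.tail = pad                    # whitespace before the next sibling
    last = children[-1]
    if not last.tail.strip():
        last.tail = "\n" + step * level         # closing tag returns to parent level


if __name__ == "__main__":
    root = ET.fromstring("<a><b><c/></b><d>text</d></a>")
    indent_tree(root)
    print(ET.tostring(root, encoding="unicode"))
```

Recent Python versions (3.9 and later) ship a similar built-in, xml.etree.ElementTree.indent, so the sketch is mainly useful to show that the nesting depth comes for free from the tree.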

The dark voodoo regexp described here works great:
http://www.perlmonks.org/?node_id=261292
Its main advantage over XML::LibXML and similar modules is that it's an order of magnitude faster.

From this link:

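  using System.Text;
  using System.Text.RegularExpressions;

  private static readonly Regex indentingRegex = new Regex(
        // 1. an element that contains only text: <tag attr="v">text</tag>
        @"\<\s*(?<tag>[\w\-]+)(\s+[\w\-]+\s*=\s*(""[^""]*""|'[^']*'))*\s*\>[^\<]*\<\s*/\s*\k<tag>\s*\>" +
        // 2. a comment, declaration, or processing instruction
        @"|\<[!\?]((?<=!)--((?!--\>).)*--\>|(""[^""]*""|'[^']*'|[^>])*\>)" +
        // 3. an opening, closing, or self-closing tag
        @"|\<\s*(?<closing>/)?\s*[\w\-]+(\s+[\w\-]+\s*=\s*(""[^""]*""|'[^']*'))*\s*((/\s*)|(?<opening>))\>" +
        // 4. a run of text
        @"|[^\<]*",
        RegexOptions.ExplicitCapture | RegexOptions.Singleline);

  public static string IndentXml(string xml) {
        StringBuilder result = new StringBuilder(xml.Length * 2);
        int indent = 0;
        for (Match match = indentingRegex.Match(xml); match.Success; match = match.NextMatch()) {
              if (string.IsNullOrWhiteSpace(match.Value))
                    continue;  // skip empty matches and whitespace-only text
              if (match.Groups["closing"].Success)
                    indent--;
              result.AppendFormat("{0}{1}\r\n", new string(' ', indent * 2), match.Value);
              if (match.Groups["opening"].Success && !match.Groups["closing"].Success)
                    indent++;
        }
        return result.ToString();
  }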

This would only be achievable with multiple regexes, which would together behave like a state machine.

What you are looking for is far better suited to an off the cuff parser.
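
To show what "regexes acting as a state machine" might look like in practice, here is a hedged Python sketch: one tokenizing regex splits the input, and a depth counter, kept outside the regex, is the state. Every name here (TOKEN, indent_tokens) is made up for illustration. It assumes attribute values contain no ">" and DOCTYPEs have no internal subset, and it moves text onto its own line, so it changes whitespace in the content; a real parser has none of these limits.

```python
import re

# One alternative per token kind; order matters (comments and CDATA first).
TOKEN = re.compile(
    r"<!--.*?-->"             # comment
    r"|<!\[CDATA\[.*?\]\]>"   # CDATA section, kept as one opaque token
    r"|<\?.*?\?>"             # processing instruction or XML declaration
    r"|<![^>]*>"              # DOCTYPE (assumed to have no internal subset)
    r"|</[^>]+>"              # closing tag
    r"|<[^>]+>"               # opening or self-closing tag
    r"|[^<]+",                # text
    re.DOTALL,
)


def indent_tokens(xml_text, step="  "):
    """Indent XML token by token; the only state is the current depth."""
    depth = 0
    lines = []
    for token in TOKEN.findall(xml_text):
        if not token.strip():
            continue                      # drop whitespace-only text
        if token.startswith("</"):
            depth -= 1                    # transition: leaving an element
        lines.append(step * depth + token.strip())
        opens_element = (
            token.startswith("<")
            and not token.startswith(("</", "<!", "<?"))
            and not token.endswith("/>")
        )
        if opens_element:
            depth += 1                    # transition: entering an element
    return "\n".join(lines)


if __name__ == "__main__":
    print(indent_tokens("<a><b>hi</b><c/></a>"))
```

The regex alone never knows the depth; the surrounding loop does, which is the point being made above.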
