
Efficiently parse the huge string response

I have a service that returns data in the format below. I have shortened it here for readability, but in general the response is quite large. The format is always the same.

process=true
version=2
DataCenter=dc2
    Total:2
    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2}
    obvious:{0=6, 1=7, 2=8, 3=5, 4=6}
    mapping:{3=machineA.dc2.com, 2=machineB.dc2.com}
    Machine:[machineA.dc2.com, machineB.dc2.com]
DataCenter=dc1
    Total:2
    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2, 6=3}
    obvious:{0=6, 1=7, 2=8, 3=5, 4=6, 5=7}
    mapping:{3=machineP.dc1.com, 2=machineQ.dc1.com}
    Machine:[machineP.dc1.com, machineQ.dc1.com]
DataCenter=dc3
    Total:2
    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2}
    obvious:{0=6, 1=7, 2=8, 3=5, 4=6}
    mapping:{3=machineO.dc3.com, 2=machineR.dc3.com}
    Machine:[machineO.dc3.com, machineR.dc3.com]

I am trying to parse the above data and store it in three different Maps.

  • Prime map: Map<String, Map<Integer, Integer>> prime = new HashMap<String, Map<Integer, Integer>>();
  • Obvious map: Map<String, Map<Integer, Integer>> obvious = new HashMap<String, Map<Integer, Integer>>();
  • Mapping map: Map<String, Map<Integer, String>> mapping = new HashMap<String, Map<Integer, String>>();

Below is the description:

  • In the Prime map, the key will be dc2 and the value will be {0=1, 1=2, 2=3, 3=4, 4=1, 5=2}.
  • In the Obvious map, the key will be dc2 and the value will be {0=6, 1=7, 2=8, 3=5, 4=6}.
  • In the Mapping map, the key will be dc2 and the value will be {3=machineA.dc2.com, 2=machineB.dc2.com}.

Similarly for other datacenters as well.

What is the best way to parse the above string response? Should I use regex here or simple string parsing?

public class DataParser {
    public static void main(String[] args) {
        String response = getDataFromURL();
        // here response will contain above string
        parseResponse(response);            
    }

    private static void parseResponse(final String response) {
        // what is the best way to parse the response?
    }   
}

Any example will be of great help.

You can do as ShellFish recommends: split the response on '\\n' and then process each line.

One regex approach looks like the following (it is incomplete, but enough to get you started):

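// Requires imports: java.util.HashMap, java.util.Map, java.util.regex.Matcher, java.util.regex.Pattern
public static void main(String[] args) throws Exception {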
    String response = "process=true\n" +
        "version=2\n" +
        "DataCenter=dc2\n" +
        "    Total:2\n" +
        "    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2}\n" +
        "    obvious:{0=6, 1=7, 2=8, 3=5, 4=6}\n" +
        "    mapping:{3=machineA.dc2.com, 2=machineB.dc2.com}\n" +
        "    Machine:[machineA.dc2.com, machineB.dc2.com]\n" +
        "DataCenter=dc1\n" +
        "    Total:2\n" +
        "    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2, 6=3}\n" +
        "    obvious:{0=6, 1=7, 2=8, 3=5, 4=6, 5=7}\n" +
        "    mapping:{3=machineP.dc1.com, 2=machineQ.dc1.com}\n" +
        "    Machine:[machineP.dc1.com, machineQ.dc1.com]\n" +
        "DataCenter=dc3\n" +
        "    Total:2\n" +
        "    prime:{0=1, 1=2, 2=3, 3=4, 4=1, 5=2}\n" +
        "    obvious:{0=6, 1=7, 2=8, 3=5, 4=6}\n" +
        "    mapping:{3=machineO.dc3.com, 2=machineR.dc3.com}\n" +
        "    Machine:[machineO.dc3.com, machineR.dc3.com]";

    Map<String, Map<Integer, Integer>> prime = new HashMap<>();
    Map<String, Map<Integer, Integer>> obvious = new HashMap<>();
    Map<String, Map<Integer, String>> mapping = new HashMap<>();

    String outerMapKey = "";
    int findCount = 0;
    Matcher matcher = Pattern.compile("(?<=DataCenter=)(.*)|(?<=prime:)(.*)|(?<=obvious:)(.*)|(?<=mapping:)(.*)").matcher(response);
    while(matcher.find()) {
        switch (findCount) {
            case 0:
                outerMapKey = matcher.group();
                break;
            case 1:
                prime.put(outerMapKey, new HashMap<>());
                // Strip the braces and normalize ", " to "," before splitting into key=value pairs
                String group = matcher.group().replaceAll("[\\{\\}]", "").replaceAll(", ", ",");
                String[] groupPieces = group.split(",");
                for (String groupPiece : groupPieces) {
                    String[] keyValue = groupPiece.split("=");
                    // keyValue[0] is the key, keyValue[1] is the value
                    prime.get(outerMapKey).put(Integer.parseInt(keyValue[0]), Integer.parseInt(keyValue[1]));
                }
                break;
            // Add additional cases for obvious and mapping
        }

        findCount++;
        if (findCount == 4) {
            findCount = 0;
        }
    }

    System.out.println("Primes:");
    prime.keySet().stream().forEach(k -> System.out.printf("Key: %s Value: %s\n", k, prime.get(k)));
    // Add additional outputs for obvious and mapping
}

Results:

Primes:
Key: dc2 Value: {0=1, 1=2, 2=3, 3=4, 4=1, 5=2}
Key: dc1 Value: {0=1, 1=2, 2=3, 3=4, 4=1, 5=2, 6=3}
Key: dc3 Value: {0=1, 1=2, 2=3, 3=4, 4=1, 5=2}

References to explain the regex pattern: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

http://www.regular-expressions.info/lookaround.html
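As one possible way to cover all three maps with a regex, here is a minimal sketch that matches each relevant line with a single multiline pattern and capture groups instead of counting matches. The class name RegexResponseParser and its helper methods are hypothetical and not part of the answer above. It assumes every section line has the form name:{...} and that each one follows a DataCenter= line.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch, not the original answer's code.
final class RegexResponseParser {
    // Group 1: datacenter name. Group 2: section name. Group 3: text between the braces.
    private static final Pattern LINE = Pattern.compile(
            "(?m)^[ \\t]*(?:DataCenter=(\\S+)|(prime|obvious|mapping):\\{(.*)\\})[ \\t]*$");

    final Map<String, Map<Integer, Integer>> prime = new HashMap<>();
    final Map<String, Map<Integer, Integer>> obvious = new HashMap<>();
    final Map<String, Map<Integer, String>> mapping = new HashMap<>();

    void parse(String response) {
        String dc = null; // datacenter of the block currently being read
        Matcher m = LINE.matcher(response);
        while (m.find()) {
            if (m.group(1) != null) {          // a "DataCenter=..." header line
                dc = m.group(1);
            } else if (dc != null) {           // a section line inside a datacenter block
                String section = m.group(2);
                if (section.equals("mapping")) {
                    mapping.put(dc, toStringMap(m.group(3)));
                } else if (section.equals("prime")) {
                    prime.put(dc, toIntMap(m.group(3)));
                } else {
                    obvious.put(dc, toIntMap(m.group(3)));
                }
            }
        }
    }

    // Splits "3=a, 2=b" into an Integer -> String map; pieces without '=' are skipped.
    private static Map<Integer, String> toStringMap(String body) {
        Map<Integer, String> map = new HashMap<>();
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                map.put(Integer.parseInt(kv[0].trim()), kv[1].trim());
            }
        }
        return map;
    }

    // Same as toStringMap, with every value converted to an Integer.
    private static Map<Integer, Integer> toIntMap(String body) {
        Map<Integer, Integer> map = new HashMap<>();
        for (Map.Entry<Integer, String> e : toStringMap(body).entrySet()) {
            map.put(e.getKey(), Integer.parseInt(e.getValue()));
        }
        return map;
    }

    public static void main(String[] args) {
        RegexResponseParser p = new RegexResponseParser();
        p.parse("version=2\nDataCenter=dc2\n    Total:2\n    prime:{0=1, 1=2}\n"
                + "    obvious:{0=6}\n    mapping:{3=machineA.dc2.com}\n");
        // prints {dc2={0=1, 1=2}} {dc2={0=6}} {dc2={3=machineA.dc2.com}}
        System.out.println(p.prime + " " + p.obvious + " " + p.mapping);
    }
}
```

Because the section name is captured, the result does not depend on prime, obvious and mapping appearing in a fixed order, which the findCount counter relies on.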

The answer depends on how much you trust the format to be fixed and exact. A very simple approach reads the string line by line and uses minimal string comparison to identify each key:

private static final String DATA_CENTER = "DataCenter=";
private static final int DATA_CENTER_LEN = DATA_CENTER.length();
private static final String PRIME = "    prime:";
private static final int PRIME_LEN = PRIME.length();
// etc.
Map<String, Map<Integer, Integer>> prime = new HashMap<>();
// etc.
String response = "...";
Scanner scanner = new Scanner( response );
while(scanner.hasNextLine()){
    String line = scanner.nextLine();
    if( line.startsWith( DATA_CENTER ) ){
        String dc = line.substring( DATA_CENTER_LEN );
        line = scanner.nextLine(); // skip Total 
        prime.put( dc, str2map(scanner.nextLine().substring(PRIME_LEN)) );
        obvious.put( dc, str2map(scanner.nextLine().substring(OBVIOUS_LEN)) );
        mapping.put( dc, str2mapis(scanner.nextLine().substring(MAPPING_LEN)) );
    }
}

More explicit nextLine() calls would avoid even the test for "DataCenter".

Here are two almost identical methods that strip the braces and build a map:

private static Map<Integer,Integer> str2map( String str ){
    Map<Integer,Integer> map = new HashMap<>();
    str = str.substring( 1, str.length()-1 );
    String[] pairs = str.split( ", " );
    for( String pair: pairs ){
        String[] kv = pair.split( "=" );
        map.put( Integer.parseInt(kv[0]),Integer.parseInt(kv[1]) );
    }
    return map;
}

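private static Map<Integer,String> str2mapis( String str ){
    Map<Integer,String> map = new HashMap<>();
    // Same steps as str2map, but the value is kept as a String
    str = str.substring( 1, str.length()-1 );
    String[] pairs = str.split( ", " );
    for( String pair: pairs ){
        String[] kv = pair.split( "=" );
        map.put( Integer.parseInt(kv[0]),kv[1] );
    }
    return map;
}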

If the leading whitespace might vary, you can stay on the safe side by trimming each line before taking the substring:

private static final String PRIME = "prime:";
// ...
prime.put( dc, str2map(scanner.nextLine().trim().substring( PRIME_LEN )) );

If the sequence or completeness of lines isn't guaranteed, testing may be required:

line = scanner.nextLine().trim();
if( line.startsWith( PRIME ) ){
     prime.put( dc, str2map(line.substring( PRIME_LEN )) ); // use the line already read and trimmed
}

With even less stability or trust in the format, regular expression parsing might be indicated.

I would use simple string parsing in this case, applied to each line. In pseudocode, something like this:

for line in response
    if line matches /^DataCenter/
         key = datacenter name
    else if line matches / *prime/
         prime.put(key, prime value)
    else if line matches / *obvious/
         obvious.put(key, obvious value)
    else if line matches / *mapping/
         mapping.put(key, mapping value)
    else
         getline
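The pseudocode above could be sketched in Java roughly as follows. The class name LineDispatchParser, the helper pairs and the two converter constants are hypothetical names chosen for illustration. The sketch assumes every section value is a well-formed {k=v, k=v} string with ", " between pairs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical class and helper names; a sketch of the per-line dispatch described above.
final class LineDispatchParser {
    private static final Function<String, Integer> TO_INT = Integer::valueOf;
    private static final Function<String, String> AS_IS = s -> s;

    final Map<String, Map<Integer, Integer>> prime = new HashMap<>();
    final Map<String, Map<Integer, Integer>> obvious = new HashMap<>();
    final Map<String, Map<Integer, String>> mapping = new HashMap<>();

    void parse(String response) {
        String dc = null; // name of the datacenter block currently being read
        for (String raw : response.split("\r?\n")) {
            String line = raw.trim();
            if (line.startsWith("DataCenter=")) {
                dc = line.substring("DataCenter=".length());
            } else if (dc == null) {
                continue; // header lines such as process=true come before any block
            } else if (line.startsWith("prime:")) {
                prime.put(dc, pairs(line.substring("prime:".length()), TO_INT));
            } else if (line.startsWith("obvious:")) {
                obvious.put(dc, pairs(line.substring("obvious:".length()), TO_INT));
            } else if (line.startsWith("mapping:")) {
                mapping.put(dc, pairs(line.substring("mapping:".length()), AS_IS));
            }
            // Any other line (Total, Machine) is ignored.
        }
    }

    // Turns "{k=v, k=v}" into a map, converting each value with valueOf.
    private static <V> Map<Integer, V> pairs(String braced, Function<String, V> valueOf) {
        Map<Integer, V> map = new HashMap<>();
        String body = braced.substring(1, braced.length() - 1); // drop '{' and '}'
        if (body.isEmpty()) {
            return map;
        }
        for (String pair : body.split(", ")) {
            int eq = pair.indexOf('=');
            map.put(Integer.valueOf(pair.substring(0, eq)), valueOf.apply(pair.substring(eq + 1)));
        }
        return map;
    }

    public static void main(String[] args) {
        LineDispatchParser p = new LineDispatchParser();
        p.parse("process=true\nDataCenter=dc2\n    Total:2\n    prime:{0=1, 1=2}\n"
                + "    obvious:{0=6}\n    mapping:{3=machineA.dc2.com}\n");
        System.out.println(p.prime + " " + p.obvious + " " + p.mapping);
    }
}
```

Since each line is classified by its prefix, missing or reordered section lines are tolerated, at the cost of a few startsWith checks per line.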

You could optimize here by first checking the first char of the line. If it's anything besides a space or a D, you can go to the next line. If the format is always the same you could even hardcode the lines to parse. In the example you supplied you could do:

skip 2 lines
repeat
    extract datacenter name
    skip 1 line
    extract prime
    extract obvious
    extract mapping
    add above stuff to the maps
    skip 1 line
until EOF

This will be a lot faster but will fail if the format changes.
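A rough Java sketch of the fixed-layout loop above, reading from a Reader so that a large response does not have to be split into an array of lines first. The class name FixedLayoutParser and its helpers are hypothetical. It assumes exactly two header lines followed by complete six-line blocks in the order shown in the question, and it will throw on anything else.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; assumes the exact line order shown in the question.
final class FixedLayoutParser {
    final Map<String, Map<Integer, Integer>> prime = new HashMap<>();
    final Map<String, Map<Integer, Integer>> obvious = new HashMap<>();
    final Map<String, Map<Integer, String>> mapping = new HashMap<>();

    void parse(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        in.readLine(); // skip process=...
        in.readLine(); // skip version=...
        String header;
        while ((header = in.readLine()) != null) {
            if (header.trim().isEmpty()) {
                continue; // tolerate a trailing blank line
            }
            String dc = header.substring(header.indexOf('=') + 1).trim();
            in.readLine(); // skip Total
            prime.put(dc, ints(pairs(in.readLine())));
            obvious.put(dc, ints(pairs(in.readLine())));
            mapping.put(dc, pairs(in.readLine()));
            in.readLine(); // skip Machine
        }
    }

    // Parses the text between the first '{' and the last '}' of a line.
    private static Map<Integer, String> pairs(String line) {
        Map<Integer, String> map = new HashMap<>();
        String body = line.substring(line.indexOf('{') + 1, line.lastIndexOf('}'));
        if (!body.isEmpty()) {
            for (String pair : body.split(", ")) {
                String[] kv = pair.split("=", 2);
                map.put(Integer.valueOf(kv[0]), kv[1]);
            }
        }
        return map;
    }

    // Converts the String values of a parsed map to Integers.
    private static Map<Integer, Integer> ints(Map<Integer, String> raw) {
        Map<Integer, Integer> map = new HashMap<>();
        raw.forEach((k, v) -> map.put(k, Integer.valueOf(v)));
        return map;
    }

    public static void main(String[] args) throws IOException {
        String response = "process=true\nversion=2\nDataCenter=dc2\n    Total:2\n"
                + "    prime:{0=1, 1=2}\n    obvious:{0=6}\n"
                + "    mapping:{3=machineA.dc2.com}\n    Machine:[machineA.dc2.com]\n";
        FixedLayoutParser p = new FixedLayoutParser();
        p.parse(new StringReader(response));
        System.out.println(p.prime + " " + p.obvious + " " + p.mapping);
    }
}
```

If the service is read over HTTP, the same method could be handed an InputStreamReader on the connection's stream instead of a StringReader, so the whole response never has to be held as one String.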

You could use a parser generator such as ANTLR, or you could hand-code the parser. Depending on how much output you have to process and how often, you may find that going to such trouble isn't really worth it, and that just going over each line and parsing it manually (e.g., with regex or indexOf) is sufficient and clear enough.
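To illustrate the indexOf style of manual parsing mentioned above, here is a small sketch that walks one {k=v, k=v} value with indexOf only, without split or regex and so without intermediate arrays. The class name IndexOfPairs is hypothetical. It assumes keys are integers and values contain neither ',' nor '='.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; scans a "{k=v, k=v}" string using indexOf only.
final class IndexOfPairs {
    static Map<Integer, String> parse(String braced) {
        Map<Integer, String> map = new LinkedHashMap<>(); // keeps the order of the input
        int end = braced.lastIndexOf('}');
        int pos = braced.indexOf('{') + 1;
        while (pos < end) {
            int eq = braced.indexOf('=', pos);
            if (eq < 0 || eq > end) {
                break; // no further key=value pair
            }
            int comma = braced.indexOf(',', eq);
            int stop = (comma < 0 || comma > end) ? end : comma;
            int key = Integer.parseInt(braced.substring(pos, eq).trim());
            map.put(key, braced.substring(eq + 1, stop).trim());
            pos = stop + 1; // continue after the comma (or past the closing brace)
        }
        return map;
    }

    public static void main(String[] args) {
        // prints {3=machineA.dc2.com, 2=machineB.dc2.com}
        System.out.println(parse("{3=machineA.dc2.com, 2=machineB.dc2.com}"));
    }
}
```

For the prime and obvious lines the values could be converted with Integer.parseInt in the same loop. Whether this is measurably faster than split for a given response size would need to be benchmarked.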
