
How to parse a JSON log file with a streaming API in Java, then output a tabulated log file

I have a problem at hand where I am trying to parse large log files stored in JSON format, and then tabulate the data and output it as another JSON file. Following is the format of the log files that I am parsing:

{
"timestamp": "2012-10-01T01:00:00.000",
"id": "someone@somewhere.net",
"action": "Some_Action",
"responsecode": "1000"
}

The action here is the action that some user performs, and the response code is the result of that action.

The timestamp and id are actually irrelevant for my tabulation; I am only interested in the action and responsecode fields. There may be tens of thousands of these entries in any given log file, and what I want to do is keep track of all the types of action, the response codes each produces, and their respective numbers of occurrences.

Below would be a sample of the output I am looking to generate.

{"actionName": "Some_User_Action",
"responses": [{"code": "1000", "count": "36"},
              {"code": "1001", "count": "6"},
              {"code": "1002", "count": "3"},
              {"code": "1003", "count": "36"},
              {"code": "1004", "count": "2"}],
"totalActionCount": "83"}

So basically, for each action, I want to keep track of all the different responses it generates and the number of times each occurred. Finally, I want to keep track of the total number of responses for that action.

Currently, I have created a Java class for the output object, in which I plan to store the output data. I am also a little confused about the format in which I should store the array of responses and their respective counts, since the number of response code types varies depending on the action.

Based upon my research, it seems that I will need to parse the JSON using a streaming API. The reason for using a streaming API is mainly the memory overhead a non-streaming API would need, which is likely not feasible given the size of these log files. I am currently considering Jackson or GSON, but I am unable to find any concrete examples or tutorials to get me started. Does anyone know of a good example that I could study, or have any hints on how to go about solving this problem? Thank you!

EDIT: My class definition.

public class Action {

public static class Response {

    private int _resultCode;
    private int _count = 0;

    public Response() {}

    public int getResultCode() { return _resultCode; }
    public int getCount() { return _count; }

    public void setResultCode(int rc) { _resultCode = rc; }
    public void setCount(int c) { _count = c; }

}

private List<Response> responses = new ArrayList<Response>();
private String _name;

// I've left out the getters/setters and helper functions that I will add in after.

}

If I am using Jackson and want to eventually be able to serialize this object easily back into JSON, are there any suggestions on how I should define this class? At the moment I am creating another ArrayList of this Action type in my main() method using List<Action> actions = new ArrayList<Action>();. Would using a HashMap or some other alternative be a better option? Also, would it still allow me to easily serialize to JSON afterwards using Jackson?

OK, to start: with Jackson you can combine data binding with streaming. All you need is a JsonParser (created from a JsonFactory, an instance of which you can get from ObjectMapper or construct directly). You can then advance the stream to the first entry, and from there on just use data binding (ObjectMapper.readValue(...)). This reads only the minimum needed to produce the single value instance you want.

Or even better, use the readValues() method once you reach the array (here LogEntry stands in for a POJO that maps one log record):

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
JsonParser jp = mapper.getFactory().createParser(sourceFile); // sourceFile is the log File
// advance to the start of the array of log entries
while (jp.nextToken() != JsonToken.START_ARRAY) { }
MappingIterator<LogEntry> it = mapper.readValues(jp, LogEntry.class);
while (it.hasNextValue()) {
   LogEntry value = it.nextValue();
   // process it; keep count, whatever
}
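
For instance, the counting step inside that loop could look something like this (a minimal sketch of my own; the nested map and the tallying logic are illustrative, not part of Jackson's API):

// action name -> (response code -> occurrence count)
Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
while (it.hasNextValue()) {
    LogEntry entry = it.nextValue();
    Map<String, Integer> perAction = counts.get(entry.getAction());
    if (perAction == null) {
        perAction = new HashMap<String, Integer>();
        counts.put(entry.getAction(), perAction);
    }
    Integer n = perAction.get(entry.getResponsecode());
    perAction.put(entry.getResponsecode(), n == null ? 1 : n + 1);
}
// the finished tabulation can go straight back out as JSON
mapper.writeValue(new File("summary.json"), counts);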

And for output, you might want to consider the Jackson CSV module: it can write entries using one of the CSV variants, and you can redefine the separators to whatever you like. See the project README for examples.
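
If you go that route, the setup could look roughly like this (a sketch assuming the jackson-dataformat-csv module is on the classpath; Row is a hypothetical POJO with action, code and count fields):

import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

CsvMapper csvMapper = new CsvMapper();
// derive the columns from the POJO, emit a header row, and use tabs as separators
CsvSchema schema = csvMapper.schemaFor(Row.class)
        .withHeader()
        .withColumnSeparator('\t');
// rows is a List<Row> built from the tallies
csvMapper.writer(schema).writeValue(new File("summary.tsv"), rows);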

You can have a look at the Genson library ( http://code.google.com/p/genson/ ); on the wiki page you will find some examples of how to use it. It has provided a streaming model since its first release, and it appears to be the fastest after Jackson; see the benchmarks.

If you want to do something really efficient and with a small memory footprint, use the streaming API directly by instantiating a JsonReader, then use it to read the logged structure and increment your counters.

Otherwise you could use a Genson instance to parse your file directly into Java objects, but in your case I don't think that is the right solution, as it would require you to store all the objects in memory!

Here is a quick example using the streaming API directly. It will not print exactly the structure you are expecting, as that requires more code to count efficiently with your structure:

public static void main(String[] args) throws IOException, TransformationException {
    // action name -> (response code -> occurrence count)
    Map<String, Map<String, Integer>> actions = new HashMap<String, Map<String, Integer>>();
    Genson genson = new Genson();

    ObjectReader reader = genson.createReader(new FileReader("path/to/the/file"));
    while(reader.hasNext()) {
        reader.next();
        reader.beginObject();
        // skip timestamp/id and position the reader on the "action" pair
        String action = readUntil("action", reader);
        // assuming the next name/value pair is responsecode
        reader.next();
        String responseCode = reader.valueAsString();

        Map<String, Integer> countMap = actions.get(action);
        if (countMap == null) {
            countMap = new HashMap<String, Integer>();
            actions.put(action, countMap);
        }

        Integer count = countMap.get(responseCode);
        if (count == null) {
            count = 0;
        }
        count++;
        countMap.put(responseCode, count);

        reader.endObject();
    }

    // for example, if you had 2 different response codes for the same action it will print
    // {"Some_Action":{"1001":1,"1000":1}}
    String json = genson.serialize(actions);
}

// advance the reader until a name/value pair with the given name is found,
// then return its value as a string
static String readUntil(String name, ObjectReader reader) throws IOException {
    while(reader.hasNext()) {
        reader.next();
        if (name.equals(reader.name())) {
            return reader.valueAsString();
        }
    }
    throw new IllegalStateException("no value named '" + name + "' was found");
}
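
To get from that nested map to the exact output structure in the question, one more pass is enough. A minimal sketch (my own addition, reusing the asker's Action/Response classes and assuming the obvious getters/setters they said they left out, plus a hypothetical totalActionCount field):

List<Action> result = new ArrayList<Action>();
for (Map.Entry<String, Map<String, Integer>> e : actions.entrySet()) {
    Action a = new Action();
    a.setName(e.getKey());
    int total = 0;
    for (Map.Entry<String, Integer> r : e.getValue().entrySet()) {
        Action.Response resp = new Action.Response();
        resp.setResultCode(Integer.parseInt(r.getKey()));
        resp.setCount(r.getValue());
        a.getResponses().add(resp);
        total += r.getValue();
    }
    a.setTotalActionCount(total); // hypothetical field mirroring "totalActionCount"
    result.add(a);
}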

You can parse your records one by one, so the memory consumption of the JSON structures shouldn't exceed more than a few kilobytes. Just create

class Something {
    String action;
    int responsecode;
    // do not include the fields you don't need
}

and read one record in each step. Guava's Multiset (e.g. HashMultiset<String>), with its methods add, count, and size, gives you everything you need. In case you run out of memory (because of a huge Multiset), you'll probably need a database instead, but I'd try the simple solution first.
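
A minimal sketch of that approach, assuming the log records are simply written one after another in the file (Gson's lenient mode accepts multiple top-level values in one stream):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

Gson gson = new Gson();
JsonReader reader = new JsonReader(new FileReader("path/to/the/file"));
reader.setLenient(true); // allow several top-level JSON objects in one stream
Multiset<String> counts = HashMultiset.create();
while (reader.peek() != JsonToken.END_DOCUMENT) {
    // bind one record at a time; memory stays bounded
    Something record = gson.fromJson(reader, Something.class);
    counts.add(record.action + "/" + record.responsecode);
}
reader.close();
// counts.count("Some_Action/1000") -> occurrences of that action/code pair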

For the output JSON you may need GSON's TypeAdapter or JsonSerializer . Or as a hack, you can easily generate the output manually.
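
As an illustration of the manual route, Gson's streaming JsonWriter can emit the structure from the question directly. A sketch, where codesForAction is assumed to be a per-action Multiset of response codes built as above:

import com.google.gson.stream.JsonWriter;

JsonWriter writer = new JsonWriter(new FileWriter("summary.json"));
writer.beginObject();
writer.name("actionName").value("Some_User_Action");
writer.name("responses").beginArray();
for (Multiset.Entry<String> e : codesForAction.entrySet()) { // one entry per distinct code
    writer.beginObject();
    writer.name("code").value(e.getElement());
    writer.name("count").value(String.valueOf(e.getCount()));
    writer.endObject();
}
writer.endArray();
writer.name("totalActionCount").value(String.valueOf(codesForAction.size()));
writer.endObject();
writer.close();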
