How do I copy portions of an xml file

Question

I have an xml file that is relatively large. The client sends me a complete file even though I only need portions of the file. As a result, I would like to parse out the parts that I need and process this new file.

Here is a portion of the xml

<Activity>
    <RetailFormat>ABC</RetailFormat>
    <FeedDate>2014-02-06 21:01:10</FeedDate>
    <ActivityId>665507</ActivityId>
    <ActivityTitle>ABC 3.9.14 Hawaii </ActivityTitle>
    <StartDate>2014-03-09</StartDate>
    <EndDate>2014-03-15</EndDate>
    <StartTime>00:00:00</StartTime>
    <EndTime>23:59:59</EndTime>
    <JANumber>0</JANumber>
    <PlanItemNo>0</PlanItemNo>
    <ChannelType>Circular</ChannelType>
    <Version>
    </Version>
</Activity>

I have a list of ActivityIDs that I need to search for. If the ActivityID is in the list, I want to copy the entire Activity into a new file. If not, I want to move on to the next Activity. The end tag is actually several hundred lines down from the start tag. I have not worked with xml other than to manually parse out sections, and I don't know if there is a programmatic way to deal with this issue. I need perhaps 15K lines out of this file, which has 1.3MM lines in it. By limiting the size of the processed file, I can cut my processing time dramatically.

I am looking for the most efficient way to attack this problem. I am fine with doing it manually for a while, but I would prefer to limit it sooner rather than later.

Answer 1

If the file is very big and memory use is a concern, you should use a SAX parser (in your language of choice; add it to your tags). SAX does not work with trees, so you have to rebuild the subtrees yourself while you parse. The advantage is that it doesn't have to load the whole XML into memory. You only store what you really need.

A SAX parser is an event-based XML parser that reads your file sequentially and produces events. Events are handled in methods like startElement(...), startDocument(...), endElement(...), characters(...), etc. To capture the events you care about, you write a handler that implements these methods.

Your handler will have to implement startElement(), characters() and endElement(), and use instance variables to save the data you need between calls (e.g. the current element, an array to store your code fragments, etc.).
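As a minimal sketch of such a handler, here is one way it could look in Python with the standard xml.sax module. The question does not name a language, so Python is only an assumption here. The class name ActivityFilter and the function filter_activities are hypothetical, and the sketch assumes every <Activity> is a direct child of a single root element with <ActivityId> as a direct child of <Activity>. Each Activity is buffered as a list of events and written out only if its ID is in the wanted set.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ActivityFilter(xml.sax.ContentHandler):
    """Buffer one <Activity> at a time; write it only if its ID is wanted."""

    def __init__(self, wanted_ids, out):
        super().__init__()
        self.wanted_ids = wanted_ids
        self.writer = XMLGenerator(out, encoding="utf-8")
        self.depth = 0
        self.events = None   # buffered events of the current <Activity>, else None
        self.in_id = False   # True while inside <ActivityId>
        self.id_text = []

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            # Assumption: a single root element wraps all activities.
            self.writer.startElement(name, attrs)
            self.writer.characters("\n")
        elif self.depth == 2 and name == "Activity":
            self.events = []
            self.id_text = []
        if self.events is not None:
            # Copy attrs: a parser is allowed to reuse the object.
            self.events.append(("start", name, attrs.copy()))
            self.in_id = self.depth == 3 and name == "ActivityId"

    def characters(self, content):
        if self.events is not None:
            self.events.append(("chars", content, None))
            if self.in_id:
                self.id_text.append(content)

    def endElement(self, name):
        self.in_id = False
        if self.events is not None:
            self.events.append(("end", name, None))
            if self.depth == 2:  # closing the buffered <Activity>
                if "".join(self.id_text).strip() in self.wanted_ids:
                    self._replay()
                self.events = None
        elif self.depth == 1:
            self.writer.endElement(name)
        self.depth -= 1

    def _replay(self):
        # Re-emit the buffered events through the XML writer.
        for kind, value, attrs in self.events:
            if kind == "start":
                self.writer.startElement(value, attrs)
            elif kind == "end":
                self.writer.endElement(value)
            else:
                self.writer.characters(value)
        self.writer.characters("\n")


def filter_activities(source, out, wanted_ids):
    """source: file name or file object; out: writable text stream."""
    xml.sax.parse(source, ActivityFilter(set(wanted_ids), out))
```

Memory use stays bounded by the size of one Activity, because the buffer is discarded as soon as its end tag is seen. With real files you would pass the input path as source and an open output file as out.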

If memory is not a problem, you can use DOM or XSLT. With DOM you can use getElementsByTagName("Activity") to retrieve a list of <Activity> subtrees, and then check the <ActivityId> using DOM methods on each subtree. Then you can copy the subtrees you want, adding them to another root, or remove the ones you don't want from the current root.
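The DOM approach can be sketched in a few lines. This example uses Python's standard xml.dom.minidom, again only as an assumed language; the function name keep_wanted_activities and the helper text_of are hypothetical. It takes the "remove the ones you don't want" route and loads the whole document into memory, so it suits files that fit comfortably in RAM.

```python
from xml.dom import minidom


def text_of(node):
    """Concatenate the direct text children of a DOM node."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def keep_wanted_activities(xml_text, wanted_ids):
    """Return the document element as XML, keeping only the wanted activities."""
    doc = minidom.parseString(xml_text)
    # Copy the node list first, since the tree is modified while looping.
    for activity in list(doc.getElementsByTagName("Activity")):
        id_nodes = activity.getElementsByTagName("ActivityId")
        activity_id = text_of(id_nodes[0]).strip() if id_nodes else ""
        if activity_id not in wanted_ids:
            activity.parentNode.removeChild(activity)
    return doc.documentElement.toxml()
```

Note that element names are case-sensitive, so the lookup uses ActivityId exactly as it is spelled in the sample above.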

Using XSLT you can write a template that selects all <Activity> nodes with an XPath expression such as //Activity, checks each one's ActivityId against your list of IDs, and produces a result tree with only the Activity nodes you want.

Tell me which language you are using and I might be able to send you some examples.

solution1
2 ACCEPTED 2014-02-12 16:34:17