
Validating CSV against XSD

We have a flat XSD. Because the data can be very large, we are thinking of storing it in CSV format instead of XML. Assuming we know from the XSD the element type of each record in the CSV, is there a way to validate each CSV record against the XSD using a Java-based XML validator?

The Saxon XSD validator works as a SAX filter, so you can do validation by sending SAX events that represent an XML view of the input. So all you need is a Java program that reads the CSV file and emits SAX events representing its content, where the SAX events are piped into the XSD validator.
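A minimal sketch of this idea follows. It uses the JDK's built-in `javax.xml.validation.ValidatorHandler` in place of Saxon's validator, so it is not Saxon-specific code. The `person` schema, the header row naming the child elements, and the naive comma split (no quoting support) are assumptions made for illustration. Each row is sent to the validator as its own small document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.AttributesImpl;

class CsvSaxValidator {

    // Hypothetical "flat" schema: one <person> record with two simple children.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='person'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='age' type='xs:nonNegativeInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    /** Reads CSV (first line = child element names) and returns one message per validation error. */
    static List<String> validate(Reader csv, Schema schema, String rowElement) {
        List<String> problems = new ArrayList<>();
        int[] row = {0}; // current data row, shared with the error handler
        ValidatorHandler vh = schema.newValidatorHandler();
        vh.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) {
                problems.add("row " + row[0] + ": " + e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        AttributesImpl none = new AttributesImpl();
        try (BufferedReader in = new BufferedReader(csv)) {
            String header = in.readLine();
            if (header == null) return problems;
            String[] names = header.split(",", -1);
            String line;
            while ((line = in.readLine()) != null) {
                row[0]++;
                String[] cells = line.split(",", -1); // naive: no quoted fields
                if (cells.length != names.length) {
                    problems.add("row " + row[0] + ": expected " + names.length
                            + " cells, found " + cells.length);
                    continue;
                }
                // Emit the SAX events for <rowElement><col>value</col>...</rowElement>.
                vh.startDocument();
                vh.startElement("", rowElement, rowElement, none);
                for (int i = 0; i < names.length; i++) {
                    vh.startElement("", names[i], names[i], none);
                    char[] text = cells[i].toCharArray();
                    vh.characters(text, 0, text.length);
                    vh.endElement("", names[i], names[i]);
                }
                vh.endElement("", rowElement, rowElement);
                vh.endDocument();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String csv = "name,age\nAda,36\nBob,abc\n";
        validate(new StringReader(csv), compile(XSD), "person").forEach(System.out::println);
    }
}
```

No XML text is ever built here: the rows go straight from the reader into validator events, so memory use stays flat regardless of file size.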

One way of doing it would be to do the following:

  • Use the JAXB compiler to create Java classes out of your XSD
  • Use a product similar to Flatworm to automatically/declaratively parse your records (or the whole file) into the Java classes created above, or simply by hand, etc.
  • Use an approach such as the one posted here on SO to validate your graph. Just make sure you cache appropriately, i.e. reuse the validator and the JAXBContext object.

Given the nature of the ask, the overhead incurred by marshalling to XML, even as a JAXBSource, is inevitable. What you can do is make the best of it. If CPU is not a constraint, you could parallelize to increase throughput (you will need one validator per thread; JAXBContext was thread-safe the last time I used it). I would also avoid loading the whole file. It may be tempting to think that a single XSD covering all the records (where the element matching a record is a particle with maxOccurs="unbounded") would be a more efficient way of validating, but with large files you will most likely run out of memory.
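The one-validator-per-thread idea can be sketched as follows. This is an illustration under assumptions, not the JAXB pipeline itself: records are represented as small XML strings rather than marshalled JAXB objects, and the `person` schema and class names are hypothetical. It relies on the documented JDK contract that a compiled `Schema` is thread-safe while a `Validator` is not.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class ParallelRecordValidation {

    // Hypothetical one-record schema used only for this illustration.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='person'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='age' type='xs:nonNegativeInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    /** Returns null if the record is valid, otherwise the first validation message. */
    static String check(ThreadLocal<Validator> validators, String recordXml) {
        Validator validator = validators.get(); // this thread's own instance
        try {
            validator.validate(new StreamSource(new StringReader(recordXml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Validates every record on a fixed pool; results are in input order. */
    static List<String> validateAll(List<String> records, Schema schema, int threads) {
        // The Schema is shared; each worker thread lazily creates and reuses one Validator.
        ThreadLocal<Validator> validators = ThreadLocal.withInitial(schema::newValidator);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String record : records) {
                Callable<String> task = () -> check(validators, record);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> records = List.of(
            "<person><name>Ada</name><age>36</age></person>",
            "<person><name>Bob</name><age>abc</age></person>");
        System.out.println(validateAll(records, compile(XSD), 2));
    }
}
```

For a genuinely large file, the records would be fed to the pool in bounded batches rather than collected into one list, in keeping with the advice above not to load the whole file.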

For large volumes of data, using XSD could be called elegant, but it is not that efficient. For anyone who runs into this post while looking for a .NET solution, validating individual fields by calling XmlSchemaDatatype.ParseValue is far more efficient (assuming the XSD has no cross-field constraints, etc.).

What do you mean by "flat XSD" and "element type of each record"? Obviously some conversion or adaptation process is involved in feeding a non-XML format to a validator expecting XML input. Therefore, all the relevant names must be available.

In particular, unless you have an extra column (typically at the beginning of a row) you will not have the room to encode the name of the element corresponding to the entire row. This is regardless of whether the names of the other columns in the first row are of child elements (superior) or of attributes (inferior).

Then, assuming that this name is available to the adapter, what does your "flat XSD" look like? If this element is the root or top-level element of your schema (i.e. the schema describes only one row), then you will have to extend the schema with a new top-level element that acts as a container for a sequence of rows, which is what your CSV file is. In other words, rather than validating each row as a separate XML document, you should validate the entire CSV file, converted or otherwise adapted, as a single XML document.

If your validator can take piped input, then a CSV to XML converter written in any convenient scripting language is all you need.
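A converter of this kind might look like the sketch below, written in Java to match the question rather than in a scripting language. It wraps all rows in a single container element so the output is one XML document. The names `people` and `person`, the header row supplying the child element names, and the naive comma split without quoting support are assumptions for illustration; the schema would need a matching top-level container element.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class CsvToXml {

    /** Streams CSV (first line = child element names) to XML, one row element per line. */
    static void convert(Reader csv, Writer out, String container, String rowElement) {
        try (BufferedReader in = new BufferedReader(csv)) {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement(container);
            String header = in.readLine();
            String[] names = header == null ? new String[0] : header.split(",", -1);
            String line;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",", -1); // naive: no quoted fields
                xml.writeStartElement(rowElement);
                // Missing trailing cells are simply omitted; the validator will flag them.
                for (int i = 0; i < names.length && i < cells.length; i++) {
                    xml.writeStartElement(names[i]);
                    xml.writeCharacters(cells[i]); // escapes &, < and >
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Acts as a filter: CSV on stdin, XML on stdout, ready to pipe into a validator.
    public static void main(String[] args) {
        Reader in = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        convert(in, out, "people", "person");
    }
}
```

Because rows are written as they are read, the converter never holds more than one line in memory.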
