
How can I skip sync markers when comparing two Avro files that contain the same data?

Could somebody please suggest how I can compare two Avro files that contain identical data? My application serializes DB data (which is presumably static) to Avro on a daily basis. The intention is to compare newly generated files with their previous versions. This is driven by Java. Currently I compare the files row by row, which suits my needs almost perfectly. The only problem is that Avro Object Container Files contain a 16-byte sync marker at the end of the file header and at the end of each file data block. These sync markers are generated automatically for each new Avro file. An example of an Avro file taken from the web is below:

Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc:":"A basic schema for storing Twitter messages"}ì7ê,Hz[ÅìÈÈmigunoFRock: Nerf paper, scissors is fine.²žî
BlizzardCSFWorks as intended.  Terran is IMBA.âóî
ì7ê,Hz[ÅìÈ

As can be seen, ì7ê,Hz[ÅìÈ is the sync marker, and it is what breaks my logic: two Avro files created from the same data are not identical.

When writing Avro files with DataFileWriter, you can specify the sync marker yourself in the create method. If your application uses the same fixed sync marker on every run, the files should be identical as long as the objects haven't changed.
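A minimal sketch of that idea, assuming an Avro release whose DataFileWriter has the `create(Schema, OutputStream, byte[])` overload (older releases may lack it). The class name, the `FIXED_SYNC` constant, and the three-field schema are made up for illustration. Any 16 bytes will do for the marker, provided every run uses the same ones.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class FixedSyncMarkerExample {

    // Hypothetical constant: exactly 16 bytes, reused on every run.
    public static final byte[] FIXED_SYNC =
            "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    // Illustrative schema, loosely modelled on the sample file in the question.
    public static final Schema SCHEMA = SchemaBuilder.record("Tweet")
            .namespace("example")
            .fields()
            .requiredString("username")
            .requiredString("tweet")
            .requiredLong("timestamp")
            .endRecord();

    // Writes a single record into an in-memory Avro container file.
    public static byte[] serialize(String username, String tweet, long timestamp)
            throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("username", username);
        record.put("tweet", tweet);
        record.put("timestamp", timestamp);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            // The third argument replaces the automatically generated sync marker.
            writer.create(SCHEMA, out, FIXED_SYNC);
            writer.append(record);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] first = serialize("miguno", "Rock: Nerf paper, scissors is fine.", 1366150681L);
        byte[] second = serialize("miguno", "Rock: Nerf paper, scissors is fine.", 1366150681L);
        // With the same marker, schema, codec and records, the bytes should match.
        System.out.println("identical: " + Arrays.equals(first, second));
    }
}
```

With the marker fixed, a plain byte comparison of yesterday's and today's files can stand in for the row-by-row check, as long as the schema, the codec and the record order also stay the same between runs.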
