Generating verifiable random numbers - Java

I am trying to validate a proprietary database (actually a file system, but I want to keep this discussion simple). The database has the following properties:

It can have either 1 or 2 primary keys, and they MUST be integers. The other columns can be string (non-ASCII permitted), integer, long, or datetime.

I want to validate that the values I ask this database to store are correctly stored with a large number of records (> 500k records). So for this, I want to extend a tool that generates data that I can easily validate later.

So basically, say this is the sample schema:

pk1 (int - primary key)
pk2 (int - primary key)
s1 (string)
l1 (long)
i1 (int)

I want to generate 500k records with this tool. Then, at any given time, I want to be able to sanity check a given record. I might perform a series of operations (say, back up and then restore the database), and then "spot check" a few records. So I want to be able to quickly validate that the record for primary key (pk1 = 100, pk2 = 1) is valid.

What is the best way to generate the values for each column so that they can be easily validated later? The values need not be fully random, but they should not repeat frequently either, so that some of the compression logic gets exercised too.

As an example, say "somehow" the tool generated the following value for a row:

pk1 = 1000
pk2 = 1
s1 = "foobar"
l1 = 12345
i1 = 17

Now I perform several operations, and at the end I want to validate that this row has not been corrupted. Given pk1=1000 and pk2=1, I have to be able to quickly regenerate the expected values for s1, l1, and i1 so the row can be validated.

Ideas?

(I can't post an answer to my own question since I am a new user, so I am adding this here:) OK, so I have two possible approaches I could pursue:

Approach #1: Use HASH(tablename) ^ HASH(fieldname) ^ pk1 ^ pk2 as the seed. This way, I can easily compute the seed for each column when validating. On the flip side, this could be expensive when generating data for lots of rows, since the seed needs to be computed once per column. So for the above schema, I would have 500k*3 seeds (to generate 500k records).
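A minimal sketch of approach #1, under some assumptions: the class and method names are made up for illustration, String.hashCode() stands in for HASH, and java.util.SplittableRandom is used because it takes a full 64-bit seed. One deliberate deviation from the formula above: pk1 and pk2 are packed into a single long instead of being XOR-ed together, because a plain pk1 ^ pk2 gives the same seed for (1, 2) and (2, 1).

```java
import java.util.SplittableRandom;

// Hypothetical helper: every column value is a pure function of
// (table name, field name, pk1, pk2), so it can be regenerated at any time.
final class SeededColumns {

    // Packs the two keys into one long and mixes in the two name hashes.
    // For a fixed table and field, distinct (pk1, pk2) pairs give distinct seeds.
    static long seed(String table, String field, int pk1, int pk2) {
        long keys = ((long) pk1 << 32) | (pk2 & 0xFFFFFFFFL);
        long names = ((long) table.hashCode() << 32) | (field.hashCode() & 0xFFFFFFFFL);
        return keys ^ names;
    }

    static int intValue(String table, String field, int pk1, int pk2) {
        return new SplittableRandom(seed(table, field, pk1, pk2)).nextInt();
    }

    static long longValue(String table, String field, int pk1, int pk2) {
        return new SplittableRandom(seed(table, field, pk1, pk2)).nextLong();
    }

    // Length is 1..maxLen (maxLen must be positive); characters are drawn
    // from U+00C0..U+024F, so the string is entirely non-ASCII.
    static String stringValue(String table, String field, int pk1, int pk2, int maxLen) {
        SplittableRandom rnd = new SplittableRandom(seed(table, field, pk1, pk2));
        int len = 1 + rnd.nextInt(maxLen);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) rnd.nextInt(0x00C0, 0x0250));
        }
        return sb.toString();
    }
}
```

To spot check the row (pk1 = 1000, pk2 = 1), read it back and compare the stored i1 with intValue("mytable", "i1", 1000, 1), and likewise for s1 and l1 ("mytable" is a placeholder table name). Each seed costs only a few integer operations, so 500k*3 seeds should not be a bottleneck in practice.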

Approach #2 (proposed by Philipp Wendler): Generate one seed per row, and store the seed in the first column of that row. If the first column is an int or long, store the value as-is. If the first column is a string, store the seed in the first x bytes, and then pad it up to the required string length with characters generated using that seed.
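A rough sketch of approach #2 for the sample schema, where the first non-key column (s1) is a string. Everything here is an assumption made for illustration: the class and method names, the choice of 16 hex digits as the "first x bytes", the a-z padding alphabet, and the order in which l1 and i1 are drawn from the generator.

```java
import java.util.SplittableRandom;

// Hypothetical helper: the row carries its own seed in the first 16
// characters of s1; every other value is derived from that seed.
final class SeedInRow {
    static final int SEED_CHARS = 16; // a 64-bit seed written as 16 hex digits

    static final class Row {
        final String s1;
        final long l1;
        final int i1;

        Row(String s1, long l1, int i1) {
            this.s1 = s1;
            this.l1 = l1;
            this.i1 = i1;
        }
    }

    // s1Length should be at least SEED_CHARS; shorter requests still
    // produce the 16-character seed prefix.
    static Row generate(long seed, int s1Length) {
        SplittableRandom rnd = new SplittableRandom(seed);
        long l1 = rnd.nextLong();
        int i1 = rnd.nextInt();
        StringBuilder sb = new StringBuilder(String.format("%016x", seed));
        while (sb.length() < s1Length) {
            sb.append((char) rnd.nextInt('a', 'z' + 1)); // padding from the same seed
        }
        return new Row(sb.toString(), l1, i1);
    }

    // Recovers the seed from s1, regenerates the row, and compares.
    static boolean verify(String s1, long l1, int i1) {
        if (s1.length() < SEED_CHARS) {
            return false;
        }
        long seed;
        try {
            seed = Long.parseUnsignedLong(s1.substring(0, SEED_CHARS), 16);
        } catch (NumberFormatException e) {
            return false; // the seed prefix itself is damaged
        }
        Row expected = generate(seed, s1.length());
        return expected.s1.equals(s1) && expected.l1 == l1 && expected.i1 == i1;
    }
}
```

Note that this only shows a row is consistent with its own seed. It would not notice an intact row stored under the wrong primary key, unless the seed is itself derived from pk1 and pk2.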

I like approach #2 better because there is just one seed per row - making the data generation somewhat faster than approach #1.

You could just generate arbitrary random data, calculate a hash code (MD5, for example, since it doesn't need to be cryptographically secure), and store the hash code with your data. You can have a separate column for the hash code, or you can, for example, append it to any string column.

For verifying, separate the stored hash code from the rest of the data in that row, re-calculate the hash code and compare them for equality. If they don't match, your data was modified.

This assumes that you want to protect your data only from accidental modifications (not from a malicious attacker).
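A small sketch of this idea using the standard java.security.MessageDigest API. The class name, the method names, and the byte layout fed to the digest (a length-prefixed UTF-8 string followed by the long and the int) are assumptions for illustration; any fixed, unambiguous encoding of the row would do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes an MD5 over the data columns of a row
// so the digest can be stored alongside the row and re-checked later.
final class RowChecksum {

    static String md5Hex(String s1, long l1, int i1) {
        byte[] s = s1.getBytes(StandardCharsets.UTF_8);
        // The length prefix keeps the encoding unambiguous.
        byte[] data = ByteBuffer.allocate(4 + s.length + 8 + 4)
                .putInt(s.length)
                .put(s)
                .putLong(l1)
                .putInt(i1)
                .array();
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Recomputes the hash from the stored values and compares it with
    // the hash that was stored when the row was written.
    static boolean verify(String s1, long l1, int i1, String storedHash) {
        return md5Hex(s1, l1, i1).equals(storedHash);
    }
}
```

When generating data, store md5Hex(s1, l1, i1) in a spare column or append it to a string column; when spot checking, split it off again and call verify with the values read back.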

This answers only the second part of your question: what about making l1 store a hash of all the other fields? Then you can quickly verify whether anything has been corrupted.


 