
Archiving differences between a time sequence of text files

There is a sensor network from which I download measurements every ten minutes or on demand. Each download is a text file consisting of several lines with a timestamp and values. The name of the text file also contains a timestamp of when the download occurred. So as time progresses I collect a lot of text files, which form a sequence. Because of the nature of the physical parameters being measured, there are little to no differences between adjacent text files.

I want to archive all of the downloaded text files into a single (compressed) file, in an efficient way, so I thought that archiving the differences between adjacent text files would be one such way.

I would like some ideas for working this out in Bash, using well-known tools like tar and diff. I also know about git, but it is not useful for creating an archive file.

I will try to clarify a bit. Each text file consists of several lines in the following space-separated format:

timestamp sensor_uuid value_1 ... value_N

Not every line has exactly the same number of values (say N), but there is little variation in tokens per line. The values themselves also vary little over time. Since the values come from sensors, with a single sensor per line, the number of lines in a text file depends on how many responses I got for each call. Zero lines is possible.
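
For illustration, two hypothetical lines (the timestamps, UUIDs and values below are made up) could look like this:

1576078243 3f2a9b1c-0d4e-4b7a-9c1d-2e5f6a7b8c9d 21.4 48.0 1012.6
1576078244 7de41c02-88f3-4a6e-b1a0-5c9d3e2f1a4b 21.5 48.1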

Finally, the text filename carries its own timestamp, a concatenation of a base name with a date-time string:

sensors_2019-12-11_153043.txt for today's 15:30:43 request.

Needless to say, the timestamps in the lines of such a file are usually earlier than the filename's timestamp, and some lines and timestamps may even be repeated from text files created earlier.

So my idea for efficient archiving is to put the first text file into the archive and then add only the updates, i.e. the differences between adjacent text files, which eventually trace back to the first text file actually archived. But on retrieval I need to get back a complete text file, as if the file itself had been archived rather than its difference from the past.
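
A minimal sketch of this idea in Bash, using diff to store deltas and patch to rebuild complete files on retrieval. It assumes the download files sort chronologically by name (sensors_YYYY-MM-DD_HHMMSS.txt does), and the names deltas/ and sensors_archive.tar.gz are just illustrative:

#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob

archive() {
    mkdir -p deltas
    local files=(sensors_*.txt) i
    cp "${files[0]}" deltas/                   # first snapshot kept in full
    for ((i = 1; i < ${#files[@]}; i++)); do
        # diff exits with status 1 when the files differ, so mask it
        diff -u "${files[i-1]}" "${files[i]}" > "deltas/${files[i]}.diff" || true
    done
    tar -czf sensors_archive.tar.gz -C deltas .
}

restore() {    # usage: restore <filename as originally downloaded>
    local target=$1 workdir d
    workdir=$(mktemp -d)
    tar -xzf sensors_archive.tar.gz -C "$workdir"
    local first=("$workdir"/sensors_*.txt)     # the only full snapshot
    cp "${first[0]}" "$target"
    if [[ ${first[0]##*/} != "$target" ]]; then
        # replay the diffs in order until the requested file is reached
        for d in "$workdir"/sensors_*.txt.diff; do
            patch -s "$target" "$d"
            [[ ${d##*/} == "$target.diff" ]] && break
        done
    fi
    rm -rf "$workdir"
}

After running archive in the download directory, a call like restore sensors_2019-12-11_153043.txt replays every diff up to that download and leaves the complete file in the current directory.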

tar takes in the whole text files, and a couple of differences between the text files' lines do not produce a repeating pattern suitable for strong compression.

A compressed tar archive (for example, tar with gzip via -z) already identifies repeating patterns and compresses them. But if you want to eliminate the repeated parts yourself, you can use the diff command with some simple manipulation of its output, and then redirect everything to a file.

Let's say we have two files, file1.txt and file2.txt. You can use this command line to get only the lines added in the second file (file2.txt):

diff -u file1.txt file2.txt | grep '^+' | grep -v '^+++' | sed 's/^+//'

(grep '^+' keeps the added lines, grep -v '^+++' drops the +++ header line of the unified diff, and sed strips the leading + sign.)

Then we just need to redirect the output either to the same file (for example file2.txt) or to another file, and then delete file2.txt before the tar operation, as in the sketch below.
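
Putting those steps together, a hedged sketch (file names are illustrative; it walks the sequence backwards so each diff still compares against the previous file's original content):

#!/usr/bin/env bash
set -euo pipefail

files=(sensors_*.txt)
# Walk backwards so each comparison still sees the previous file intact
for ((i = ${#files[@]} - 1; i >= 1; i--)); do
    # keep only the lines added relative to the previous download
    diff -u "${files[i-1]}" "${files[i]}" \
        | grep '^+' | grep -v '^+++' | sed 's/^+//' > "${files[i]}.add" || true
    mv "${files[i]}.add" "${files[i]}"
done

tar -czf sensors_added_only.tar.gz "${files[@]}"

Note that keeping only the added lines discards deletions, so a file from which lines disappeared cannot be reconstructed exactly; the full unified diffs used in the earlier patch-based sketch preserve both directions.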
