
Archiving differences between a time sequence of text files

There is a sensor network from which I download measurements every ten minutes or on demand. Each download is a text file consisting of several lines, each with a timestamp and values. The name of the text file also contains a timestamp of when the download occurred. So as time progresses I collect a lot of text files, which form a sequence. Because of the physical parameters the values are taken from, there are little to no differences between adjacent text files.

I want to archive all of the downloaded text files into a (compressed) file in an efficient way, and archiving only the differences between adjacent text files seems like one such way.

I would like some ideas for working this out in Bash, using well-known tools like tar and diff. I also know about git, but it is not useful for creating an archive file.

Let me clarify a bit. A text file consists of several lines in the following space-separated format:

timestamp sensor_uuid value_1 ... value_N

Not every line has exactly the same number of values (say N), but there is little variation in tokens per line. The values themselves also vary little over time. Since they come from sensors, with a single sensor per line, the number of lines in a text file depends on how many responses I got for each call. Zero lines is possible.

Finally, the text filename carries its own timestamp, a concatenation of an original name with a date-time string:

sensors_2019-12-11_153043.txt for today's 15:30:43 request.

Needless to say, the timestamps in the lines of such a file are usually earlier than the filename's, and lines and timestamps may even be repeated from text files created earlier.

So my idea for efficient archiving is to put the first text file into the archive and then add only the updates, i.e. the differences between adjacent text files, each of which eventually traces back to the first text file actually archived. But on retrieval I need to get back a complete text file, as if the file itself had been archived rather than its difference from the past.
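A minimal sketch of this delta idea using plain diff and patch could look like the following. The directory layout and the full.txt / NNN.patch names are illustrative only, and GNU diff and patch are assumed:

```shell
#!/usr/bin/env bash
# Sketch: store the first snapshot whole, every later one as a unified
# diff against its predecessor; rebuild any snapshot by replaying the
# patches with `patch`. File names here are illustrative, not standard.
set -u

# Build the delta archive from snapshot files given oldest-first.
make_delta_archive() {
  local dir=$1; shift
  mkdir -p "$dir"
  cp "$1" "$dir/full.txt"
  local prev=$1 i=1 f
  shift
  for f in "$@"; do
    # diff exits 1 when the files differ, which is the expected case
    diff -u "$prev" "$f" > "$dir/$(printf '%03d' "$i").patch"
    prev=$f
    i=$((i + 1))
  done
}

# Reconstruct snapshot k (0 = the first file) into $out by replaying
# patches 1..k on top of the full copy.
reconstruct() {
  local dir=$1 k=$2 out=$3 i p
  cp "$dir/full.txt" "$out"
  for ((i = 1; i <= k; i++)); do
    p=$dir/$(printf '%03d' "$i").patch
    # identical adjacent snapshots leave an empty patch; skip those
    if [ -s "$p" ]; then
      patch -s "$out" "$p"
    fi
  done
}
```

For the real data this might be called as `make_delta_archive archive sensors_*.txt && tar czf archive.tar.gz archive`; with this filename scheme, lexicographic glob order is also chronological order.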

Tar takes in the whole text files, and a couple of differences between the text files' lines do not produce a repeatable pattern suitable for strong compression.

The tar command with a compression option (e.g. tar -czf) already identifies repeating patterns across the files and compresses them. But if you want to eliminate the parts that are repeated, you can use the diff command with some simple manipulation of its output, and then redirect everything to a file.

Say we have two files, "file1.txt" and "file2.txt". You can use this command line to get only the lines added in the second file (file2.txt):

diff -u file1.txt file2.txt | grep '^+' | grep -v '^+++' | cut -c2-

Then we just need to redirect the output, either over the same file (e.g. file2.txt) or to another file, deleting file2.txt before the tar operation.
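Putting this answer's approach together, here is a hedged sketch that keeps the first download in full and shrinks every later file to just its added lines before tarring (the function names and the shrunk/ directory are my own). One caveat worth noting: added lines alone cannot record deletions or replaced lines, so exact reconstruction is only possible if the files only ever grow.

```shell
#!/usr/bin/env bash
# Sketch of the additions-only scheme; GNU diff is assumed, and all
# names (shrink_to_additions, shrunk/, etc.) are illustrative.
set -u

# Write only the lines that $2 adds over $1 into $3.
shrink_to_additions() {
  # keep added lines (^+), drop the "+++ file" header, strip the marker
  diff -u "$1" "$2" | grep '^+' | grep -v '^+++' | cut -c2- > "$3"
}

# Shrink a whole series given oldest-first; the first file stays intact.
shrink_series() {
  local dir=$1; shift
  mkdir -p "$dir"
  cp "$1" "$dir/"
  local prev=$1 f
  shift
  for f in "$@"; do
    shrink_to_additions "$prev" "$f" "$dir/$(basename "$f")"
    prev=$f
  done
}
```

Usage would then be something like `shrink_series shrunk sensors_*.txt && tar czf archive.tar.gz shrunk`.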
