How to read a binary file that contains some text strings using a shell script?

Question

I have a file named 142490.1 whose content looks like this:

^A^A^@^@^@=^@^@=y^B^@e^A^C^@f^B^HÂ¬^\ÂA^Y^A^G^B<81>s
^A^@G@client.1424906160996.30431.DC1.5faa5c2a-c382-40b8-baa8-234a8e6ecd19^@^@^A^F<8b>f^@Ã¸^@y^@^@^AKÃ^F<86>T^@^@^@ÃªÃµ^A\^@^R304344351^N2047675^@^D77^@^Y^W^B^@
27.99^@^X261449949761^@Ã^O^@<92>^NICHOLSON Baseball     ^V|t -S M L XL XXL(2)^@
15724^@
63862^U^GÃ°V11450^@^B7^@<9a>^A^@^L823196^@Â¨<99>Â´Â°Ã¸R^B^@^TBj%2FRZUw*^@^PBoZf8jU*^@^T1032869222^B^@&LH_DefaultDomain_77^@^@^A^@^@H@client.1424906160992.116975.DC1.344073e8-93f6-487c-b343-7923080f07aa^@^@^AKÃ^F<8b>f^@Â^@y^@^@^AKÃ^EÃ²<9f>Â£^AX^@^T1169755138^N2047935^@^B3.^W^@Ã°^?^B^@^H0.99^@^X171689807229^B^@rTOPSHOP LEATHER 3 EU 36^B^B^@
45333^B^B^@^F^@^L161103^@Ã°ÃÂ¯Â°Ã¸R^B^B^@^PBosZQlE*^B^B^B^@^@^A^@^@G@client.1424906160976.1295684.DC1.66a6ca77-30ee-4d50-b7ea-4a524eb94af1^@^@^AKÃ^F<8b>f^@Â¤^@y^@^@^AKÃ^F<89>^O^@^@^@<96><9a>^AT^@^R129569484^N2047935^@^B3^]^V^B^@^F499^853759648^B^@bWILLIS AND^B^B^@
20489^B^B^@^F^@^P-1404420^@<9e>Â¤Â´Â°Ã¸R^B^B^@^PBop4ml0*^B^B^B^@^@^A^@^@H@client.1424906160989.104826.DC1.4d58c06a-3526-408a-a48b-8bdc82b94dba^@^@^AKÃ^F<8b>f^@Â¨^@R^@^@^AKÃ^F<83>Â¶^@^@^@<9a>Â·^AX^@^T1048328026^N2045573^@^B0.^W^@^P^B^B^^AÃ°@^@^H6000^@^Z1955 corvette^@Ã¬<8e>Â´Â°Ã¸R^B^@^PBiZzFm8*^@^PBoO8YKc*^@^@^A^@

I know the file content above looks mostly binary, but it contains some strings that can be read clearly.

If you look at the file content above, you will see a string like this:

@client.1424906160996.30431.DC1.5faa5c2a-c382-40b8-baa8-234a8e6ecd19

In the above string, 1424906160996 is a timestamp.

Problem statement:

I need to find all the strings which start with @client and whose timestamp is one minute older than the current timestamp.

For example, if the strings below start with @client and have a timestamp one minute older than the current timestamp, then the script should print them like this after reading the file:

@client.1424906161996.3031.DC1.5faaa-c382-40b8-baa8-234a8ed19
@client.1424906162996.3041.DC1.5a5c2a-c382-40b8-baa8-238e6ec9
@client.1424906163996.3043231.DC1.5faa2a-c382-40b8-baa8-23e6ed19
@client.1424906164996.3016731.DC1.5faa5a-c382-40b8-baa8-234ad19

Is there any way to do this with a shell script that reads the above file and prints the strings which start with @client and whose timestamp is more than one minute old?

I am running Ubuntu 12.04.

Answer 1

The simplest way to extract the data is by using the strings utility, telling it to scan the whole file, e.g.,

strings - inputfile | egrep '@client(\.[[:xdigit:]]+)+(-[[:xdigit:]]+)+'

but as noted in the other example, there is still the timestamp to consider. That can be done by piping the raw data through awk, e.g.,

awk '/@client/ { ts = $0; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts >= '$TS' - 60 && ts < '$TS' ) { print $0; } }'

where $TS is the value that you are looking for (a range makes more sense than equality).

Actually the egrep is redundant (awk/mawk/gawk can do character classes unless you're using the obsolete version from Ubuntu). But it helps to break the process into stages to check that they work. In the awk script,

it starts with a simple pattern /@client/ (I'm not certain strings will return this at the beginning of a line), but then
assign the line contents $0 to a variable which I can modify,
trim off the part through "@client."
trim off the part beginning with "." (is that milliseconds?)
compare the value to the $TS variable (passed in as part of the script, though another recent posting reminds us that awk's "-v" option would work too).
if it passes the comparison, print the original line
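
The steps above can be sketched with awk's "-v" option instead of direct substitution. This is only a sketch under a few assumptions: the function name filter_client_window is made up for illustration, the timestamp is assumed to be in milliseconds (it has 13 digits in the sample), and the window is therefore 60000 rather than 60.

```shell
# Hypothetical helper: read one string per line on stdin and print the lines
# whose @client timestamp lies in the 60 seconds before the given time.
# $1 is the reference time in milliseconds since the epoch (an assumption).
filter_client_window() {
    awk -v now="$1" '
        /@client/ {
            ts = $0
            sub(/^.*@client\./, "", ts)   # trim off the part through "@client."
            sub(/\..*$/, "", ts)          # trim off the part beginning with "."
            # ts + 0 forces a numeric comparison
            if (ts + 0 >= now - 60000 && ts + 0 < now) print
        }'
}
```

With GNU date it could be used as: strings - inputfile | filter_client_window "$(date +%s)000"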

As an aside, I'm aware that awk has a "-v" option, but since I generally build up scripts using the simplest tool which works first (such as sed), I generally do direct substitution by habit, saving "-v" for scripts passed as separate files. I did (long ago) run into an awk which did not support "-v" (see changelog). But we can take for granted that it is there.

Answer 2

You should try something with strings; it keeps only the runs of printable ASCII characters from your file:

strings - 142490.1 |
  awk -F '.' -v timestamp="$(date +%s)" '/^@client/ && $2 < (timestamp - 60)*1000 {print}'

This awk script may be too specific to this example: it looks at the field between the first and the second dot and treats it as the timestamp. If that value is less than the current timestamp minus 60 seconds, it prints the line.

Hope it helped.

EDIT: As noted by Thomas Dickey (I'm new here, I don't know how to make a real reference to your account), you have to use the - flag on strings.

EDIT2: After a few attempts, we reached a working version by adapting another answer from @ThomasDickey:
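
A variation on the same field-based test, as a sketch only: in the sample, the identifier is preceded by a stray character such as G or H, so this version does not anchor the pattern at the start of the line and prints just the part beginning with @client. The function name print_old_clients is hypothetical, and the timestamp is assumed to be in milliseconds.

```shell
# Hypothetical helper: read one string per line on stdin and print only the
# "@client..." part when its timestamp is more than 60 seconds older than
# the given time. $1 is the reference time in seconds since the epoch.
print_old_clients() {
    awk -v now="$1" '
        match($0, /@client\.[0-9]+\.[^ ]*/) {
            id = substr($0, RSTART, RLENGTH)   # text from "@client" onward
            split(id, part, ".")               # part[2] is the timestamp
            if (part[2] + 0 < (now - 60) * 1000) print id
        }'
}
```

It could be used as: strings - 142490.1 | print_old_clients "$(date +%s)"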

FILE=1424911080.1
strings - "$FILE" |
  awk -v fileTs="${FILE%.*}000" '/@client/ { ts = $0 ; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts - fileTs > 500 || ts - fileTs < -500 ) { print $0; } }'

Finally, to get the percentage of lines that have a timestamp difference > 500:

FILE=1424911080.1
# total number of @client lines
tot=$(strings - "$FILE" | grep -c '@client')
# lines whose timestamp differs from the file's timestamp by more than 500 ms
old=$(strings - "$FILE" |
  awk -v fileTs="${FILE%.*}000" '/@client/ { ts = $0 ; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts - fileTs > 500 || ts - fileTs < -500 ) { print $0; } }' |
  wc -l)

echo "old : $(( old * 100 / tot ))%"
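
The same percentage can probably be computed in one pass, which avoids running strings twice and avoids dividing by zero when no @client line is found. This is a sketch; the function name percent_off is made up, and it assumes the same 500 ms threshold as above.

```shell
# Hypothetical helper: read one string per line on stdin and report the
# share of @client lines whose timestamp is more than 500 ms away from the
# file timestamp. $1 is the file timestamp in milliseconds.
percent_off() {
    awk -v fileTs="$1" '
        /@client/ {
            total++
            ts = $0
            sub(/^.*@client\./, "", ts)
            sub(/\..*$/, "", ts)
            d = ts - fileTs
            if (d > 500 || d < -500) off++
        }
        END {
            if (total > 0) printf "old : %d%%\n", int(off * 100 / total)
            else print "no @client lines found"
        }'
}
```

It could be used as: strings - "$FILE" | percent_off "${FILE%.*}000"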
