How to read a binary file that contains some text strings, using a shell script?
I have a file named 142490.1, and its content looks like this:
^A^A^@^@^@=^@^@=y^B^@e^A^C^@f^B^H¬^\ÂA^Y^A^G^B<81>s
^A^@G@client.1424906160996.30431.DC1.5faa5c2a-c382-40b8-baa8-234a8e6ecd19^@^@^A^F<8b>f^@ø^@y^@^@^AKÃ^F<86>T^@^@^@êõ^A\^@^R304344351^N2047675^@^D77^@^Y^W^B^@
27.99^@^X261449949761^@Ã^O^@<92>^NICHOLSON Baseball ^V|t -S M L XL XXL(2)^@
15724^@
63862^U^GðV11450^@^B7^@<9a>^A^@^L823196^@¨<99>´°øR^B^@^TBj%2FRZUw*^@^PBoZf8jU*^@^T1032869222^B^@&LH_DefaultDomain_77^@^@^A^@^@H@client.1424906160992.116975.DC1.344073e8-93f6-487c-b343-7923080f07aa^@^@^AKÃ^F<8b>f^@Â^@y^@^@^AKÃ^Eò<9f>£^AX^@^T1169755138^N2047935^@^B3.^W^@ð^?^B^@^H0.99^@^X171689807229^B^@rTOPSHOP LEATHER 3 EU 36^B^B^@
45333^B^B^@^F^@^L161103^@ðï°øR^B^B^@^PBosZQlE*^B^B^B^@^@^A^@^@G@client.1424906160976.1295684.DC1.66a6ca77-30ee-4d50-b7ea-4a524eb94af1^@^@^AKÃ^F<8b>f^@¤^@y^@^@^AKÃ^F<89>^O^@^@^@<96><9a>^AT^@^R129569484^N2047935^@^B3^]^V^B^@^F499^853759648^B^@bWILLIS AND^B^B^@
20489^B^B^@^F^@^P-1404420^@<9e>¤´°øR^B^B^@^PBop4ml0*^B^B^B^@^@^A^@^@H@client.1424906160989.104826.DC1.4d58c06a-3526-408a-a48b-8bdc82b94dba^@^@^AKÃ^F<8b>f^@¨^@R^@^@^AKÃ^F<83>¶^@^@^@<9a>·^AX^@^T1048328026^N2045573^@^B0.^W^@^P^B^B^^Að@^@^H6000^@^Z1955 corvette^@ì<8e>´°øR^B^@^PBiZzFm8*^@^PBoO8YKc*^@^@^A^@
I know the file content above looks mostly binary, but it contains some strings that can be read clearly.
If you look at the file content above, you will see a string like this:
@client.1424906160996.30431.DC1.5faa5c2a-c382-40b8-baa8-234a8e6ecd19
In the above string, 1424906160996 is a timestamp.
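For reference, the 13-digit value is consistent with a Unix epoch time in milliseconds. Below is a minimal sketch of pulling the timestamp out of one such string and converting it to a readable date. It assumes bash and GNU date (the `-d @seconds` syntax); the function name `client_ts_ms` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the timestamp embedded in an
# "@client.<timestamp>.<...>" string (the field after "@client.").
client_ts_ms() {
    local s=${1#*@client.}     # drop everything up to and including "@client."
    printf '%s\n' "${s%%.*}"   # keep only the text before the next dot
}

ms=$(client_ts_ms '@client.1424906160996.30431.DC1.5faa5c2a-c382-40b8-baa8-234a8e6ecd19')
echo "$ms"

# Assumption: the value is milliseconds since the epoch, so divide by 1000
# before handing it to GNU date.
date -u -d "@$(( ms / 1000 ))" '+%Y-%m-%d %H:%M:%S'
```

Because the timestamps are in milliseconds, any comparison against `date +%s` (which prints seconds) needs a factor of 1000 on one side.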
Problem statement:
I need to find all the strings that start with @client and whose timestamp is more than one minute older than the current timestamp.
For example, if the strings below are the ones that start with @client and whose timestamp is more than one minute older than the current timestamp, then the script should print them like this after reading the file:
@client.1424906161996.3031.DC1.5faaa-c382-40b8-baa8-234a8ed19
@client.1424906162996.3041.DC1.5a5c2a-c382-40b8-baa8-238e6ec9
@client.1424906163996.3043231.DC1.5faa2a-c382-40b8-baa8-23e6ed19
@client.1424906164996.3016731.DC1.5faa5a-c382-40b8-baa8-234ad19
Is there any way to do this with a shell script that reads the above file and prints the strings that start with @client and whose timestamp is older than one minute?
I am running Ubuntu 12.04.
The simplest way to extract the data is by using the strings utility, telling it to scan the whole file, e.g.,
strings - inputfile | egrep '@client(\.[[:xdigit:]]+)+(-[[:xdigit:]]+)+'
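Judging from the sample dump, strings may emit a printable byte just before the token (for example `G@client...`), so it can help to isolate the token itself. The following is a rough sketch using `grep -o`; the function name `extract_client_ids` and the exact pattern are assumptions based only on the sample strings shown in the question.

```shell
#!/bin/bash
# Hypothetical helper: read text lines on stdin and print only the
# "@client.<digits>.<digits>.<word>.<hex>-<hex>-..." tokens, one per line.
# The pattern is inferred from the sample data and may need adjusting.
extract_client_ids() {
    grep -oE '@client\.[0-9]+\.[0-9]+\.[[:alnum:]]+\.[[:xdigit:]]+(-[[:xdigit:]]+)+'
}

# Typical use (file name taken from the question):
#   strings - 142490.1 | extract_client_ids
```

Since `grep -o` prints only the matched part, any leading or trailing printable bytes that strings happened to include on the same line are dropped.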
but as noted in the other example, there is still the timestamp to consider. That can be done by piping the raw data through awk, e.g.,
awk '/@client/ { ts = $0; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts >= '$TS' - 60 && ts < '$TS' ) { print $0; } }'
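where $TS is the value that you are looking for (a range makes more sense than equality). Note that $TS and the window must be in the same units as the embedded timestamps; the sample values appear to be in milliseconds, so a one-minute window would be 60000 rather than 60.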
Actually the egrep is redundant (awk/mawk/gawk can do character classes unless you're using the obsolete version from Ubuntu). But it helps to break the process into stages to check that they work.
As an aside, I'm aware that awk has a "-v" option, but since I generally build up scripts using the simplest tool that works first (such as sed), I do direct substitution by habit, saving "-v" for scripts passed as separate files. I did (long ago) run into an awk which did not support "-v" (see changelog). But we can take for granted that it is there.
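For comparison, here is a sketch of the same range filter with the values passed through `-v` rather than spliced into the script text. The function name `filter_window` and its two arguments (current time and window length, both in milliseconds) are hypothetical and chosen only for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin and print those whose embedded
# "@client.<timestamp>." value lies in [now - win, now).
#   $1 = current time in milliseconds, $2 = window length in milliseconds
filter_window() {
    awk -v now="$1" -v win="$2" '
        /@client\./ {
            ts = $0
            sub(/^.*@client\./, "", ts)   # strip through "@client."
            sub(/\..*$/, "", ts)          # strip from the next dot onward
            if (ts + 0 >= now - win && ts + 0 < now) print
        }'
}

# Typical use, assuming millisecond timestamps as in the sample data:
#   strings - 142490.1 | filter_window "$(( $(date +%s) * 1000 ))" 60000
```

Passing the values with `-v` keeps the awk program a fixed string, which avoids quoting problems if a shell variable is empty or contains unexpected characters.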
You should try something with strings, which keeps only the printable ASCII characters from your file:
strings - 142490.1 |
awk -F '.' -v timestamp="$(date +%s)" '/^@client/ && $2 < (timestamp - 60)*1000 {print}'
This awk script may be too specific to this example: it looks at the field between the first and second dots and treats it as the timestamp. If that value is less than the current timestamp minus 60 seconds, it prints the line.
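One caveat, judging from the sample dump: strings can emit a printable byte before the token (such as `G@client...`), in which case a pattern anchored with `^@client` would not match that line. The sketch below first isolates the token and then applies the same field-based comparison; the function name `older_than_minute` is hypothetical, and the assumption is that the second dot-separated field is a millisecond timestamp.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin, isolate each "@client..." token,
# and print those whose timestamp is more than 60 seconds older than $1.
#   $1 = current time in seconds (for example "$(date +%s)")
older_than_minute() {
    grep -oE '@client\.[0-9]+[^[:space:]]*' |
    awk -F '.' -v now_s="$1" '$2 < (now_s - 60) * 1000'
}

# Typical use:
#   strings - 142490.1 | older_than_minute "$(date +%s)"
```

The token pattern here is deliberately loose (everything up to the next whitespace), so trailing printable bytes may still be attached to a match.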
Hope that helps.
EDIT: As noted by Thomas Dickey (I'm new here, I don't know how to make a real reference to your account), you have to use the - flag on strings.
EDIT2: After a few attempts, we reached a working version by adapting another answer from @ThomasDickey:
FILE=1424911080.1
strings - "$FILE" |
awk -v fileTs="${FILE%.*}000" '/@client/ { ts = $0; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts - fileTs > 500 || ts - fileTs < -500 ) { print $0; } }'
Finally, to get the percentage of lines whose timestamp differs from the file's timestamp by more than 500:
FILE=1424911080.1
# Total number of lines that contain an @client string
tot=$(strings - "$FILE" | grep -c '@client')
# Number of lines whose timestamp is more than 500 away from the file timestamp
old=$(strings - "$FILE" |
awk -v fileTs="${FILE%.*}000" '/@client/ { ts = $0; sub(/^.*@client\./, "", ts); sub(/\..*$/, "", ts); if ( ts - fileTs > 500 || ts - fileTs < -500 ) { print $0; } }' |
wc -l)
echo "old : $(( old * 100 / tot ))%"
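The same percentage can also be computed in a single pass, which avoids running strings twice and allows a guard for the case where no @client line is found (the shell arithmetic would otherwise divide by zero). This is only a sketch; the function name `pct_out_of_range` and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin and print the percentage of
# "@client" lines whose timestamp differs from fileTs by more than tol.
#   $1 = reference timestamp (same units as the embedded timestamps)
#   $2 = tolerance (same units)
pct_out_of_range() {
    awk -v fileTs="$1" -v tol="$2" '
        /@client\./ {
            ts = $0
            sub(/^.*@client\./, "", ts)   # strip through "@client."
            sub(/\..*$/, "", ts)          # strip from the next dot onward
            total++
            d = ts - fileTs
            if (d > tol || d < -tol) off++
        }
        END {
            if (total == 0) { print "no @client lines found"; exit 1 }
            printf "%d%%\n", int(off * 100 / total)
        }'
}

# Typical use, mirroring the variables above:
#   strings - "$FILE" | pct_out_of_range "${FILE%.*}000" 500
```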