
Searching files with grep and only outputting parts of lines

I'm looking through log files and trying to get a less cluttered output in my end file. If I grep for a value, I then want to format the output to remove everything but the date and the URL.

For example, here is a line of the file:

Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 1 55.44.33.22 10.9.8.7 - 10.60.154.41 http://website.com 0 BYF ALLOWED CLEAN 2 1 0 0 0 (-) 0 - 0 - 0 sqm.microsoft.com - [-] sqm.microsoft.com - - 0

I want to do a grep, or a better command if necessary, to output to a .txt file with only the bold entries listed: basically, the date and the URL. So how do I tell it to list the first 15 characters including spaces, then find the first http/https and list everything until the first space? Each line is not the same length or anything of that nature, so I cannot just go by character position.

So my output would be:

Sep 25 08:07:51 http://website.com

Thank you.

You can't easily use the -o option of grep here, because you have two patterns separated by a variable number of characters (and -o prints the complete matched part).
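To see why, here is a quick sketch using a single -o pattern that spans both parts (the sample line is abbreviated from the one in the question, and the /tmp path is just illustrative):

```shell
# Abbreviated sample line from the question, written to a temp file:
printf 'Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 http://website.com 0 BYF\n' > /tmp/sample.log

# One pattern covering both the date and the URL also prints everything
# in between, which is exactly the clutter we are trying to remove:
grep -oE '^.{15}.*https?://[^ ]+' /tmp/sample.log
```

The match necessarily includes the addresses and counters between the two parts, so a single -o pattern cannot produce the wanted two-field output on its own.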

If you wanted to extract only the URLs, this would suffice:

$ grep -oE 'https?:[^ ]+' file
http://website.com

But to extract both the date and the URL, probably the simplest solution is GNU awk:

$ awk '{ match($0, /https?:[^ ]+/, url); print $1, $2, $3, url[0]; }' file
Sep 25 08:07:51 http://website.com

Here you print the first three fields ($1 to $3, space-separated), then search for a URL with match() (assuming it contains no spaces, i.e. that space characters are always properly escaped, either as + or as %20), and then print the first URL found (after the date).

If you have POSIX awk (or call gawk with the --posix flag), the solution is a little more verbose, since POSIX match() doesn't support saving the matched parts into an array (the third argument, url), and you'll have to extract the URL explicitly with substr() when a match is found:

$ awk '{ match($0, /https?:[^ ]+/); print $1, $2, $3, substr($0, RSTART, RLENGTH); }' file
Sep 25 08:07:51 http://website.com

To supplement @randomir's answer, we can also use sed:

$ sed 's/\(.\{15\}\).*\(https\?:\/\/[^ ]\+\).*/\1 \2/' < input.txt > output.txt

This pattern assumes that the first 15 characters compose the date and that the URL contains no spaces. It works for both http and https URLs.
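For example, running the command on an abbreviated version of the sample line from the question (file paths here are illustrative):

```shell
printf 'Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 http://website.com 0 BYF\n' > /tmp/input.txt

# Note: \? and \+ are GNU extensions to basic regular expressions;
# on BSD/macOS sed, use -E and drop those backslashes instead.
sed 's/\(.\{15\}\).*\(https\?:\/\/[^ ]\+\).*/\1 \2/' < /tmp/input.txt
```

One caveat: because the .* before the URL group is greedy, a line containing several URLs would have its last URL captured, not its first.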


Edit: to address the comment, and for the sake of learning, we can also invoke sed to perform line-matching operations like grep:

sed -n '/10\.45\.19\.151/p' < input.txt

...will output any lines in input.txt that contain the IP address 10.45.19.151. The -n option suppresses the default output of every line; we combine it with the p command to print only the lines that match the pattern.
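As a quick sanity check, the sketch below shows that this sed invocation and a plain grep select exactly the same lines (the sample addresses and the /tmp path are made up):

```shell
printf 'Sep 25 08:07:51 10.45.19.151 http://a.example\nSep 25 08:07:52 10.20.30.40 http://b.example\n' > /tmp/in.txt

# Both commands act as pure line filters and print only the first line:
grep '10\.45\.19\.151' /tmp/in.txt
sed -n '/10\.45\.19\.151/p' /tmp/in.txt
```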

We can merge this approach with the first command to "grep" for lines and transform them using a single command:

sed -n '/<line-match-pattern>/ s/<...>/<...>/ p' < input.txt

...will select only the lines that match <line-match-pattern>, perform the substitution, and output the result. To illustrate, here's an example using the information provided in the comment:

sed -n '/10\.45\.19\.151/ s/\(.\{15\}\).*\(https\?:\/\/[^ ]\+\).*/\1 \2/ p' \
    < messages-20171001 \
    > /backup/mikesanders-fwlog-10012017.txt
awk '{match($0,/http[^com]*/);print $1,$2,$3,substr($0,RSTART,RLENGTH+3)}'  Input_file

Explanation of the above code:

awk '{
match($0,/http[^com]*/)                    ## Search for "http" followed by characters up to the "com"
                                           ## suffix; on success, match() sets the awk built-in
                                           ## variables RSTART and RLENGTH.
print $1,$2,$3,substr($0,RSTART,RLENGTH+3) ## Print the first three columns (date and time), then the
                                           ## matched substring, extended by 3 characters to take in
                                           ## the trailing "com" of the URL.
}' Input_file                              ## The input file name.

You can use grep -o to match each of the line sections that you want, then reassemble the lines that grep returns:

$ grep -Eo '^.{15}|https?://[^ ]+' f | paste - -
Sep 25 08:07:51 http://website.com

Note that FreeBSD and OSX ship an old, buggy version of GNU grep (2.5.1), so more explicit date recognition is in order there:

$ grep -Eo '[A-Z][a-z]{2} ([0-9]{2}[ :]){3}[0-9]{2}|https?://[^ ]+' f | paste - -
Sep 25 08:07:51 http://website.com

A workaround in FreeBSD is to use bsdgrep, which is functionally equivalent to GNU grep but without the bugs. On MacOS, one might need to install an alternative using homebrew or macports, or just use the POSIX awk solution in another answer.

Anyway, in both cases the regular expression consists of two expressions joined with an or-bar (|, before https). The first subexpression matches your dates, the second one matches your URLs.

As long as every line of input contains text matching both of these elements, you should get two lines of output from grep for each log entry. paste then reassembles them into a single line.
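The intermediate stream makes this clearer. Without paste, grep's -o output is one match per line (sample line abbreviated, /tmp path illustrative):

```shell
printf 'Sep 25 08:07:51 10.20.30.40 http://website.com 0 BYF\n' > /tmp/f

# grep emits two lines per log entry: first the date, then the URL...
grep -Eo '^.{15}|https?://[^ ]+' /tmp/f

# ...and paste - - joins every pair of consecutive lines with a tab:
grep -Eo '^.{15}|https?://[^ ]+' /tmp/f | paste - -
```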

Just one command line, like:

msr -p my.log -t "^(.*?\\d+:\\d+:\\d+).*?(https?://\\S+).*" -o '$1 $2' -PIC > output.txt

  • If the first 15 characters are more reliable than the pattern "^(.*?\\d+:\\d+:\\d+)":

    Use "^(.{15})", like: -t "^(.{15}).*?(https?://\\S+).*"

  • If you want to filter further, for example requiring one IP 10.9.8.7 as plain text (-x):

    msr -p my.log -x 10.9.8.7 -t "^(.*?\\d+:\\d+:\\d+).*?(https?://\\S+).*" -o '$1 $2'

  • If it must contain more IPs, like 10.9.8.7 10.9.8.8 10.9.8.9, or for further processing:

    msr -p my.log -t "^(.*?\\d+:\\d+:\\d+).*?(https?://\\S+).*" -o '$1 $2' -PAC | msr -t "10\\.9\\.8\\.[7-9]" -PAC > output.txt

msr.exe / msr.gcc* is a single-executable tool for this kind of ETL-like work (Load -> Extract -> Transform or Replace files) in my open project: about 1.6 MB, no dependencies, with cross-platform versions plus x86 / x64 versions.

  • Load files recursively (-r) and filter by directory name, file name, time, and size, like:

    -r -p dir1,dirN,file1,fileN -f "\\.(log|txt)$" --w1 2017-09-25 and --nf "excluded-files" --nd "excluded-directories", --s1 1.5MB --s2 30MB, --w2 "2017-09-30 22:30:50", etc.

  • Extract with a general regex, unlike sed or awk; exactly the same syntax as C++ / C# / Java / Scala, etc.:

    -t "^(.*?\\d+:\\d+:\\d+).*?(https?://\\S+).*"; to ignore case, add -i, like: -i -t or -it

  • Transform the output, like:

    • -o '$1 $2' for Linux, or Cygwin / Powershell on Windows.
    • -o "$1 $2" for a Windows CMD console window or *.bat / *.cmd files.

See the following screenshot: [screenshot: extracting the log and transforming the output]

If you're on Linux, you can just run msr.gcc48, or msr-i386.gcc48 if it's a 32-bit machine. Just run the executable to see the usage and examples, or see the online docs about performance comparisons (with the Linux system tool grep and the Windows system tool findstr) and the built-in docs, such as msr on CentOS and the colorful demo on Windows.
