Extract a substring when there is no clear pattern in Linux

I am not very advanced in coding and have been struggling with this problem. I need to extract a substring from a .txt file, but there is no clear pattern that would let me use the awk or cut commands. I need to extract the value of AF in each line in the picture below (circled in blue); however, the number of characters in this string varies from line to line, and so does its position within the line. I tried using grep, but it only returns "AF=", not the numeric value that follows. I also thought about using re.findall in Python, but the Python environment I have in Ubuntu isn't letting me use it.

[Image: sample lines with the AF values circled in blue]

I would greatly appreciate any guidance, thank you!

Since the example was provided as an image rather than as text, here is my own example text (made up by tapping randomly on the keyboard):

AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5

What I noticed is that the data is laid out like a table: the fields are separated by semicolons (;), and each value is given as KEY=VALUE.

To get just the AF field, you can use grep with this pattern: AF=[0-9.]+

Explanation: [0-9.] matches any one of the characters 0123456789. and + requires one or more of them in a row.

Here is example terminal output:

$ cat /tmp/a
AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5

$ grep -o -E 'AF=[0-9.]+' /tmp/a
AF=32435.42235
AF=32.43542235
AF=32435.42235

If you want only the numbers (without the AF= prefix), you can pipe the result into a second grep command:

$ grep -o -E 'AF=[0-9.]+' /tmp/a | grep -o -E '[0-9.]+'
32435.42235
32.43542235
32435.42235

Grep flag explanation: -E enables extended regular expressions; -o prints only the matched part instead of the whole line.
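The second grep in the pipeline above could also be replaced by cut, which splits each match on = and keeps the part after it. This is only a sketch of one alternative: the sample file is a temporary file created for illustration, and the variable names sample and af_values are hypothetical.

```shell
# Build a throwaway sample file (illustrative data, same shape as above).
sample=$(mktemp)
printf '%s\n' \
    'AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5' \
    'j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg' \
    '3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234' > "$sample"

# grep -o keeps only the "AF=<number>" match;
# cut -d= -f2 then keeps only what follows the "=".
af_values=$(grep -o -E 'AF=[0-9.]+' "$sample" | cut -d= -f2)
printf '%s\n' "$af_values"

rm -f "$sample"
```

This works because the match produced by the pattern contains exactly one = character, so the second field is always the number.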

You can use grep to match everything from AF= up to, but not including, the next semicolon:

grep -o 'AF=[^;]*'

To guard against spurious matches when AF= appears elsewhere in a line as the tail of a longer key (for example MAF=), the following matches only when AF= begins on a word boundary (\b is a GNU grep extension):

grep -o '\bAF=[^;]*'
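To see what the word boundary changes, here is a small comparison on a made-up line that contains both a MAF= key and an AF= key. The input line and the variable names line, loose and strict are assumptions for illustration, and the sketch assumes GNU grep, since \b is not part of POSIX basic regular expressions.

```shell
# Hypothetical input: MAF= ends in "AF=", so it can be matched by accident.
line='MAF=0.5;AF=0.25;dw=234'

# Without the boundary, the tail of MAF= matches as well as the real AF= key.
loose=$(printf '%s\n' "$line" | grep -o 'AF=[^;]*')

# With \b, a match must start where a word starts, so MAF= is skipped.
strict=$(printf '%s\n' "$line" | grep -o '\bAF=[^;]*')

printf 'without \\b:\n%s\n' "$loose"
printf 'with \\b:\n%s\n' "$strict"
```

The first command reports both AF=0.5 and AF=0.25, while the second reports only AF=0.25.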

grep is probably the best way to do it, but here is an awk alternative:

$ echo "test;AF=342435.34234;yes=3434" | awk -F'AF=' '{split($2,a,";");print FS a[1]}'
AF=342435.34234

It splits each line at the AF= tag, then takes the rest of the text up to the next ; and prints it with the AF= prefix put back.
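A variant of the awk idea, sketched here under the assumption that every field is separated by ; and written as KEY=VALUE, is to split on the semicolons and look for the field that starts with AF=. The function name extract_af and the sample lines are hypothetical.

```shell
# Print the value of the first field that begins with "AF=" on each line.
# Reads the files given as arguments, or standard input if there are none.
extract_af() {
    awk -F';' '{
        for (i = 1; i <= NF; i++) {
            # index() returns 1 only when the field starts with "AF=",
            # so a longer key such as "MAF=" is not picked up.
            if (index($i, "AF=") == 1) {
                print substr($i, 4)   # drop the 3-character "AF=" prefix
                break
            }
        }
    }' "$@"
}

# Usage: prints 0.25 and then 32.43542235
printf '%s\n' \
    'MAF=0.5;AF=0.25;dw=234' \
    'j6u;5ju65e;AF=32.43542235;dw=234' | extract_af
```

Lines without an AF= field produce no output, which may or may not be what is wanted when the results have to stay aligned with the input lines.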
