
How to extract a number in single quotes from a line with awk or sed?

I have these lines, tab-delimited:

chr1    11460   11462   '16/38' 421     +       chr1    11460   11462   '21/29' 724     +       2
chr1    11479   11481   '11/29' 379     +       chr1    11479   11481   '20/5' 667     +       2

What I want to do is test whether the second number inside each pair of single quotes is greater than or equal to 10. If so, I'll output the line. So the result should be to print only the first line:

chr1    11460   11462   '16/38' 421     +       chr1    11460   11462   '21/29' 724     +       2

I can write Perl code to do it, but this seems like something awk can do easily. Does anyone have a solution?

Thanks.

If you set the right field separator, this is very simple:

awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}' file
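To see why the loop starts at field 3 and steps by 3, it may help to print the fields it inspects. The following is only a sketch: the scratch file and the variable names sample and inspected are illustrative, and the sample lines are rebuilt with printf on the assumption that the real data is tab-delimited as stated in the question.

```shell
# Recreate the two sample lines (tab-delimited) in a scratch file.
sample=$(mktemp)
printf "chr1\t11460\t11462\t'16/38'\t421\t+\tchr1\t11460\t11462\t'21/29'\t724\t+\t2\n" >  "$sample"
printf "chr1\t11479\t11481\t'11/29'\t379\t+\tchr1\t11479\t11481\t'20/5'\t667\t+\t2\n"  >> "$sample"

# With ' and / as separators, each quoted pair contributes three fields:
# the text before it, the first number, and the second number.
# Print the record number, field index, and value that the filter compares with 10.
inspected=$(awk -F "['/]" '{for (i=3; i<=NF; i+=3) print NR ": $" i " = " $i}' "$sample")
printf '%s\n' "$inspected"

rm -f "$sample"
```

For the sample data this lists fields 3 and 6 of each line (38 and 29 for the first line, 29 and 5 for the second), which is why only the first line survives the filter.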

The easiest way to fetch the content inside single quotes might be to strip everything from both ends of each line, up to and including the single quotes:

$ sed "s/^[^']*'//;s/'.*//" file
16/38
11/29

This sed expression consists of two commands:

  • s/^[^']*'// -- strips off all text up to and including the first single quote,
  • s/'.*// -- strips off all text from the first (remaining) single quote to EOL.

To wrap this in a shell script that does something with that data requires... well, a shell script.

You can parse this output using bash's read command. For example:

#!/bin/bash
IFS=/
sed "s/^[^']*'//;s/'.*//" file \
| while read left right; do
  echo "$left / $right"
done

To grab the contents of multiple single-quoted numbers, you can expand the sed script accordingly and add if statements for the conditions you want. For example, a sed expression to grab the two single-quoted strings might be:

sed "s/^[^']*'\([^']*\)'[^']*'\([^']*\)'.*/\1 \2/"

This is a single large regex that uses two pairs of escaped parentheses, \( and \), to mark the patterns that are placed in the output as \1 and \2.
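As a rough sketch of how the two captured strings could feed an if statement, the output of that sed expression can be piped into read with both the space and the slash as separators. The function name classify_pairs and the keep/drop labels are made up for illustration, and the sketch assumes exactly two quoted pairs per line.

```shell
# Hypothetical helper: for each input line, print the second number of
# both quoted pairs followed by "keep" or "drop".
classify_pairs() {
  sed "s/^[^']*'\([^']*\)'[^']*'\([^']*\)'.*/\1 \2/" "$@" |
  while IFS=' /' read -r _ den1 _ den2; do
    # sed emits e.g. "16/38 21/29"; splitting on space and slash yields
    # four words, of which the 2nd and 4th are the numbers to test.
    if [ "$den1" -ge 10 ] && [ "$den2" -ge 10 ]; then
      echo "$den1 $den2 keep"
    else
      echo "$den1 $den2 drop"
    fi
  done
}
```

Calling classify_pairs file reads the named file; with no argument it reads standard input.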

But you might be better off parsing by column position:

$ while read _ _ _ A _ _ _ _ _ B _; do echo "$A .. $B"; done < file
'16/38' .. '21/29'
'11/29' .. '20/5'

Actually implementing your programming logic is left as an exercise for the reader. If you'd like us to help with your script, please include your work so far.
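For readers who want a starting point, here is one possible sketch of that logic built on the column-position approach. The function name filter_by_columns is hypothetical, and the sketch assumes the quoted pairs always sit in columns 4 and 10 and are well-formed 'N/M' values.

```shell
# Hypothetical helper: print only the lines whose columns 4 and 10 both
# have a number after the slash that is at least 10.
filter_by_columns() {
  while IFS= read -r line; do
    # Re-read the saved line to split it into columns; the whole line is
    # kept in $line so it can be printed unchanged.
    read -r _ _ _ a _ _ _ _ _ b _ <<EOF
$line
EOF
    # Turn '16/38' into 38: drop everything through the slash, then the
    # trailing single quote.
    a=${a#*/}; a=${a%"'"}
    b=${b#*/}; b=${b%"'"}
    if [ "$a" -ge 10 ] && [ "$b" -ge 10 ]; then
      printf '%s\n' "$line"
    fi
  done
}
```

It reads standard input, so it would be used as filter_by_columns < file.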

As long as those are the only ' characters in the line and the numbers have no leading zeros, you could use the regular expression:

\d\d+'.*\d\d+'

If either of those preconditions isn't true, there are changes that could be made, but they would depend on the situation.

You should be able to use grep to get the lines you want with that regex. Note that \d is Perl syntax, which grep understands only with GNU grep's -P option; with -E, write [0-9] instead. The pattern also needs to be quoted, and grep expects a file name (or standard input) rather than the text itself. The following prints just the first line to stdout:

grep -E "[0-9][0-9]+'.*[0-9][0-9]+'" file

My version is serious overkill, but it should work with any number of 'xx/xx' fields per line:

awk -F'\t' "{
    found=1;
    # awk fields are numbered from 1 to NF
    for(i=1;i<=NF;i++){
        if(match(\$i, /'[[:digit:]]+\/([[:digit:]]+)'/, capts)){
            # adding 0 forces a numeric comparison
            if(capts[1]+0 < 10){
                found=0;
                break;
            }
        }
    }
    if(found){
        print;
    }
}" file.txt

Explanation:

This loops through each field of the line and applies a regex to the field to capture the number after the slash in 'xx/xx'. If that number is less than 10, it breaks out of the loop and moves on to the next line without printing. If the for loop has processed all fields and no such number was less than 10, it prints the line.

Note:

Because I'm using the three-argument form of match() to capture regex groups, this will only work with GNU awk.
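If GNU awk is not available, a similar loop can be sketched without capture groups by testing each field with ~ and trimming it with sub(). This is an illustrative variant rather than the answer's original code: the function name filter_portable and the awk variable q (which carries the single quote into the program) are assumptions introduced here.

```shell
# Hypothetical helper: same filter without gawk's capture array.
filter_portable() {
  awk -F'\t' -v q="'" '
  {
    for (i = 1; i <= NF; i++) {
      # Only consider fields that look like a quoted N/M pair.
      if ($i ~ ("^" q "[0-9]+/[0-9]+" q "$")) {
        n = $i
        sub(/^.*\//, "", n)    # drop everything through the slash
        if (n + 0 < 10) next   # n + 0 converts the leading digits to a number
      }
    }
    print
  }' "$@"
}
```

It takes file names as arguments (filter_portable file.txt) or reads standard input when none are given.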
