
Filter column with awk and regexp

I have a pretty simple question. I have a file containing several columns and I want to filter them using awk.

The column of interest is the 6th, and I want to find every string with this structure:

  • a number from 1 to 100 at the start
  • then one "S" or "M"
  • then another number from 1 to 100
  • then one "S" or "M"

For example: 20S50M is OK.

I tried:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

but it didn't work... What am I doing wrong?

This should do the trick:

awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file

Regexplanation:

^                        # Match the start of the string
(([1-9]|[1-9][0-9]|100)  # Match a single digit 1-9 or double digit 10-99 or 100
[SM]                     # Character class matching the character S or M
){2}                     # Repeat everything in the parens twice
$                        # Match the end of the string
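
As a quick sanity check of the pattern above, here is a minimal sketch in Python rather than awk. It assumes Python's re module treats this particular pattern the same way awk's extended regular expressions do (it uses only grouping, alternation, character classes and an interval). The function name and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the awk one-liner. re.fullmatch already anchors at both
# ends; ^ and $ are kept only to mirror the awk version.
PATTERN = re.compile(r'^(([1-9]|[1-9][0-9]|100)[SM]){2}$')

def is_valid(field):
    """Return True if field looks like <1-100><S or M><1-100><S or M>."""
    return PATTERN.fullmatch(field) is not None

# Hypothetical values for column 6.
samples = ["20S50M", "100M1S", "0S50M", "101S50M", "20S50Mx", "20X50M"]
for s in samples:
    print(s, is_valid(s))
```

Only the first two samples are accepted: 0 and 101 fall outside 1 to 100, the trailing x breaks the end anchor, and X is not in [SM].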

You have quite a few issues with your statement:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
  • == is the string comparison operator. The regex match operator is ~.
  • You don't quote regex literals (in awk you never quote anything with single quotes besides the script itself), and your script is missing the final (legal) single quote.
  • [0-9] is the character class for the digit characters; it's not a numeric range. It matches any single character in the set 0,1,2,3,4,5,6,7,8,9, not any numerical value inside a range. So [1-100] is not a regular expression for the numbers 1 to 100: it matches a single character, either a 1 or a 0.
  • [SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
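
The two character-class points above are easy to verify. The following is a small sketch in Python, not awk; simple bracket expressions like these behave the same way in both. The variable names and test characters are invented for the demonstration.

```python
import re

# [1-100] is the range 1-1 plus two literal 0s, so it matches exactly one
# character: '0' or '1'. It says nothing about the numbers 1 to 100.
digit_class = re.compile(r'[1-100]')
matched = [c for c in "0123456789" if digit_class.fullmatch(c)]
print(matched)      # ['0', '1']

# Inside brackets '|' is a literal character, so [S|M] also accepts '|'.
sm_class = re.compile(r'[S|M]')
sm_matched = [c for c in "SM|X" if sm_class.fullmatch(c)]
print(sm_matched)   # ['S', 'M', '|']
```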

Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as: does the sixth column match the regular expression? If true, the line gets printed, because if you don't give any action awk executes {print $0} by default.

Regexes match characters, not numeric values, so "a number from 1 to 100" is awkward to express directly in a regex. What you can do easily is check for "1-3 digits."

You want something like this (Perl-style syntax; awk has no \d, so use [0-9] or [[:digit:]] there):

/\d{1,3}[SM]\d{1,3}[SM]/

Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).

I would do the regex check and the numeric validation as different steps. This code works with GNU awk:

$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx

We'd expect only the 3rd line to pass validation:

$ gawk '
    match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
    1 <= m[1] && m[1] <= 100 && 
    1 <= m[2] && m[2] <= 100 {
        print
    }
' data
a b c d e 12S23M

For maintainability, you could encapsulate that into a function:

gawk '
    function validate6() {
        return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) && 
                1<=m[1] && m[1]<=100 && 
                1<=m[2] && m[2]<=100 );
    }
    validate6() {print}
' data
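
For comparison, the same two-step idea (check the shape with a regex, then check the numeric range on the captured groups) can be sketched in Python. This is an illustration of the approach and not a translation of gawk's match(); the names validate, SHAPE and passed are made up, and the data is the sample shown above.

```python
import re

# Step 1: shape check, capturing the two runs of 1-3 digits.
SHAPE = re.compile(r'([0-9]{1,3})[SM]([0-9]{1,3})[SM]')

def validate(field):
    """Return True if field is <n><S|M><n><S|M> with both n in 1..100."""
    m = SHAPE.fullmatch(field)
    if m is None:
        return False
    # Step 2: numeric range check on the captured groups.
    return all(1 <= int(g) <= 100 for g in m.groups())

data = """a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx"""

passed = []
for line in data.splitlines():
    fields = line.split()
    # Column 6 is index 5; skip lines with fewer than six columns.
    if len(fields) >= 6 and validate(fields[5]):
        passed.append(line)
        print(line)
```

As with the gawk version, only the third line is printed.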

The way to write the script you posted:

awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt

in awk so that it does what you seem to be trying to do is:

awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt

Post some sample input and expected output to help us help you more.

Try this:

awk '$6 ~ /^([1-9]|0[1-9]|[1-9][0-9]|100)[SM]([1-9]|0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt

Because you did not say exactly how column 6 is formatted, the above matches values that look like '03M05S', '40S100M', or '3M5S', and excludes everything else. For instance, it will not find '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'. There is no + after the groups or the character classes, since a repeat would let values such as '123S12M' through, and there is no | inside [SM], since a pipe inside brackets is a literal character.

If you can keep the numbers in column 6 to two digits for 1-99 and three digits for exactly 100 (meaning exactly one leading zero below 10 and no leading zeros otherwise), then it is a simpler match. You can use the above pattern but exclude single digits (remove the first [1-9] alternative), e.g.

awk '$6 ~ /^(0[1-9]|[1-9][0-9]|100)[SM](0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt

I know this thread has already been answered, but I actually have a similar problem (finding the operations that "consume query"). I'm trying to sum all of the integers preceding a character like 'S', 'M', 'I', '=', 'X', 'H', so as to find the read length from a paired-end read's CIGAR string.

I wrote a Python script that takes in column $6 from a SAM/BAM file:

import sys                      # getting standard input
import re                       # regular expression module

lines = sys.stdin.readlines()   # gets all CIGAR strings for each paired-end read
total = 0
read_id = 1                     # complements id from filter_1.txt

# Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
# Example inputs and outputs: 
# "49M1S" produces total=50
# "10M757N40M" produces total=50

for line in lines:
    all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
    for n in all_ints:
        total += n
    print(str(read_id)+ ' ' + str(total))
    read_id += 1
    total = 0

The purpose of read_id is to mark each read you're going through as "unique", in case you want to take the read lengths and print them beside awk-ed columns from a BAM file.

I hope this helps, or at least helps the next user who has a similar issue. I consulted https://stackoverflow.com/a/11339230 for reference.

