
Fuzzy string matching with grep

I am trying to match rows in a file that contain a given string, say ACTGGGTAAACTA. If I run

grep "ACTGGGTAAACTA" file 

it gives me the rows that contain exact matches. Is there a way to allow for a certain number of mismatches (substitutions, insertions, or deletions)? For example, I am looking for sequences with:

  1. Up to 3 substitutions, such as "AGTGGGTAACCAA".

  2. Insertions/deletions (a partial match such as "ACTGGGAAAATAAACTA" or "ACTAAACTA").

There used to be a tool called agrep for fuzzy regex matching, but it was abandoned.

http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.

https://github.com/Wikinaut/agrep looks like a revived open-source release, but I have not tested it.

Failing that, see if you can find tre-agrep for your distro.

You can use tre-agrep and specify the maximum edit distance with the -E switch (or with the single-digit shorthand -0 through -9). For example, if you have a file foo:

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

You can match every line with an edit distance of up to 9 like this:

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
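
To see where the costs in that output come from, here is a minimal Python sketch that computes the plain Levenshtein distance between the pattern and each whole line and prints it in the same cost:line layout. It is not how tre-agrep works internally (tre-agrep does approximate regex matching), and the function names levenshtein and costs are made up for this illustration. It happens to give the same numbers here because every line of foo is a single word.

```python
def levenshtein(a, b):
    """Edit distance between a and b; each substitution,
    insertion, or deletion costs 1."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] versus the empty string needs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def costs(pattern, lines, max_errors):
    """Return 'cost:line' for every line within max_errors of pattern."""
    out = []
    for line in lines:
        d = levenshtein(pattern, line)
        if d <= max_errors:
            out.append(f"{d}:{line}")
    return out


if __name__ == "__main__":
    foo = ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]
    for row in costs("ACTGGGTAAACTA", foo, max_errors=9):
        print(row)
```

The first line needs four insertions (the extra AAAA) and the second needs four deletions (the missing GGGT), which is why both are reported with a cost of 4.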

Short answer: no.

Long answer: as @JDB said, regex is inherently precise. You can manually allow a mismatch at a chosen position, for example by writing [ATGC] instead of A, but there is no way to allow only a small number of mismatches at arbitrary positions. I suggest that you write your own code to parse it, or try to find a DNA parser somewhere.
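
If you do go the "write your own code" route, one possible starting point is the classic dynamic-programming approach to approximate substring search (often attributed to Sellers). It is the usual Levenshtein table with one change: a match may start anywhere in the line at no cost. The sketch below is only an illustration under that assumption. The names min_substring_distance and fuzzy_grep are hypothetical, and it runs in time proportional to pattern length times line length, so it may be too slow for very large sequence files.

```python
import sys


def min_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # prev[i] = best cost of aligning pattern[:i] with a substring of
    # text that ends just before the current character.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]  # cost of matching against the empty substring
    for ch in text:
        curr = [0]  # the empty prefix matches anywhere for free
        for i, pc in enumerate(pattern, start=1):
            curr.append(min(
                prev[i] + 1,               # extra character in text
                curr[i - 1] + 1,           # pattern character missing
                prev[i - 1] + (pc != ch),  # match or substitution
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def fuzzy_grep(pattern, lines, max_errors):
    """Return the lines containing pattern within max_errors edits."""
    return [line for line in lines
            if min_substring_distance(pattern, line) <= max_errors]


if __name__ == "__main__":
    # Hypothetical usage: python fuzzy_grep.py PATTERN MAX_ERRORS < file
    if len(sys.argv) == 3:
        stdin_lines = [l.rstrip("\n") for l in sys.stdin]
        for hit in fuzzy_grep(sys.argv[1], stdin_lines, int(sys.argv[2])):
            print(hit)
```

Because the match is against substrings rather than whole lines, a line such as ACTGGGAAAATAAACTA is already within 3 edits of ACTGGGTAAACTA (its first 13 characters differ from the pattern in three positions), even though the whole-line edit distance is 4.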

There's a Python library called fuzzysearch (that I wrote) which provides precisely the required functionality.

Here's some sample code that should work:

from fuzzysearch import find_near_matches

with open('path/to/file', 'r') as f:
    data = f.read()

# 1. search allowing up to 3 substitutions
#    (insertions and deletions are explicitly set to 0)
matches = find_near_matches("ACTGGGTAAACTA", data, max_substitutions=3,
                            max_insertions=0, max_deletions=0)

# 2. also allow insertions and deletions, i.e. allow an edit distance
#    a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)

Remember the GNU/Linux philosophy, specifically the concept of modularity, which lets us handle small but powerful pieces independently. We can combine a bunch of these small pieces to make magic. That is the beauty of GNU/Linux.

cat file | fzf --filter='ACTGGGTAAACTA'

Check out fzf :)
