Fuzzy string matching with grep
I am trying to match rows in a file that contain a string, say ACTGGGTAAACTA. If I run
grep "ACTGGGTAAACTA" file
it gives me rows that have exact matches. Is there a way to allow for a certain number of mismatches (substitutions, insertions or deletions)? For example, I am looking for sequences with:
Up to 3 allowed substitutions, like "AGTGGGTAACCAA" etc.
Insertions/deletions (a partial match like "ACTGGGAAAATAAACTA" or "ACTAAACTA")
There used to be a tool called agrep for fuzzy regex matching, but it was abandoned.
http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.
https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.
Failing that, see if you can find tre-agrep for your distro.
You can use tre-agrep and specify the maximum edit distance with the -E switch (or with the single-digit shorthand, such as -9). For example, if you have a file foo:
cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF
You can match every line with an edit distance of up to 9 like this (-s prints the match cost, -9 sets the maximum number of errors, -w matches whole words only):
tre-agrep -s -9 -w ACTGGGTAAACTA foo
Output:
4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
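To see where those costs come from, here is a minimal sketch of the plain Levenshtein distance (unit cost for each substitution, insertion and deletion) in Python. It is not how tre-agrep is implemented, and the function name levenshtein is only illustrative; it just recomputes the distance between the pattern and each whole line of foo.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit-cost substitutions,
    insertions and deletions (classic dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


pattern = "ACTGGGTAAACTA"
for line in ("ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"):
    print(f"{levenshtein(pattern, line)}:{line}")
```

For these three lines the sketch prints the same costs as the output above: the first line needs four insertions, the second four deletions, and the third is an exact match.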
Short answer: no.
Long answer: As @JDB said, regex is inherently precise. You can manually add in mismatches, like [ATGC] instead of A in some spot, but there is no way to allow only a small number of arbitrary mismatches. I suggest that you write your own code to parse it, or try to find a DNA parser somewhere.
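If you do go the write-your-own route, the substitution-only case is the easy one. The following is a rough sketch in Python, assuming each line holds one sequence and that only substitutions are allowed (no insertions or deletions); the names best_hamming and grep_substitutions are made up for this example.

```python
from typing import Iterable, Iterator, Optional


def best_hamming(pattern: str, line: str) -> Optional[int]:
    """Smallest number of mismatching positions between pattern and any
    same-length window of line; None if line is shorter than pattern."""
    m = len(pattern)
    if len(line) < m:
        return None
    return min(
        sum(p != c for p, c in zip(pattern, line[i:i + m]))
        for i in range(len(line) - m + 1)
    )


def grep_substitutions(pattern: str, lines: Iterable[str],
                       max_subs: int) -> Iterator[str]:
    """Yield the lines that contain pattern with at most max_subs substitutions."""
    for line in lines:
        line = line.rstrip("\n")
        distance = best_hamming(pattern, line)
        if distance is not None and distance <= max_subs:
            yield line


sample = ["AGTGGGTAACCAA", "ACTGGGTAAACTA", "TTTTTTTTTTTTT", "ACTAAACTA"]
for hit in grep_substitutions("ACTGGGTAAACTA", sample, max_subs=3):
    print(hit)
```

With max_subs=3 this keeps the first two sample lines. The last one is skipped because it is shorter than the pattern, which is exactly the insertion/deletion case this simple approach cannot handle.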
There's a Python library called fuzzysearch (which I wrote) that provides precisely the required functionality.
Here's some sample code that should work:
from fuzzysearch import find_near_matches
with open('path/to/file', 'r') as f:
    data = f.read()
# 1. search allowing up to 3 substitutions
matches = find_near_matches("ACTGGGTAAACTA", data, max_substitutions=3)
# 2. also allow insertions and deletions, i.e. allow an edit distance
# a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)
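If installing a library is not an option, a grep-like, line-by-line filter can be sketched with the standard library only. This is not fuzzysearch's implementation; it is a plain dynamic-programming variant of approximate substring matching (the first row is all zeros, so a match may start anywhere in the line), and the names best_edit_distance and fuzzy_grep are hypothetical.

```python
from typing import Iterable, Iterator


def best_edit_distance(pattern: str, line: str) -> int:
    """Smallest Levenshtein distance between pattern and any substring of line."""
    # Row 0 is all zeros: a match may begin at any position of the line for free.
    previous = [0] * (len(line) + 1)
    for i, pc in enumerate(pattern, start=1):
        current = [i]  # pattern[:i] against an empty substring costs i deletions
        for j, lc in enumerate(line, start=1):
            current.append(min(
                previous[j] + 1,               # drop pc from the pattern
                current[j - 1] + 1,            # extra character lc in the line
                previous[j - 1] + (pc != lc),  # substitute, or match for free
            ))
        previous = current
    # The match may also end anywhere, so take the minimum of the last row.
    return min(previous)


def fuzzy_grep(pattern: str, lines: Iterable[str], max_dist: int) -> Iterator[str]:
    """Yield lines containing a substring within max_dist edits of pattern."""
    for line in lines:
        line = line.rstrip("\n")
        if best_edit_distance(pattern, line) <= max_dist:
            yield line


sample = ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA", "TTTTTTTT"]
for hit in fuzzy_grep("ACTGGGTAAACTA", sample, max_dist=4):
    print(hit)
```

Because the distance is measured against the best substring rather than the whole line, a long line can score lower here than its whole-line edit distance. The cost is O(len(pattern) × len(line)) per line, which is fine for short sequences but slow on very large files.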
Remember the GNU/Linux philosophy, specifically the modularity concept, which enables us to handle small-but-powerful pieces independently. We can combine a bunch of these small pieces to make magic. That is the beauty of GNU/Linux:
cat file | fzf --filter='ACTGGGTAAACTA'