Fuzzy string matching with grep

Question

I am trying to match rows in a file that contain a string, say ACTGGGTAAACTA. If I do

grep "ACTGGGTAAACTA" file

it gives me the rows that match exactly. Is there a way to allow for a certain number of mismatches (substitutions, insertions or deletions)? For example, I am looking for sequences with:

Up to 3 allowed substitutions, like "AGTGGGTAACCAA" etc.
Insertions/deletions (a partial match like "ACTGGGAAAATAAACTA" or "ACTAAACTA")

Answer 1

There used to be a tool called agrep for fuzzy regex matching, but it got abandoned.

http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.

https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.

Failing that, see if you can find tre-agrep for your distro.

Answer 2

You can use tre-agrep and specify the edit distance with the -E switch. For example, if you have a file foo:

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

You can match every line with an edit distance of up to 9 like this (-9 is the short form of -E 9, and -s prints the cost of each match):

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
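
The costs in that output can be checked by hand. What follows is a minimal Python sketch of the textbook Levenshtein (edit) distance, not tre-agrep's own implementation; the function name levenshtein is purely illustrative. It compares the pattern against each whole line of foo, which is roughly what the -w switch forces here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character substitutions,
    insertions and deletions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


pattern = "ACTGGGTAAACTA"
for line in ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]:
    print(f"{levenshtein(pattern, line)}:{line}")
```

The first line needs four insertions (AAAA after ACTGGG) and the second four deletions (GGGT), so this prints the same costs of 4, 4 and 0.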

Answer 3

Short answer: no.

Long answer: As @JDB said, regex is inherently precise. You can manually allow a mismatch, e.g. [ATGC] instead of A at a given position, but there is no way to allow only a small number of arbitrary mismatches. I suggest that you write your own code to parse it, or try to find a DNA parser somewhere.
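
If you do go the "write your own code" route, one possible starting point is the classic dynamic-programming approach to approximate substring matching (often attributed to Sellers). The sketch below is only an illustration under that assumption: the names min_errors and fuzzy_grep are made up for this example, and it has not been tuned for large files.

```python
from typing import Iterable, Iterator


def min_errors(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # col[i] = fewest edits needed to align pattern[:i] with some substring
    # of the text that ends at the current position.
    col = list(range(len(pattern) + 1))  # nothing read yet: i deletions
    best = col[-1]
    for ch in text:
        new = [0]  # a match may start anywhere in the text at no cost
        for i, pc in enumerate(pattern, start=1):
            new.append(min(
                col[i] + 1,               # extra character in the text
                new[i - 1] + 1,           # pattern character missing
                col[i - 1] + (pc != ch),  # substitution, or exact match
            ))
        col = new
        best = min(best, col[-1])
    return best


def fuzzy_grep(lines: Iterable[str], pattern: str, max_errors: int) -> Iterator[str]:
    """Yield the lines that contain pattern with at most max_errors edits."""
    for line in lines:
        line = line.rstrip("\n")
        if min_errors(pattern, line) <= max_errors:
            yield line


sample = ["ACTGGGTAAACTA", "AGTGGGTAACCAA", "ACTAAACTA", "GGGGGGGGGGGGG"]
for hit in fuzzy_grep(sample, "ACTGGGTAAACTA", max_errors=3):
    print(hit)
```

With max_errors=3 the first two sample lines are kept. ACTAAACTA is dropped because it is four deletions away from the pattern. To filter a real file, pass an open file object as lines.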

Answer 4

There's a Python library called fuzzysearch (which I wrote) that provides precisely the required functionality.

Here's some sample code that should work:

from fuzzysearch import find_near_matches

with open('path/to/file', 'r') as f:
    data = f.read()

# 1. search allowing up to 3 substitutions (and no insertions or deletions)
matches = find_near_matches("ACTGGGTAAACTA", data, max_substitutions=3,
                            max_insertions=0, max_deletions=0)

# 2. also allow insertions and deletions, i.e. allow an edit distance
#    a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)

Answer 5

Remember the GNU/Linux philosophy, specifically the concept of modularity, which lets us handle small but powerful pieces independently. We can combine a bunch of these small pieces to make magic. That is the beauty of GNU/Linux:

cat file | fzf --filter='ACTGGGTAAACTA'

Check out fzf :)

Answer 1
5 2015-05-20 18:03:48

Answer 2
2 2016-12-09 20:28:59

Answer 3
0 2015-05-20 17:16:58

Answer 4
0 2018-09-09 19:07:39

Answer 5
0 2023-01-20 00:31:37
