What is a good way to implement getting a consensus sequence in Java?

Question

I have the following problem:

I have 2 Strings of DNA sequences (consisting of ACGT), which differ in one or two spots.
Finding the differences is trivial, so let's just ignore that.
For each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities.

I know I could just make a huge if-cascade, but I guess that's not only ugly and hard to maintain, but also slow.

What is a fast, easy-to-maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the version I am already using does not offer that functionality (or I haven't found it yet...).

Update: there seems to be a bit of confusion here. The consensus symbol is a single char that stands for a single char in both sequences.

String1 and String2 are, for example, "ACGT" and "ACCT"; they mismatch at position 2 (counting from 0). So I want the consensus string to be ACST, because S stands for "either C or G".

I want to make a method like this:

char getConsensus(char a, char b)

Update 2: some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could grow from "ACGT" to "ACGTRYKMSWBDHVN", which would make some of the proposed approaches quite unwieldy to write and maintain.

Answer 1

You can just use a HashMap<String, String> which maps the conflicts/differences to the consensus symbols. You can either hard-code it (fill it in the code of your app) or fill it during the startup of your app from some outside source (a file, a database, etc.). Then you just use it whenever you have a difference.

String consensusSymbol = consensusMap.get(differenceString);

EDIT: To accommodate your API request ;]

Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
    return consensusMap.get("" + a + b);
}

I realize this looks crude, but I think you get the point. This might be slightly slower than a lookup table, but it's also a lot easier to maintain.
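To make the "let's assume this is filled somewhere" part concrete, here is a minimal sketch of one way to fill such a map. The class name `ConsensusMapExample` and the helper `put` are hypothetical, and only the plain ACGT pairs are registered; the wider alphabet from Update 2 would need more entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names and table contents are illustrative only.
final class ConsensusMapExample {
    private static final Map<String, Character> CONSENSUS_MAP = new HashMap<>();

    static {
        // Two-base IUPAC ambiguity codes; each pair is registered in both orders.
        put('A', 'C', 'M');
        put('A', 'G', 'R');
        put('A', 'T', 'W');
        put('C', 'G', 'S');
        put('C', 'T', 'Y');
        put('G', 'T', 'K');
        // Identical bases map to themselves.
        for (char base : "ACGT".toCharArray()) {
            put(base, base, base);
        }
    }

    private static void put(char a, char b, char consensus) {
        CONSENSUS_MAP.put("" + a + b, consensus);
        CONSENSUS_MAP.put("" + b + a, consensus);
    }

    static char getConsensus(char a, char b) {
        Character consensus = CONSENSUS_MAP.get("" + a + b);
        if (consensus == null) {
            // Fail with a clear message instead of a NullPointerException on unboxing.
            throw new IllegalArgumentException("No consensus for " + a + " and " + b);
        }
        return consensus;
    }
}
```

The explicit null check is a design choice: returning `consensusMap.get(...)` directly as a `char` would throw a NullPointerException for any pair that is missing from the map.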

YET ANOTHER EDIT:

If you really want something super fast and you actually use the char type, you can just create a 2D table and index it with characters (since they're interpreted as numbers).

char lookup[][] = new char[256][256]; // all "english" letters will be below 256
//... fill it... e. g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
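As a rough sketch of how the table could be used on whole sequences, the following fills it for the plain ACGT pairs and walks two equal-length strings. The class name `LookupTableExample` and the methods `set` and `consensus` are assumptions made for illustration, not part of the answer above.

```java
// Hypothetical helper class; names and table contents are illustrative only.
final class LookupTableExample {
    // 256 x 256 as in the snippet above; entries that are never set stay '\0'.
    private static final char[][] LOOKUP = new char[256][256];

    static {
        set('A', 'C', 'M');
        set('A', 'G', 'R');
        set('A', 'T', 'W');
        set('C', 'G', 'S');
        set('C', 'T', 'Y');
        set('G', 'T', 'K');
        for (char base : "ACGT".toCharArray()) {
            LOOKUP[base][base] = base;
        }
    }

    private static void set(char a, char b, char consensus) {
        // The consensus does not depend on the order of the two bases.
        LOOKUP[a][b] = consensus;
        LOOKUP[b][a] = consensus;
    }

    static String consensus(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("Sequences must have equal length");
        }
        StringBuilder result = new StringBuilder(s1.length());
        for (int i = 0; i < s1.length(); i++) {
            char a = s1.charAt(i);
            char b = s2.charAt(i);
            // Guard the index range, then treat an unset entry as an unknown pair.
            char c = (a < 256 && b < 256) ? LOOKUP[a][b] : '\0';
            if (c == '\0') {
                throw new IllegalArgumentException("No consensus for " + a + " and " + b);
            }
            result.append(c);
        }
        return result.toString();
    }
}
```

With the example from the question, `LookupTableExample.consensus("ACGT", "ACCT")` yields `"ACST"`.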

Answer 2

A simple, fast solution is to use bitwise-OR.

At startup, initialize two tables:

A sparse 128-element table to map a nucleotide code to its bitwise representation (a single bit for each of A, C, G and T). "Sparse" means you only have to set the members that you'll use: the IUPAC codes in upper and lowercase.
A 16-element table to map a bitwise consensus to an IUPAC nucleotide code.

To get the consensus for a single position:

Use the nucleotides as indices into the first table to get the bitwise representations.
Bitwise-OR the two bitwise representations.
Use the result of the bitwise-OR as an index into the 16-element table.

Here's a simple bitwise representation to get you started:

    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

Set the members of the first table like this:

    characterToBitwiseTable[ 'd' ] = A | G | T;
    characterToBitwiseTable[ 'D' ] = A | G | T;

Set the members of the second table like this:

    bitwiseToCharacterTable[ A | G | T ] = 'd';
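Putting the two tables and the three lookup steps together, a minimal sketch could look like the following. The class name `BitwiseConsensus` and the helpers `register` and `toBits` are hypothetical; unlike the line above, this sketch stores uppercase codes in the second table, which is an assumption about the desired output.

```java
// Hypothetical helper class; a sketch of the bitwise-OR approach, not a library API.
final class BitwiseConsensus {
    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

    // First table: character (ASCII) -> bitmask; unused entries stay 0.
    private static final int[] CHARACTER_TO_BITWISE = new int[128];
    // Second table: bitmask (1..15) -> IUPAC code; index 0 is unused.
    private static final char[] BITWISE_TO_CHARACTER = new char[16];

    static {
        register('A', A);
        register('C', C);
        register('G', G);
        register('T', T);
        register('M', A | C);
        register('R', A | G);
        register('W', A | T);
        register('S', C | G);
        register('Y', C | T);
        register('K', G | T);
        register('V', A | C | G);
        register('H', A | C | T);
        register('D', A | G | T);
        register('B', C | G | T);
        register('N', A | C | G | T);
    }

    private static void register(char code, int bits) {
        CHARACTER_TO_BITWISE[code] = bits;
        CHARACTER_TO_BITWISE[Character.toLowerCase(code)] = bits;
        BITWISE_TO_CHARACTER[bits] = code; // uppercase output is an assumption
    }

    private static int toBits(char c) {
        int bits = c < CHARACTER_TO_BITWISE.length ? CHARACTER_TO_BITWISE[c] : 0;
        if (bits == 0) {
            throw new IllegalArgumentException("Not an IUPAC nucleotide code: " + c);
        }
        return bits;
    }

    static char getConsensus(char a, char b) {
        // OR-ing the masks yields the union of the bases both symbols allow.
        return BITWISE_TO_CHARACTER[toBits(a) | toBits(b)];
    }
}
```

Because ambiguity codes are themselves valid inputs, repeated "consensification" works without extra table entries: `getConsensus('M', 'G')` combines A|C with G and returns `'V'`.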

Answer 3

There are only around 20 possible combinations, so there is no real performance issue. If you do not wish to write a big if-else block, the fastest solution would be to build a tree data structure (http://en.wikipedia.org/wiki/Tree_data_structure).

In the tree, you put all the possible combinations. You then input the string, and the tree is traversed to find the longest matching sequence for a symbol.

Do you want an illustrated example?

PS: All artificial intelligence software uses the tree approach, which is the fastest and the most adapted.

Answer 4

Given that they are all unique symbols, I'd go for an enum:

public enum ConsensusSymbol
{
    A("A"), // simple case
    // ....
    GTUC("B"),
    // etc
    // last entry:
    AGCTU("N");

    // Not sure what X means?

    private final String symbol;

    ConsensusSymbol(final String symbol)
    {
        this.symbol = symbol;
    }

    public String getSymbol()
    {
        return symbol;
    }
}

Then, when you encounter a difference, use .valueOf():

final ConsensusSymbol symbol;

try {
    symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
    // TODO
}

For instance, if you encounter GTUC as a String, ConsensusSymbol.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".

Answer 5

Considering reading multiple sequences at once, I would:

put all characters from the same position in the sequences into a set
sort and concatenate the values in the set and use enum.valueOf() as in fge's example
use the acquired value as a key into an EnumMap that has the consensus symbols as values

There are probably ways to optimize the first and second steps.
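The steps above can be sketched roughly as follows, assuming the inputs contain only the uppercase bases A, C, G and T. The class `ColumnConsensus` and the enum `BaseSet` are hypothetical names; for brevity the consensus symbol is stored as a field on the enum constant (as in fge's example) rather than in a separate EnumMap.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; a sketch of the set-based approach for plain ACGT input.
final class ColumnConsensus {
    // Constant names are the sorted bases of a column; the field is the IUPAC code.
    enum BaseSet {
        A('A'), C('C'), G('G'), T('T'),
        AC('M'), AG('R'), AT('W'), CG('S'), CT('Y'), GT('K'),
        ACG('V'), ACT('H'), AGT('D'), CGT('B'), ACGT('N');

        final char symbol;

        BaseSet(char symbol) {
            this.symbol = symbol;
        }
    }

    static String consensus(String... sequences) {
        if (sequences.length == 0) {
            throw new IllegalArgumentException("At least one sequence is required");
        }
        int length = sequences[0].length();
        for (String sequence : sequences) {
            if (sequence.length() != length) {
                throw new IllegalArgumentException("Sequences must have equal length");
            }
        }
        StringBuilder result = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Step 1: collect the characters of this column; TreeSet keeps them sorted.
            Set<Character> column = new TreeSet<>();
            for (String sequence : sequences) {
                column.add(sequence.charAt(i));
            }
            // Step 2: concatenate the sorted characters into the enum constant name.
            StringBuilder key = new StringBuilder();
            for (char base : column) {
                key.append(base);
            }
            // Step 3: look up the symbol; valueOf throws IllegalArgumentException
            // for anything that is not a combination of A, C, G and T.
            result.append(BaseSet.valueOf(key.toString()).symbol);
        }
        return result.toString();
    }
}
```

For example, `ColumnConsensus.consensus("ACGT", "ACCT", "ACAT")` collects A, C and G in the third column and returns `"ACVT"`.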

Answer 6

A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com:

enum lut {
     AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
     CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
     GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
     TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
     RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
     YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
     KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
     MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
     SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
     WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'), 
     BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
     DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
     HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
     VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
     NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');

     char consensusChar = 'X';

     lut(char c) {
         consensusChar = c;
     }

     char getConsensusChar() {
         return consensusChar;
     }
}

char getConsensus(char a, char b) {
    return lut.valueOf("" + a + b).getConsensusChar();
}

6 answers

Answer 1
2 2011-12-21 13:30:49

Answer 2
2 2011-12-21 15:50:12

Answer 3
0 2011-12-21 13:15:49

Answer 4
0 2011-12-21 13:16:42

Answer 5
0 2011-12-21 14:31:38

Answer 6
0 Accepted 2011-12-22 08:03:47
