
What is a good way to implement getting a consensus sequence in Java?

I have the following problem:

  • I have two Strings of DNA sequences (consisting of ACGT), which differ in one or two spots.
  • Finding the differences is trivial, so let's just ignore that.
  • For each difference, I want to get the consensus symbol (e.g. M for A or C) that represents both possibilities.

I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.

What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).

Update: There seems to be a bit of confusion here. The consensus symbol is a single char that stands for the chars found at one position in both sequences.

String1 and String2 are, for example, "ACGT" and "ACCT"; they mismatch at position 2 (counting from 0). So I want the consensus string to be ACST, because S stands for "either C or G".

I want to make a method like this:

char getConsensus(char a, char b)

Update 2: Some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could grow from "ACGT" to "ACGTRYKMSWBDHVN", which would make some of the proposed approaches quite unwieldy to write and maintain.

You can just use a HashMap<String, String> that maps the conflicts/differences to the consensus symbols. You can either hard-code the entries in your app or fill the map at startup from some outside source (a file, a database, etc.). Then you just look up the symbol whenever you have a difference.

String consensusSymbol = consensusMap.get(differenceString);

EDIT: To accommodate your API request ;]

Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
    return consensusMap.get("" + a + b);
}

I realize this looks crude, but I think you get the point. This might be slightly slower than an array-based lookup table, but it's also a lot easier to maintain.
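To make the map idea concrete, here is a minimal sketch that hard-codes only the six pairs of distinct plain bases. The class name ConsensusMapSketch, the field CONSENSUS_MAP and the sorted-key convention are assumptions made for illustration; the answer above does not say how the map is filled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class and field names are illustrative, not from the answer.
final class ConsensusMapSketch {
    // Keys are two-character strings in sorted order, so "AC" and "CA" share one entry.
    private static final Map<String, Character> CONSENSUS_MAP = new HashMap<>();

    static {
        // Only the six unordered pairs of distinct plain bases are filled here.
        CONSENSUS_MAP.put("AC", 'M');
        CONSENSUS_MAP.put("AG", 'R');
        CONSENSUS_MAP.put("AT", 'W');
        CONSENSUS_MAP.put("CG", 'S');
        CONSENSUS_MAP.put("CT", 'Y');
        CONSENSUS_MAP.put("GT", 'K');
    }

    static char getConsensus(char a, char b) {
        if (a == b) {
            return a; // identical symbols need no lookup
        }
        // Normalize the key order so each pair is stored only once.
        String key = a < b ? "" + a + b : "" + b + a;
        Character symbol = CONSENSUS_MAP.get(key);
        if (symbol == null) {
            // Avoids a NullPointerException from unboxing a missing entry.
            throw new IllegalArgumentException("No consensus symbol for " + key);
        }
        return symbol;
    }

    public static void main(String[] args) {
        System.out.println(getConsensus('G', 'C')); // prints S
    }
}
```

Sorting the two characters before the lookup halves the number of entries to maintain. Extending the input alphabet to the ambiguity codes would still mean adding every pair by hand.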

YET ANOTHER EDIT:

If you really want something super fast and you actually use the char type, you can just create a 2D table and index it with characters (since they're interpreted as numbers).

char[][] lookup = new char[256][256]; // all "English" letters are below 256
// ... fill it, e.g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
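One way the table might be filled is sketched below. The helper put, which writes both orderings of a pair, and the class name LookupTableSketch are assumptions for illustration. Only the plain bases are covered, and unset cells keep the default value '\0'.

```java
// Hypothetical sketch: names are illustrative, not from the answer.
final class LookupTableSketch {
    // Indexed directly by char value; any char of 256 or above would be out of bounds.
    private static final char[][] LOOKUP = new char[256][256];

    static {
        // A base paired with itself is its own consensus.
        for (char base : "ACGT".toCharArray()) {
            LOOKUP[base][base] = base;
        }
        put('A', 'C', 'M');
        put('A', 'G', 'R');
        put('A', 'T', 'W');
        put('C', 'G', 'S');
        put('C', 'T', 'Y');
        put('G', 'T', 'K');
    }

    // Stores the symbol for both orderings so lookups are symmetric.
    private static void put(char a, char b, char symbol) {
        LOOKUP[a][b] = symbol;
        LOOKUP[b][a] = symbol;
    }

    static char getConsensus(char a, char b) {
        return LOOKUP[a][b]; // '\0' means the pair was never filled in
    }
}
```

The table takes 256 × 256 chars (128 KiB) for a handful of useful cells, which is the price of skipping any index translation.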

A simple, fast solution is to use bitwise-OR.

At startup, initialize two tables:

  • A sparse 128-element table to map a nucleotide code to its bitwise representation (one bit per base, so an ambiguity code sets several bits). 'Sparse' means you only have to set the members that you'll use: the IUPAC codes in upper and lowercase.
  • A 16-element table to map a bitwise consensus to an IUPAC nucleotide code.

To get the consensus for a single position:

  1. Use the nucleotides as indices in the first table, to get the bitwise representations.
  2. Bitwise-OR the bitwise representations.
  3. Use the bitwise-OR as an index into the 16-element table.

Here's a simple bitwise representation to get you started:

    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0; 

Set the members of the first table like this:

    characterToBitwiseTable[ 'd' ] = A | G | T;
    characterToBitwiseTable[ 'D' ] = A | G | T;

Set the members of the second table like this:

    bitwiseToCharacterTable[ A | G | T ] = 'd';
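Put together, the two tables and the three lookup steps might look like the sketch below. The class name, the register helper and the consensus method for whole strings are assumptions added for illustration. The sketch always returns uppercase codes, which is a choice made here and not something the answer prescribes.

```java
// Hypothetical sketch of the bitwise-OR approach; helper names are illustrative.
final class BitwiseConsensusSketch {
    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

    // Sparse table: only IUPAC codes are set, every other entry stays 0.
    private static final int[] CHARACTER_TO_BITWISE = new int[128];
    // Index 0 is unused because every valid code has at least one bit set.
    private static final char[] BITWISE_TO_CHARACTER = new char[16];

    static {
        register('A', A); register('C', C); register('G', G); register('T', T);
        register('R', A | G); register('Y', C | T); register('K', G | T);
        register('M', A | C); register('S', C | G); register('W', A | T);
        register('B', C | G | T); register('D', A | G | T);
        register('H', A | C | T); register('V', A | C | G);
        register('N', A | C | G | T);
    }

    // Fills both tables for one code, accepting upper and lowercase input.
    private static void register(char code, int bits) {
        CHARACTER_TO_BITWISE[code] = bits;
        CHARACTER_TO_BITWISE[Character.toLowerCase(code)] = bits;
        BITWISE_TO_CHARACTER[bits] = code;
    }

    private static int toBits(char c) {
        int bits = c < 128 ? CHARACTER_TO_BITWISE[c] : 0;
        if (bits == 0) {
            throw new IllegalArgumentException("Not an IUPAC nucleotide code: " + c);
        }
        return bits;
    }

    static char getConsensus(char a, char b) {
        // Steps 1 to 3: map to bits, OR them, map the result back to a code.
        return BITWISE_TO_CHARACTER[toBits(a) | toBits(b)];
    }

    static String consensus(String first, String second) {
        if (first.length() != second.length()) {
            throw new IllegalArgumentException("Sequences differ in length");
        }
        StringBuilder result = new StringBuilder(first.length());
        for (int i = 0; i < first.length(); i++) {
            result.append(getConsensus(first.charAt(i), second.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(consensus("ACGT", "ACCT")); // prints ACST
    }
}
```

Because the OR of two ambiguity codes is again a valid index, the same method handles repeated rounds of consensus over the full "ACGTRYKMSWBDHVN" alphabet without any extra table entries.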

There are only around 20 possible combinations, so there is no real performance issue. If you do not wish to write a big if-else block, the fastest solution would be to build a tree data structure: http://en.wikipedia.org/wiki/Tree_data_structure . This is the fastest way to do what you want to do.

You put all the possible combinations into the tree; you then feed it the input string, and it traverses the tree to find the longest matching sequence for a symbol.

Do you want an illustrated example?

PS: All artificial intelligence software uses the tree approach, which is the fastest and the best adapted.

Given that they are all unique symbols, I'd go for an Enum :

public enum ConsensusSymbol
{
    A("A"), // simple case
    // ....
    GTUC("B"),
    // etc
    // last entry:
    AGCTU("N");

    // Not sure what X means?

    private final String symbol;

    ConsensusSymbol(final String symbol)
    {
        this.symbol = symbol;
    }

    public String getSymbol()
    {
        return symbol;
    }
}

Then, when you encounter a difference, use .valueOf() :

final ConsensusSymbol symbol;

try {
    symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
    // TODO
}

For instance, if you encounter GTUC as a String, ConsensusSymbol.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".

Considering the case of reading multiple sequences at once, I would:

  1. put all characters from the same position in the sequences into a set
  2. sort and concatenate the values in the set and pass the result to the enum's valueOf(), as in fge's example
  3. use the resulting enum value as a key into an EnumMap that has the consensus symbols as values

There are probably ways to optimize the first and second steps.
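The three steps could be sketched as follows. This is an assumption-laden illustration: the enum BaseSet and the method consensusAt are hypothetical names, the input is assumed to contain only the plain bases A, C, G and T, and the symbol is stored in the enum itself, which replaces the separate EnumMap of step 3.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names are illustrative, not from the answers.
final class ColumnConsensusSketch {
    // Each constant is named after the sorted set of bases it stands for.
    enum BaseSet {
        A('A'), C('C'), G('G'), T('T'),
        AC('M'), AG('R'), AT('W'), CG('S'), CT('Y'), GT('K'),
        ACG('V'), ACT('H'), AGT('D'), CGT('B'), ACGT('N');

        final char symbol;

        BaseSet(char symbol) {
            this.symbol = symbol;
        }
    }

    static char consensusAt(List<String> sequences, int position) {
        // Step 1: a TreeSet removes duplicates and keeps the bases sorted.
        Set<Character> bases = new TreeSet<>();
        for (String sequence : sequences) {
            bases.add(sequence.charAt(position));
        }
        // Step 2: concatenate the sorted bases into the enum constant's name.
        StringBuilder key = new StringBuilder();
        for (char base : bases) {
            key.append(base);
        }
        // Step 3: valueOf throws IllegalArgumentException for an unknown key.
        return BaseSet.valueOf(key.toString()).symbol;
    }
}
```

With the sequences "ACGT", "ACCT" and "ACTT", position 2 holds G, C and T, so the key is "CGT" and the result is B.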

A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com :

enum lut {
     AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
     CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
     GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
     TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
     RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
     YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
     KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
     MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
     SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
     WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'), 
     BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
     DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
     HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
     VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
     NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');

     private final char consensusChar;

     lut(char c) {
         consensusChar = c;
     }

     char getConsensusChar() {
         return consensusChar;
     }
}

char getConsensus(char a, char b) {
    return lut.valueOf("" + a + b).getConsensusChar();
}
