What is a good way to implement getting a consensus sequence in Java?
I have the following problem:
I know I could just make a huge if-cascade, but I guess that's not only ugly and hard to maintain, but also slow.
What is a fast, easy-to-maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used BioJava, but the version I am already using does not offer that functionality (or I haven't found it yet...).
Update: there seems to be a bit of confusion here. The consensus symbol is a single char that stands for a single char in both sequences.
String1 and String2 are, for example, "ACGT" and "ACCT"; they mismatch at position 2 (counting from 0). So I want the consensus string to be ACST, because S stands for "either C or G".
I want to make a method like this:
char getConsensus(char a, char b)
Update 2: some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could grow from "ACGT" to "ACGTRYKMSWBDHVN", which would make some of the proposed approaches quite unwieldy to write and maintain.
You can just use a HashMap<String, String> which maps the conflicts/differences to the consensus symbols. You can either hard-code it (fill it in your app's code) or fill it during the startup of your app from some outside source (a file, a database, etc.). Then you just use it whenever you have a difference.
String consensusSymbol = consensusMap.get(differenceString);
EDIT: To accommodate your API request ;]
Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
return consensusMap.get("" + a + b);
}
I realize this looks crude, but I think you get the point. This might be slightly slower than a lookup table, but it's also a lot easier to maintain.
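To make the map idea concrete, here is a minimal sketch with the map filled in code. The class name MapConsensus and the helper put are made up for illustration, and only the four plain bases are covered; the ambiguity symbols would need further entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class MapConsensus {
    // Keys are two-character pairs; values are the IUPAC consensus symbols.
    private static final Map<String, Character> CONSENSUS_MAP = new HashMap<>();

    static {
        // Assumption: only A, C, G and T occur in the input.
        put('A', 'C', 'M');
        put('A', 'G', 'R');
        put('A', 'T', 'W');
        put('C', 'G', 'S');
        put('C', 'T', 'Y');
        put('G', 'T', 'K');
        // Identical characters are their own consensus.
        for (char base : "ACGT".toCharArray()) {
            CONSENSUS_MAP.put("" + base + base, base);
        }
    }

    // Registers both orders of a pair so that lookups are symmetric.
    private static void put(char a, char b, char consensus) {
        CONSENSUS_MAP.put("" + a + b, consensus);
        CONSENSUS_MAP.put("" + b + a, consensus);
    }

    static char getConsensus(char a, char b) {
        Character result = CONSENSUS_MAP.get("" + a + b);
        if (result == null) {
            // Fail loudly instead of hitting a NullPointerException on unboxing.
            throw new IllegalArgumentException("No consensus for " + a + " and " + b);
        }
        return result;
    }
}
```

The explicit null check matters because returning a missing Character as a char would otherwise throw a NullPointerException during unboxing.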
YET ANOTHER EDIT:
If you really want something super fast and you actually use the char type, you can just create a 2D table and index it with characters (since they're interpreted as numbers).
char lookup[][] = new char[256][256]; // all "english" letters will be below 256
//... fill it... e. g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
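A slightly fuller sketch of the table idea follows. It assumes plain ASCII input, so 128 rows and columns are enough; the class name TableConsensus and the helper set are hypothetical, and again only the four plain bases are filled in.

```java
// Hypothetical helper class illustrating a 2D lookup table.
class TableConsensus {
    // Assumption: all symbols are ASCII, so 128 x 128 entries suffice.
    private static final char[][] LOOKUP = new char[128][128];

    static {
        set('A', 'C', 'M');
        set('A', 'G', 'R');
        set('A', 'T', 'W');
        set('C', 'G', 'S');
        set('C', 'T', 'Y');
        set('G', 'T', 'K');
        for (char base : "ACGT".toCharArray()) {
            LOOKUP[base][base] = base;
        }
    }

    // Fills both orders of a pair so that the table is symmetric.
    private static void set(char a, char b, char consensus) {
        LOOKUP[a][b] = consensus;
        LOOKUP[b][a] = consensus;
    }

    static char getConsensus(char a, char b) {
        // Unfilled cells hold the default char value 0.
        if (a >= 128 || b >= 128 || LOOKUP[a][b] == 0) {
            throw new IllegalArgumentException("No consensus for " + a + " and " + b);
        }
        return LOOKUP[a][b];
    }
}
```

The table costs a few dozen kilobytes, most of it unused, in exchange for a lookup that is two array accesses.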
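A simple, fast solution is to use bitwise-OR.
At startup, initialize two tables: one that maps each character to its bitwise representation, and one that maps each bitwise representation back to a character.
To get the consensus for a single position: look up both characters in the first table, bitwise-OR the two values, and look up the result in the second table.
Here's a simple bitwise representation to get you started: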
private static final int A = 1 << 3;
private static final int C = 1 << 2;
private static final int G = 1 << 1;
private static final int T = 1 << 0;
Set the members of the first table like this:
characterToBitwiseTable[ 'd' ] = A | G | T;
characterToBitwiseTable[ 'D' ] = A | G | T;
Set the members of the second table like this:
bitwiseToCharacterTable[ A | G | T ] = 'd';
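Putting the pieces together, the bitwise approach might look like the sketch below. The class name BitwiseConsensus and the helpers register and bitsOf are invented for this example, and the second table stores uppercase symbols rather than the lowercase 'd' used above; both choices are assumptions, not part of the original answer.

```java
// Hypothetical helper class illustrating the two-table bitwise-OR lookup.
class BitwiseConsensus {
    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

    // Sparse table: ASCII character -> bitmask (0 means "unknown symbol").
    private static final int[] CHAR_TO_BITS = new int[128];
    // Dense table: bitmask (1..15) -> IUPAC symbol.
    private static final char[] BITS_TO_CHAR = new char[16];

    static {
        register('A', A); register('C', C); register('G', G); register('T', T);
        register('R', A | G); register('Y', C | T); register('K', G | T);
        register('M', A | C); register('S', C | G); register('W', A | T);
        register('B', C | G | T); register('D', A | G | T);
        register('H', A | C | T); register('V', A | C | G);
        register('N', A | C | G | T);
    }

    // Accepts both cases on input; always emits the uppercase symbol.
    private static void register(char symbol, int bits) {
        CHAR_TO_BITS[symbol] = bits;
        CHAR_TO_BITS[Character.toLowerCase(symbol)] = bits;
        BITS_TO_CHAR[bits] = symbol;
    }

    private static int bitsOf(char c) {
        int bits = c < CHAR_TO_BITS.length ? CHAR_TO_BITS[c] : 0;
        if (bits == 0) {
            throw new IllegalArgumentException("Unknown symbol: " + c);
        }
        return bits;
    }

    static char getConsensus(char a, char b) {
        return BITS_TO_CHAR[bitsOf(a) | bitsOf(b)];
    }

    // Position-by-position consensus of two equally long sequences.
    static String consensus(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("Sequences differ in length");
        }
        StringBuilder sb = new StringBuilder(s1.length());
        for (int i = 0; i < s1.length(); i++) {
            sb.append(getConsensus(s1.charAt(i), s2.charAt(i)));
        }
        return sb.toString();
    }
}
```

Because the ambiguity symbols are registered too, repeated "consensification" works without extra entries: getConsensus('R', 'Y') ORs A|G with C|T and yields 'N'.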
There are only around 20 possible combinations, so there is no real performance issue. If you do not want to write a big if-else block, a fast alternative is to build a tree data structure: http://en.wikipedia.org/wiki/Tree_data_structure
In a tree, you put all the possible combinations; you then input the string, and it traverses the tree to find the longest matching sequence for a symbol.
Do you want an illustrated example?
PS: Many artificial intelligence programs use a tree approach because it is fast and adaptable.
Given that they are all unique symbols, I'd go for an enum:
public enum ConsensusSymbol
{
A("A"), // simple case
// ....
GTUC("B"),
// etc
// last entry:
AGCTU("N");
// Not sure what X means?
private final String symbol;
ConsensusSymbol(final String symbol)
{
this.symbol = symbol;
}
public String getSymbol()
{
return symbol;
}
}
Then, when you encounter a difference, use .valueOf():
final ConsensusSymbol symbol;
try {
symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
// TODO
}
For instance, if you encounter GTUC as a String, ConsensusSymbol.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".
Considering reading multiple sequences at once, I would:
There are probably ways to optimize the second and the first steps.
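The individual steps of this answer did not survive, so the following is only one possible reading of "multiple sequences at once", not the author's method: collect a bitmask per column across all sequences, then translate each mask into a symbol. The class name MultiConsensus and the SYMBOLS string are assumptions made for this sketch, which uses the mask values A=8, C=4, G=2, T=1.

```java
import java.util.List;

// Hypothetical helper class: consensus over any number of aligned sequences.
class MultiConsensus {
    // The character at index i is the IUPAC symbol for bitmask i
    // (A=8, C=4, G=2, T=1). Index 0 is a placeholder and never emitted.
    private static final String SYMBOLS = "-TGKCYSBAWRDMHVN";

    static String consensus(List<String> sequences) {
        if (sequences.isEmpty()) {
            throw new IllegalArgumentException("No sequences given");
        }
        int length = sequences.get(0).length();
        int[] masks = new int[length];
        for (String seq : sequences) {
            if (seq.length() != length) {
                throw new IllegalArgumentException("Sequences differ in length");
            }
            for (int i = 0; i < length; i++) {
                // The index of a symbol in SYMBOLS is its bitmask.
                int mask = SYMBOLS.indexOf(Character.toUpperCase(seq.charAt(i)));
                if (mask <= 0) {
                    throw new IllegalArgumentException("Unknown symbol: " + seq.charAt(i));
                }
                masks[i] |= mask;
            }
        }
        StringBuilder sb = new StringBuilder(length);
        for (int mask : masks) {
            sb.append(SYMBOLS.charAt(mask));
        }
        return sb.toString();
    }
}
```

Since OR is associative, this gives the same result as applying a pairwise getConsensus repeatedly, but each sequence is read only once.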
A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com:
enum lut {
AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'),
BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');
char consensusChar = 'X';
lut(char c) {
consensusChar = c;
}
char getConsensusChar() {
return consensusChar;
}
}
char getConsensus(char a, char b) {
return lut.valueOf("" + a + b).getConsensusChar();
}