Is there a generalization of std::bitset for two-bit values?
Suppose I am a genome scientist trying to store extremely long strings of characters, each of which represents two bits of information (i.e. each element is either G, A, T, or C). Because the strings are incredibly long, I need to be able to store a string of length N in precisely 2N bits (or rather, N/4 bytes).
With that motivation in mind, I am looking for a generalization of std::bitset (or boost::dynamic_bitset<>) that works on two-bit values instead of single-bit values. I want to store N such two-bit values, each of which can be 0, 1, 2, or 3. I need the data packed as closely as possible in memory, so vector<char> will not work (as it wastes a factor of 4 of memory).
What is the best way to achieve my goal? One option is to wrap the existing bitset templates with customized operator[], iterators, etc., but I'd prefer to use an existing library if at all possible.
std::bitset<> is fixed length and you probably do not want that.
I think you should go ahead and wrap std::vector<bool>.
Note that std::vector<bool> is optimised for space and has the added benefit that it is dynamic in size. Presumably you need to read a genome of arbitrary length from somewhere.
Have a think about whether you need much of an API to access it; you might only need a couple of methods.
@Jefffrey's answer already covers the relevant code, albeit for bitset<>.
[I am not familiar with boost::dynamic_bitset<> and what it might give over vector.]
One further thought is whether it might be convenient for you to work with quads of letters, a quad nicely filling a char in space.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

class Genome
{
public:
    enum class Letter {A,C,G,T};

    Genome(const std::string& source)
    {
        // Two bits per letter.
        code_.resize(source.size() * 2);
        for (std::size_t index = 0; index != source.size(); ++index)
        {
            char text = source[index];
            Letter letter = textToLetter(text);
            set(index, letter);
        }
    }

    static Letter textToLetter(char text)
    {
        // Or search through the array `letterText`.
        // Or come up with a neat but unintelligible one liner ...
        Letter letter = Letter::A;
        switch (text)
        {
        case 'A':
            letter = Letter::A;
            break;
        case 'C':
            letter = Letter::C;
            break;
        case 'G':
            letter = Letter::G;
            break;
        case 'T':
            letter = Letter::T;
            break;
        default:
            // Invalid - handle error.
            break;
        }
        return letter;
    }

    static char letterToText(Letter l)
    {
        return letterText[static_cast<unsigned>(l)];
    }

    // Add bounds checking
    Letter get(std::size_t index) const
    {
        // std::size_t rather than unsigned, so that index * 2 does not
        // overflow for very long genomes.
        std::size_t distance = index * 2;
        unsigned numeric = code_[distance] + code_[distance + 1] * 2;
        return Letter(numeric);
    }

    // Add bounds checking
    void set(std::size_t index, Letter value)
    {
        std::size_t distance = index * 2;
        bool low = (static_cast<unsigned>(value) & 1) != 0;
        bool high = (static_cast<unsigned>(value) & 2) != 0;
        code_[distance] = low;
        code_[distance + 1] = high;
    }

    std::size_t size() const
    {
        return code_.size() / 2;
    }

    // Extend by numLetters, initially set to 'A'
    void extend(std::size_t numLetters)
    {
        code_.resize(code_.size() + numLetters * 2);
    }

private:
    static char letterText[4];
    std::vector<bool> code_;
};

char Genome::letterText [4] = { 'A', 'C', 'G', 'T' };

int main()
{
    Genome g("GATT");
    g.extend(3);
    g.set(5, Genome::Letter::C);
    for (std::size_t i = 0; i != g.size(); ++i)
        std::cout << Genome::letterToText(g.get(i));
    std::cout << std::endl;
    return 0;
}
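The "quads of letters" idea mentioned above could be sketched roughly as follows: four 2-bit codes per byte in a std::vector<std::uint8_t>, rather than two entries per letter in a std::vector<bool>. This is only a minimal sketch; the class name PackedQuads and its members are made up for illustration, codes are plain integers 0..3, and there is no bounds checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not from any library): four 2-bit codes per byte.
class PackedQuads {
public:
    // n codes, all initially 0; the last byte may be only partly used.
    explicit PackedQuads(std::size_t n) : size_(n), bytes_((n + 3) / 4, 0) {}

    // Read the 2-bit code (0..3) at position i. No bounds checking.
    unsigned get(std::size_t i) const {
        const unsigned shift = static_cast<unsigned>(i % 4) * 2;
        return (bytes_[i / 4] >> shift) & 3u;
    }

    // Store the low two bits of code at position i. No bounds checking.
    void set(std::size_t i, unsigned code) {
        const unsigned shift = static_cast<unsigned>(i % 4) * 2;
        std::uint8_t& b = bytes_[i / 4];
        // Clear the two target bits, then OR in the new code.
        b = static_cast<std::uint8_t>((b & ~(3u << shift)) | ((code & 3u) << shift));
    }

    std::size_t size() const { return size_; }          // number of codes
    std::size_t bytes() const { return bytes_.size(); } // bytes of payload

private:
    std::size_t size_;
    std::vector<std::uint8_t> bytes_;
};
```

With this layout, get and set touch exactly one byte each, and N letters take ceil(N/4) bytes of payload, which is the 2N-bit figure from the question rounded up to a whole byte.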
Given:
enum class nucleobase { a, c, g, t };
You have two choices. You can:
1. use std::bitset and play with indexing
2. use std::bitset in combination with another container
For the first, you can just define a couple of functions that target the correct number of bits per set/get:
#include <bitset>
#include <cstddef>

// Nucleobase i occupies bits 2i (high bit) and 2i + 1 (low bit).
template<std::size_t N>
void set(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    switch (x) {
        case nucleobase::a: bits.set(i * 2, 0); bits.set(i * 2 + 1, 0); break;
        case nucleobase::c: bits.set(i * 2, 0); bits.set(i * 2 + 1, 1); break;
        case nucleobase::g: bits.set(i * 2, 1); bits.set(i * 2 + 1, 0); break;
        case nucleobase::t: bits.set(i * 2, 1); bits.set(i * 2 + 1, 1); break;
    }
}

template<std::size_t N>
nucleobase get(const std::bitset<N>& bits, std::size_t i) {
    // Braces added to avoid the dangling-else ambiguity.
    if (!bits[i * 2]) {
        if (!bits[i * 2 + 1]) return nucleobase::a;
        else return nucleobase::c;
    } else {
        if (!bits[i * 2 + 1]) return nucleobase::g;
        else return nucleobase::t;
    }
}
The above is just an example and a terrible one (it's almost 4AM here and I really need to sleep).
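A somewhat shorter variant of the same two functions computes the bits arithmetically from the enum value instead of switching on it. This is only a sketch: the names set_base and get_base are made up here to avoid clashing with the functions above, and it assumes the enumerators keep their default values a = 0 through t = 3 and the same bit layout (bit 2i is the high bit, bit 2i + 1 the low bit).

```cpp
#include <bitset>
#include <cstddef>

enum class nucleobase { a, c, g, t };

// Hypothetical helper: store x in the i-th two-bit slot.
// std::bitset::set throws std::out_of_range if the position is invalid.
template<std::size_t N>
void set_base(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    const unsigned v = static_cast<unsigned>(x);
    bits.set(i * 2, (v & 2u) != 0);      // high bit
    bits.set(i * 2 + 1, (v & 1u) != 0);  // low bit
}

// Hypothetical helper: read the i-th two-bit slot back (no bounds checking).
template<std::size_t N>
nucleobase get_base(const std::bitset<N>& bits, std::size_t i) {
    const unsigned v = (bits[i * 2] ? 2u : 0u) | (bits[i * 2 + 1] ? 1u : 0u);
    return static_cast<nucleobase>(v);
}
```

For example, a std::bitset<8> holds four nucleobases, and set_base(b, 0, nucleobase::g) sets bit 0 and clears bit 1.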
For the second you just need to map nucleobases and bits:
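#include <bitset>

// Assumed definition: the original snippet used bit_pair without defining it.
using bit_pair = std::bitset<2>;

bit_pair bits_for(nucleobase x) {
    // In the string constructor the leftmost character is the most
    // significant bit, so "01" is 1 and "10" is 2, matching nucleobase_for.
    switch (x) {
        case nucleobase::a: return bit_pair("00");
        case nucleobase::c: return bit_pair("01");
        case nucleobase::g: return bit_pair("10");
        case nucleobase::t: return bit_pair("11");
    }
    return bit_pair("00"); // not reached for valid values; just for the warning
}

nucleobase nucleobase_for(bit_pair x) {
    switch (x.to_ulong()) {
        case 0: return nucleobase::a;
        case 1: return nucleobase::c;
        case 2: return nucleobase::g;
        case 3: return nucleobase::t;
        default: return nucleobase::a; // just for the warning
    }
}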
Of course if you need runtime length you can just use boost::dynamic_bitset and std::vector.
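One thing worth checking before storing one std::bitset<2> per element in a container: sizeof(std::bitset<2>) is implementation-defined, and as far as I know it is typically a whole machine word (4 or 8 bytes) on mainstream standard libraries, so a std::vector<std::bitset<2>> would be far from the 2N-bit goal. The two helper functions below are made up for illustration and simply compare the payload sizes; they ignore the container's own bookkeeping.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical helper: bytes used by n nucleobases when each one is kept in
// its own std::bitset<2> (for example inside a std::vector<std::bitset<2>>).
constexpr std::size_t bytes_one_bitset_each(std::size_t n) {
    return n * sizeof(std::bitset<2>);
}

// Hypothetical helper: bytes used by n nucleobases packed two bits each,
// rounded up to a whole byte.
constexpr std::size_t bytes_packed(std::size_t n) {
    return (2 * n + 7) / 8;
}
```

Printing bytes_one_bitset_each(n) next to bytes_packed(n) on your own platform shows how large the gap is there; a single packed bitset (std::vector<bool> or boost::dynamic_bitset) with 2i / 2i + 1 indexing avoids it.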
Here's what I use for fixed-length k-mers.
#include <cstddef>
#include <cstdint>
#include <ostream>

enum class nucleotide { A, C, G, T };

inline std::ostream&
operator<<(std::ostream& pOut, nucleotide pNt)
{
    switch (pNt) {
        case nucleotide::A: pOut << 'A'; break;
        case nucleotide::C: pOut << 'C'; break;
        case nucleotide::G: pOut << 'G'; break;
        case nucleotide::T: pOut << 'T'; break;
    }
    return pOut;
}

class kmer_base;

// Refers to one two-bit slot inside a 64-bit word, so that operator[]
// can be used on both sides of an assignment.
class nucleotide_proxy {
public:
    operator nucleotide() const {
        return nucleotide((*mWord >> (mPosition * 2)) & 3);
    }

    nucleotide_proxy& operator=(nucleotide pNt) {
        uint64_t word = *mWord;
        word &= ~(uint64_t(3) << (mPosition*2));  // clear the slot
        word |= uint64_t(pNt) << (mPosition*2);   // write the new value
        *mWord = word;
        return *this;
    }

private:
    friend class kmer_base;

    nucleotide_proxy(uint64_t* pWord, uint8_t pPosition)
        : mWord(pWord), mPosition(pPosition)
    {
    }

    uint64_t* mWord;
    uint8_t mPosition;
};

class kmer_base {
protected:
    // 32 nucleotides per 64-bit word.
    nucleotide_proxy access(uint64_t* pWord, size_t pPosition)
    {
        return nucleotide_proxy(pWord + (pPosition / 32),
                                static_cast<uint8_t>(pPosition & 31));
    }

    // Read-only access returns the value itself, so it can take a pointer
    // to const words (the original overload could not be called on a const kmer).
    nucleotide access(const uint64_t* pWord, size_t pPosition) const
    {
        return nucleotide((pWord[pPosition / 32] >> ((pPosition & 31) * 2)) & 3);
    }
};

template<int K>
class kmer : public kmer_base
{
    enum { Words = (K + 31) / 32 };
public:
    nucleotide_proxy operator[](size_t pOutdex) {
        return access(mWords, pOutdex);
    }

    nucleotide operator[](size_t pOutdex) const {
        return access(mWords, pOutdex);
    }

private:
    uint64_t mWords[Words] = {};  // zero-initialized: all 'A'
};
Extending this to dynamic-length k-mers is left as an exercise; it's pretty easy once you have nucleotide_proxy at your disposal. Implementing the reverse complement operator efficiently is also left as an exercise.
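As a hint for the reverse complement exercise, here is one possible sketch for a single 64-bit word. It assumes the same encoding and layout as above (A=0, C=1, G=2, T=3, with nucleotide i in bits 2i and 2i+1), and the function name reverse_complement and its signature are my own, not part of the kmer class. With that encoding, complementing a nucleotide is just flipping both of its bits, and reversing the order of the 2-bit groups can be done with the usual swap-halves bit trick.

```cpp
#include <cstdint>

// Hypothetical helper: reverse-complement the k nucleotides (1 <= k <= 32)
// packed two bits each into w, nucleotide i in bits 2i and 2i+1.
inline std::uint64_t reverse_complement(std::uint64_t w, unsigned k)
{
    // Complement: A<->T is 00<->11 and C<->G is 01<->10, i.e. flip every bit.
    w = ~w;
    // Reverse the order of the 32 two-bit groups by swapping ever larger blocks.
    w = ((w >> 2)  & 0x3333333333333333ULL) | ((w & 0x3333333333333333ULL) << 2);
    w = ((w >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((w & 0x0F0F0F0F0F0F0F0FULL) << 4);
    w = ((w >> 8)  & 0x00FF00FF00FF00FFULL) | ((w & 0x00FF00FF00FF00FFULL) << 8);
    w = ((w >> 16) & 0x0000FFFF0000FFFFULL) | ((w & 0x0000FFFF0000FFFFULL) << 16);
    w = (w >> 32) | (w << 32);
    // The k used groups now sit at the top of the word; shift them back down.
    // This also discards whatever was in the unused slots.
    return w >> (2 * (32 - k));
}
```

For example, GATT packs to 242 (G in slot 0, A in slot 1, T in slots 2 and 3), and its reverse complement AATC packs to 112. A k-mer spanning several words would additionally need the word order reversed and the partial last word handled, which I have not shown.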