Is there a generalization of std::bitset for two-bit values?

Suppose I am a genome scientist trying to store extremely long strings of characters, each of which represents two bits of information (i.e. each element is either G, A, T, or C). Because the strings are incredibly long, I need to be able to store a string of length N in precisely 2N bits (or rather, N/4 bytes).

With that motivation in mind, I am looking for a generalization of std::bitset (or boost::dynamic_bitset<>) that works on two-bit values instead of single-bit values. I want to store N such two-bit values, each of which can be 0, 1, 2, or 3. I need the data packed as closely as possible in memory, so vector<char> will not work (it uses four times as much memory as necessary).

What is the best way to achieve my goal? One option is to wrap the existing bitset templates with customized operator[], iterators, etc., but I'd prefer to use an existing library if at all possible.

std::bitset<> is fixed length and you probably do not want that.

I think you should go ahead and wrap std::vector<bool>.

Note that std::vector<bool> is optimised for space and has the added benefit that it is dynamic in size. Presumably you need to read a genome of arbitrary length from somewhere.

Have a think about whether you need much of an API to access it; you might only need a couple of methods.

@Jefffrey's answer already covers the relevant code, albeit for bitset<>.

[I am not familiar with boost::dynamic_bitset<> and what it might offer over std::vector<bool>.]

One further thought is whether it might be convenient for you to work with quads of letters, since a quad of two-bit letters neatly fills one char.

#include <iostream>
#include <string>
#include <vector>

class Genome
{
public:
    enum class Letter {A,C,G,T};
    Genome(const std::string& source)
    {
        code_.resize(source.size() * 2);
        for (unsigned index = 0; index != source.size(); ++index)
        {
            char text = source[index];
            Letter letter = textToLetter(text);
            set(index, letter);
        }
    }  
    static Letter textToLetter(char text)
    {
        // Or search through the array `letterText`.
        // Or come up with a neat but unintelligible one liner ...
        Letter letter = Letter::A;
        switch (text)
        {
        case 'A':
            letter = Letter::A;
            break;
        case 'C':
            letter = Letter::C;
            break;
        case 'G':
            letter = Letter::G;
            break;
        case 'T':
            letter = Letter::T;
            break;
        default:
            // Invalid - handle error.
            break;
        }
        return letter;
    }
    static char letterToText(Letter l) 
    {
        return letterText[(unsigned)l];
    }
    // Add bounds checking
    Letter get(unsigned index) const
    {
        unsigned distance = index * 2;
        char numeric = code_[distance] + code_[distance + 1] * 2;
        return Letter(numeric);
    }
    // Add bounds checking
    void set(unsigned index, Letter value)
    {
        unsigned distance = index * 2;
        bool low = (unsigned)value & 1;
        bool high = (bool)((unsigned)value & 2);
        code_[distance] = low;
        code_[distance + 1]  = high;
    }
    unsigned size() const
    {
        return code_.size() / 2;
    }
    // Extend by numLetters, initially set to 'A'
    void extend(unsigned numLetters)
    {
        code_.resize(code_.size() + numLetters * 2);
    }
private:

    static char letterText[4];
    std::vector<bool> code_;
};

char Genome::letterText [4] = { 'A', 'C', 'G', 'T' };

int main()
{
    Genome g("GATT");
    g.extend(3);
    g.set(5, Genome::Letter::C);
    for (unsigned i = 0; i != g.size(); ++i)
        std::cout << Genome::letterToText(g.get(i));
    std::cout << std::endl;
    return 0;
}
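
The idea of working with quads of letters can be sketched as follows. This is only a minimal illustration and not part of the code above: the names `pack_quad` and `unpack_quad` are hypothetical, and the layout (letter i in bits 2*i and 2*i+1 of the byte) is an assumption chosen for the example.

```cpp
#include <array>
#include <cstdint>

enum class Letter : std::uint8_t { A, C, G, T };

// Pack four letters into one byte; letter i occupies bits 2*i and 2*i+1.
inline std::uint8_t pack_quad(const std::array<Letter, 4>& quad)
{
    std::uint8_t byte = 0;
    for (unsigned i = 0; i != 4; ++i)
        byte |= static_cast<std::uint8_t>(static_cast<unsigned>(quad[i]) << (2 * i));
    return byte;
}

// Recover the four letters from a packed byte.
inline std::array<Letter, 4> unpack_quad(std::uint8_t byte)
{
    std::array<Letter, 4> quad{};
    for (unsigned i = 0; i != 4; ++i)
        quad[i] = static_cast<Letter>((byte >> (2 * i)) & 3u);
    return quad;
}
```

With this layout a std::vector<std::uint8_t> of N/4 bytes (rounded up) would hold N letters, at the cost of handling a partially filled last byte yourself.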

Given:

enum class nucleobase { a, c, g, t };

You have two choices. You can:

  • use a single std::bitset and play with indexing
  • use std::bitset in combination with another container

For the first, you can just define a couple of functions that target the correct number of bits per set/get:

template<std::size_t N>
void set(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    switch (x) {
        case nucleobase::a: bits.set(i * 2, 0); bits.set(i * 2 + 1, 0); break;
        case nucleobase::c: bits.set(i * 2, 0); bits.set(i * 2 + 1, 1); break;
        case nucleobase::g: bits.set(i * 2, 1); bits.set(i * 2 + 1, 0); break;
        case nucleobase::t: bits.set(i * 2, 1); bits.set(i * 2 + 1, 1); break;
    }
}

template<std::size_t N>
nucleobase get(const std::bitset<N>& bits, std::size_t i) {
    if (!bits[i * 2])
        if (!bits[i * 2 + 1]) return nucleobase::a;
        else                  return nucleobase::c;
    else
        if (!bits[i * 2 + 1]) return nucleobase::g;
        else                  return nucleobase::t;
}

The above is just an example and a terrible one (it's almost 4AM here and I really need to sleep).
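
As a slightly tidier variant of the two functions above, the switch and the nested if can be replaced by arithmetic on the enumerator's value. This is a sketch that assumes the enumerators keep their default values 0 to 3; the names `set_base` and `get_base` are made up for illustration. It keeps the same bit layout as above: the high bit of the value goes to position 2*i and the low bit to position 2*i + 1.

```cpp
#include <bitset>
#include <cstddef>

enum class nucleobase { a, c, g, t };

// Store the two-bit code of x at positions 2*i (high bit) and 2*i+1 (low bit).
template<std::size_t N>
void set_base(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    const unsigned v = static_cast<unsigned>(x);
    bits.set(i * 2,     (v & 2u) != 0);
    bits.set(i * 2 + 1, (v & 1u) != 0);
}

// Rebuild the two-bit code from the same two positions.
template<std::size_t N>
nucleobase get_base(const std::bitset<N>& bits, std::size_t i) {
    const unsigned v = (bits[i * 2] ? 2u : 0u) | (bits[i * 2 + 1] ? 1u : 0u);
    return static_cast<nucleobase>(v);
}
```

A std::bitset<2 * N> then holds N bases; std::bitset::set throws std::out_of_range if the index runs past the end.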

For the second you just need to map nucleobases to bit pairs:

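// Assumption: bit_pair is a two-bit std::bitset (requires <bitset>).
using bit_pair = std::bitset<2>;

bit_pair bits_for(nucleobase x) {
    // The leftmost character of the string is the most significant bit,
    // so these values match the cases in nucleobase_for below.
    switch (x) {
        case nucleobase::a: return bit_pair("00");
        case nucleobase::c: return bit_pair("01");
        case nucleobase::g: return bit_pair("10");
        case nucleobase::t: return bit_pair("11");
    }
    return bit_pair("00"); // unreachable; silences the missing-return warning
}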

nucleobase nucleobase_for(bit_pair x) {
    switch (x.to_ulong()) {
        case 0: return nucleobase::a; break;
        case 1: return nucleobase::c; break;
        case 2: return nucleobase::g; break;
        case 3: return nucleobase::t; break;
        default: return nucleobase::a; break; // just for the warning
    }
}
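
For the second approach, a runtime-length container of bit pairs might look like the sketch below. It assumes `bit_pair` is an alias for std::bitset<2>, and the helper names `to_bits`, `to_base` and `encode` are hypothetical. Note that a std::vector<std::bitset<2>> does not pack four bases per byte: every std::bitset<2> object occupies at least one whole byte (often more), so this variant trades the memory saving for simplicity.

```cpp
#include <bitset>
#include <stdexcept>
#include <string>
#include <vector>

enum class nucleobase { a, c, g, t };
using bit_pair = std::bitset<2>;  // assumed definition

// Map a character to its two-bit code; throws on anything but A, C, G, T.
inline bit_pair to_bits(char ch) {
    switch (ch) {
        case 'A': return bit_pair(0ull);
        case 'C': return bit_pair(1ull);
        case 'G': return bit_pair(2ull);
        case 'T': return bit_pair(3ull);
        default:  throw std::invalid_argument("not a nucleobase");
    }
}

// Map a two-bit code back to the enumerator with the same value.
inline nucleobase to_base(bit_pair bits) {
    return static_cast<nucleobase>(bits.to_ulong());
}

// Encode a whole string, one bit_pair per character.
inline std::vector<bit_pair> encode(const std::string& text) {
    std::vector<bit_pair> out;
    out.reserve(text.size());
    for (char ch : text) out.push_back(to_bits(ch));
    return out;
}
```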

Of course, if you need runtime length you can just use boost::dynamic_bitset and std::vector.

Here's what I use for fixed-length k-mers.

#include <cstdint>
#include <cstdlib>
#include <ostream>

enum class nucleotide { A, C, G, T };

inline std::ostream&
operator<<(std::ostream& pOut, nucleotide pNt)
{
    switch (pNt) {
        case nucleotide::A: pOut << 'A'; break;
        case nucleotide::C: pOut << 'C'; break;
        case nucleotide::G: pOut << 'G'; break;
        case nucleotide::T: pOut << 'T'; break;
    }
    return pOut;
}

class kmer_base;

class nucleotide_proxy {
public:
    operator nucleotide() const {
        return nucleotide((*mWord >> (mPosition * 2)) & 3);
    };

    nucleotide_proxy& operator=(nucleotide pNt) {
        uint64_t word = *mWord;
        word &= ~(uint64_t(3) << (mPosition*2));
        word |= uint64_t(pNt) << (mPosition*2);
        *mWord = word;

        return *this;
    };

private:
    friend class kmer_base;

    nucleotide_proxy(uint64_t* pWord, uint8_t pPosition)
        : mWord(pWord), mPosition(pPosition)
    {
    }

    uint64_t* mWord;
    uint8_t mPosition;
};


class kmer_base {
protected:
    nucleotide_proxy access(uint64_t* pWord, size_t pPosition)
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }

    const nucleotide_proxy access(uint64_t* pWord, size_t pPosition) const
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }
};


template<int K>
class kmer : public kmer_base
{
    enum { Words = (K + 31) / 32 };
public:
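    nucleotide_proxy operator[](size_t pIndex) {
        return access(mWords, pIndex);
    }

    const nucleotide_proxy operator[](size_t pIndex) const {
        // mWords is const in this overload, hence the const_cast;
        // the const proxy returned here does not allow assignment.
        return access(const_cast<uint64_t*>(mWords), pIndex);
    }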

private:
    uint64_t mWords[Words];
};

Extending this to dynamic-length k-mers is left as an exercise; it's pretty easy once you have nucleotide_proxy at your disposal. Implementing the reverse complement operator efficiently is also left as an exercise.
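
One possible way to approach both exercises is sketched below. It is not the author's implementation: `packed_dna` and `reverse_complement32` are hypothetical names, the encoding A=0, C=1, G=2, T=3 is taken from the enum above, and the reverse complement is shown only for sequences of at most 32 nucleotides held in a single 64-bit word whose unused upper bits are zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class nucleotide { A, C, G, T };

// Dynamic-length sequence storing 32 nucleotides per 64-bit word.
class packed_dna {
public:
    std::size_t size() const { return mSize; }

    void push_back(nucleotide pNt) {
        if (mSize % 32 == 0) mWords.push_back(0);  // start a new word
        ++mSize;
        set(mSize - 1, pNt);
    }

    // No bounds checking, as in the fixed-length version.
    nucleotide get(std::size_t pIndex) const {
        return nucleotide((mWords[pIndex / 32] >> ((pIndex % 32) * 2)) & 3);
    }

    void set(std::size_t pIndex, nucleotide pNt) {
        std::uint64_t& word = mWords[pIndex / 32];
        const unsigned shift = static_cast<unsigned>(pIndex % 32) * 2;
        word &= ~(std::uint64_t(3) << shift);
        word |= std::uint64_t(pNt) << shift;
    }

private:
    std::vector<std::uint64_t> mWords;
    std::size_t mSize = 0;
};

// Reverse complement of the pK lowest nucleotides of a word (1 <= pK <= 32).
// With A=0, C=1, G=2, T=3 the complement of a code x is 3 - x, i.e. ~x & 3.
inline std::uint64_t reverse_complement32(std::uint64_t pWord, unsigned pK) {
    std::uint64_t w = ~pWord;  // complement every two-bit code
    // Reverse the order of the 32 two-bit groups by swapping ever larger blocks.
    w = ((w >> 2)  & 0x3333333333333333ULL) | ((w & 0x3333333333333333ULL) << 2);
    w = ((w >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((w & 0x0F0F0F0F0F0F0F0FULL) << 4);
    w = ((w >> 8)  & 0x00FF00FF00FF00FFULL) | ((w & 0x00FF00FF00FF00FFULL) << 8);
    w = ((w >> 16) & 0x0000FFFF0000FFFFULL) | ((w & 0x0000FFFF0000FFFFULL) << 16);
    w = (w >> 32) | (w << 32);
    // The pK meaningful codes are now in the top bits; move them back down.
    return w >> (2 * (32 - pK));
}
```

For example, GATT stored with G at position 0 is the word 0xF2, and its reverse complement AATC comes out as 0x70. Extending the word-level trick to sequences spanning several words additionally requires reversing the word order and shifting across word boundaries, which the sketch does not attempt.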
