
How to implement a data table with different column data types in C++

I want to implement a data table where the fields may have different types. One field may be a vector of string. Another field may be a vector of float. And the types of the fields are unknown at compile time because I want to be able to construct a data table from a csv file.

How can I do it in C++?

Use boost::variant, which can represent one of a set of types:

std::vector<boost::variant<std::string, float>> values;

You can then apply a visitor to the variant:

struct visitor_t : boost::static_visitor<> {
    void operator()(std::string const& x) const {
        std::cout << "got string: " << x << '\n';
    }

    void operator()(float x) const {
        std::cout << "got float: " << x << '\n';
    }
};

visitor_t visitor;
for (auto&& value : values) {
    boost::apply_visitor(visitor, value);
}
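If C++17 is available, the same idea can be written with std::variant and std::visit instead of Boost. The following is a minimal sketch, not part of the original answer: parse_cell, describe_t and describe are hypothetical names, and the rule "a cell is a float only if std::strtof consumes the whole text" is an assumption about how you want to classify CSV cells.

```cpp
#include <cstdlib>
#include <string>
#include <variant>
#include <vector>

// One table cell: either text or a number.
using Cell = std::variant<std::string, float>;

// Hypothetical helper: the cell becomes a float only if std::strtof
// consumes the entire text; otherwise it stays a string.
Cell parse_cell(const std::string& text) {
    if (text.empty()) return text;
    char* end = nullptr;
    float f = std::strtof(text.c_str(), &end);
    if (end == text.c_str() + text.size()) return f;
    return text;
}

// Visitor with one overload per alternative, like visitor_t above,
// but returning a description instead of printing it.
struct describe_t {
    std::string operator()(const std::string& s) const { return "string: " + s; }
    std::string operator()(float f) const { return "float: " + std::to_string(f); }
};

std::string describe(const Cell& c) { return std::visit(describe_t{}, c); }
```

A column is then a std::vector<Cell>, and describe(cell) dispatches on whichever type the cell currently holds.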


I have tried something similar:

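// Component must be a complete type before it can be used as a base class.
class Component
{
  public:
    virtual ~Component() = default;
};

class Field : public Component
{
  // Common interface methods
  public:
    virtual std::string get_field_name() const = 0;
    virtual std::string get_value_as_string() const = 0;
};

class Record : public Component
{
  // Common interface methods
  std::vector< std::unique_ptr<Component> > fields;
};

class Integer_Field : public Field
{
  // Overrides get_field_name() and get_value_as_string() for an integer value
};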

The idea is that a Record can contain various fields. The fields are held through pointers to the Component base class, which also allows a Record to contain sub-records.
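To make the idea concrete, here is a minimal, self-contained sketch of the hierarchy with one concrete field type. It is an illustration under assumptions: get_value_as_string() is moved up into Component so a Record can print its children, and Record::add and the brace/comma output format are hypothetical choices, not part of the outline above.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Component {
  public:
    virtual ~Component() = default;
    // Assumption: every component, field or record, can render itself as text.
    virtual std::string get_value_as_string() const = 0;
};

class Field : public Component {
  public:
    virtual std::string get_field_name() const = 0;
};

class Integer_Field : public Field {
  public:
    Integer_Field(std::string name, int value)
        : name_(std::move(name)), value_(value) {}
    std::string get_field_name() const override { return name_; }
    std::string get_value_as_string() const override { return std::to_string(value_); }

  private:
    std::string name_;
    int value_;
};

class Record : public Component {
  public:
    // Hypothetical helper: takes ownership of a field or a sub-record.
    void add(std::unique_ptr<Component> c) { fields_.push_back(std::move(c)); }

    // Joins the children's values with commas and wraps them in braces.
    std::string get_value_as_string() const override {
        std::string out = "{";
        for (std::size_t i = 0; i < fields_.size(); ++i) {
            if (i != 0) out += ",";
            out += fields_[i]->get_value_as_string();
        }
        return out + "}";
    }

  private:
    std::vector<std::unique_ptr<Component>> fields_;
};
```

Because a Record is itself a Component, it can be added to another Record, which is what makes sub-records possible.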

You should see Sean Parent's talk "Inheritance Is the Base Class of Evil." It is also available in print form under "Value Semantics and Concept-based Polymorphism."

He proposes a concept-based object class that defines an interface for the elements of a container. Any object that meets the required interface (i.e., has the required free-standing functions) can be put in the container.

You might be able to pick up the gist from looking at the code sample below (taken from the documentation I linked above).

class object_t {
  public:
    template <typename T>
    object_t(T x) : self_(make_shared<model<T>>(move(x)))
    { }

    friend void draw(const object_t& x, ostream& out, size_t position)
    { x.self_->draw_(out, position); }

  private:
    struct concept_t {
        virtual ~concept_t() = default;
        virtual void draw_(ostream&, size_t) const = 0;
    };
    template <typename T>
    struct model : concept_t {
        model(T x) : data_(move(x)) { }
        void draw_(ostream& out, size_t position) const 
        { draw(data_, out, position); }

        T data_;
    };

   shared_ptr<const concept_t> self_;
};

Your fields would each be one of these object_t types, which would take any type (std::vector<int>, std::deque<float>, std::string, etc.). You would just need to be sure that whatever methods you want to be supported for object_t (in the example, it's just draw()) are defined somewhere for your different inputs. This is nice because it gives you value semantics and also makes it very simple to add new types.
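As a rough illustration of how this could look for table fields, the sketch below applies the same pattern with a single operation, to_text, in place of draw. The names field_t and to_text are hypothetical, and the two overloads are just examples of column types you might support; the free functions have to be declared before the class so that model can find them.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// The required interface: one free function per supported type.
// Supporting a new column type only means adding another overload.
std::string to_text(const std::string& s) { return s; }

std::string to_text(const std::vector<int>& v) {
    std::string out;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i != 0) out += ' ';
        out += std::to_string(v[i]);
    }
    return out;
}

// Type-erased field, following the structure of object_t above.
class field_t {
  public:
    template <typename T>
    field_t(T x) : self_(std::make_shared<model<T>>(std::move(x))) {}

    friend std::string to_text(const field_t& f) { return f.self_->to_text_(); }

  private:
    struct concept_t {
        virtual ~concept_t() = default;
        virtual std::string to_text_() const = 0;
    };
    template <typename T>
    struct model : concept_t {
        model(T x) : data_(std::move(x)) {}
        // Forwards to the free function chosen by overload resolution on T.
        std::string to_text_() const override { return to_text(data_); }
        T data_;
    };

    std::shared_ptr<const concept_t> self_;
};
```

A row can then be a plain std::vector<field_t> holding strings and integer vectors side by side, and copying a field_t only copies a shared pointer to immutable data.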

Because the data types are not known at compile time, you must construct and store that information at runtime. For each field of each row, there are potentially three pieces of information to encode:

  1. The type of the field.
  2. The value of the field (must match the type specified in #1)
  3. (Optional) The name of the field.

You could use polymorphic types, boost::any, or boost::variant (or std::any or std::variant, as defined in C++17), but a more elegant, robust and memory-efficient solution would take advantage of the fact that every row has the same structure.

What you are doing is basically creating a database program. In a database, a schema encodes the structure of the data, but is separate from the data itself. What you want is a way to encode a schema at runtime, something like this:

enum class FieldType {
  // Scalar types:
  Boolean, Integer, FloatingPoint, String,

  // Array types:
  ArrayBit = 0x1000, // This bit set for array types
  Boolean_Array = Boolean | ArrayBit,
  Integer_Array, FloatingPoint_Array, String_Array
};

class FieldSchema {
  FieldType   m_type;
  std::string m_name;  // Optional, if fields are named
  ...
};

class RowSchema {
  std::vector<FieldSchema> m_fields;
  ...
};

A data field itself is simply a union of the possible data types. (Note that putting a string or vector into a union requires C++11 or later.)

union FieldValue {
  bool                     m_boolean;
  int                      m_integer;
  double                   m_floatingpoint;
  std::string              m_string;
  std::vector<bool>        m_boolean_array;
  std::vector<int>         m_integer_array;
  std::vector<double>      m_floatingpoint_array;
  std::vector<std::string> m_string_array;

  // Constructors for each type go here
};

And a data row is simply a vector of data fields, with a pointer to the schema:

class RowValue {
  RowSchema*               m_schema;
  std::vector<FieldValue>  m_fields;
  ...
};

Now, for each CSV file, there will be one RowSchema object for the entire table, but one RowValue object for each row. All of the RowValue objects for a given file will share (point to) the same RowSchema object. The process for reading a CSV file is:

  1. Determine the structure (schema) for all the rows (possibly by reading the first row).
  2. Build a RowSchema object reflecting that structure.
  3. For each row: Create a RowValue object pointing to the RowSchema from step 2; read each field into the correct data type as specified in the corresponding FieldSchema; and append the value to the end of the m_fields array using emplace_back.
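The three steps above can be sketched roughly as follows. This is a simplified illustration, not the full design: FieldValue is a C++17 std::variant standing in for the union, only three scalar types are handled, field names are omitted, and split, infer_type, infer_schema and read_row are hypothetical helpers. The splitter is naive and assumes no quoted fields or embedded commas.

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

enum class FieldType { Integer, FloatingPoint, String };

using RowSchema  = std::vector<FieldType>;                   // one entry per column
using FieldValue = std::variant<long, double, std::string>;  // stand-in for the union

struct RowValue {
    const RowSchema* schema;          // shared by every row of the table
    std::vector<FieldValue> fields;
};

// Naive splitter: no quoting, no escaped commas.
std::vector<std::string> split(const std::string& line) {
    std::vector<std::string> cells;
    std::stringstream ss(line);
    std::string cell;
    while (std::getline(ss, cell, ',')) cells.push_back(cell);
    return cells;
}

// Step 1: guess a column type from a single sample cell.
FieldType infer_type(const std::string& cell) {
    if (cell.empty()) return FieldType::String;
    char* end = nullptr;
    std::strtol(cell.c_str(), &end, 10);
    if (*end == '\0') return FieldType::Integer;
    std::strtod(cell.c_str(), &end);
    if (*end == '\0') return FieldType::FloatingPoint;
    return FieldType::String;
}

// Step 2: build the schema from the first data row.
RowSchema infer_schema(const std::string& first_row) {
    RowSchema schema;
    for (const std::string& cell : split(first_row)) schema.push_back(infer_type(cell));
    return schema;
}

// Step 3: convert each cell to the type the schema prescribes.
RowValue read_row(const RowSchema& schema, const std::string& line) {
    RowValue row{&schema, {}};
    const std::vector<std::string> cells = split(line);
    for (std::size_t i = 0; i < schema.size() && i < cells.size(); ++i) {
        switch (schema[i]) {
            case FieldType::Integer:
                row.fields.emplace_back(std::strtol(cells[i].c_str(), nullptr, 10));
                break;
            case FieldType::FloatingPoint:
                row.fields.emplace_back(std::strtod(cells[i].c_str(), nullptr));
                break;
            case FieldType::String:
                row.fields.emplace_back(cells[i]);
                break;
        }
    }
    return row;
}
```

Inferring the type from a single row is fragile (a column whose first value is 1 may later contain 1.5), so a real reader would probably scan more rows or accept an explicit schema.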

Since this is a Stack Overflow answer and not a textbook about C++11, I won't go into detail on how to construct a union that contains a string or a vector, nor will I get into how to use vector::emplace_back. All of this information is available in other places (e.g., cppreference.com). This can also be done in C++03, with additional work to simulate a union of non-trivial types (e.g., by using boost::variant).

Obviously, I've left out a lot of details. One caution I'll mention is that the destructor for FieldValue is insufficient to destroy a string or vector contained within the union. Instead, you'll have to look up the data type in the schema and explicitly call the correct destructor for the field. The destructor for RowValue must therefore iterate over the fields and destroy each individually. A C++17 std::variant (or boost::variant) would help here, at the cost of additional memory.
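To show what that manual lifetime management might look like, here is a reduced sketch with only two field types. It rests on several assumptions that are not in the outline above: the row is sized once from the schema rather than grown with emplace_back (the union is not copyable or movable, so the vector cannot reallocate), string members are started with placement new in the RowValue constructor, and integer_at and string_at are hypothetical accessors. The schema must outlive every row that points to it.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

enum class FieldType { Integer, String };
using RowSchema = std::vector<FieldType>;

union FieldValue {
    int         m_integer;
    std::string m_string;

    FieldValue() : m_integer(0) {}  // every value starts life as an int
    ~FieldValue() {}                // cannot know the active member, so destroys nothing
};

class RowValue {
  public:
    explicit RowValue(const RowSchema& schema)
        : m_schema(&schema), m_fields(schema.size()) {
        // Placement new begins the lifetime of each string member.
        for (std::size_t i = 0; i < schema.size(); ++i)
            if (schema[i] == FieldType::String)
                new (&m_fields[i].m_string) std::string();
    }

    ~RowValue() {
        // Look up each field's type in the schema and call the matching destructor.
        for (std::size_t i = 0; i < m_fields.size(); ++i)
            if ((*m_schema)[i] == FieldType::String)
                m_fields[i].m_string.~basic_string();
    }

    // Copying would need the same per-type dispatch, so it is disabled here.
    RowValue(const RowValue&) = delete;
    RowValue& operator=(const RowValue&) = delete;

    // The caller is responsible for using the accessor that matches the schema.
    int&         integer_at(std::size_t i) { return m_fields[i].m_integer; }
    std::string& string_at(std::size_t i)  { return m_fields[i].m_string; }

  private:
    const RowSchema*        m_schema;
    std::vector<FieldValue> m_fields;
};
```

The amount of bookkeeping needed even for this small case is the trade-off mentioned above: std::variant stores a type tag in every value but handles construction, destruction and copying for you.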

