
Can you represent CSV data in Google's Protocol Buffer format?

I've recently found out about protocol buffers and was wondering whether they could be applied to my specific problem.

Basically, I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes in size.

Each field in the CSV has a header, and there are only two types: strings and decimals (some values have a lot of significant digits, and I need to handle all numbers the same way). However, each file will have different column names.

As well as capturing the original CSV data, I need to be able to add extra information to the file before saving. I was also hoping to make this future-proof by handling different file versions.

So, is it possible to use protocol buffers to capture an arbitrary number of arbitrarily named columns of data, like a CSV file?

Well, it's certainly representable. Something like:

message CsvFile {
    repeated CsvHeader header = 1;
    repeated CsvRow row = 2;
}

message CsvHeader {
    required string name = 1;
    required ColumnType type = 2;
}

enum ColumnType {
    DECIMAL = 1;
    STRING = 2;
}

message CsvRow {
    repeated CsvValue value = 1;
}

// Note that the column is implicit based on position within row    
message CsvValue {
    optional string string_value = 1;
    optional Decimal decimal_value = 2;
}

message Decimal {
    // However you want to represent it (there are various options here)
}

I'm not sure how much benefit it will provide, mind you... You can certainly add more information (add fields to the CsvFile message), and future-proofing works in the "normal PB way": only add optional fields, etc.
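To make the positional layout of the schema above concrete, here is a minimal Python sketch that loads CSV text into plain dataclasses shaped like CsvFile and CsvHeader. It does not use generated protobuf classes; the `load_csv` helper, the rule that a column is DECIMAL only when every cell parses as a finite decimal, and the assumption that all rows have the same length are illustrative choices, not part of the original answer.

```python
import csv
import io
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation
from typing import List, Union


@dataclass
class CsvHeader:
    name: str
    type: str  # "DECIMAL" or "STRING", mirroring the ColumnType enum


@dataclass
class CsvFile:
    header: List[CsvHeader] = field(default_factory=list)
    # Each row is a list of values; the column is implied by position.
    row: List[List[Union[Decimal, str]]] = field(default_factory=list)


def _is_decimal(text: str) -> bool:
    """Return True if text parses as a finite decimal number."""
    try:
        return Decimal(text).is_finite()
    except InvalidOperation:
        return False


def load_csv(text: str) -> CsvFile:
    """Hypothetical helper: parse CSV text with a header line.

    Assumes every data row has as many cells as the header.
    """
    reader = csv.reader(io.StringIO(text))
    names = next(reader)
    raw_rows = list(reader)

    result = CsvFile()
    for i, name in enumerate(names):
        # Assumption: a column is DECIMAL only if every cell parses.
        numeric = bool(raw_rows) and all(_is_decimal(r[i]) for r in raw_rows)
        result.header.append(CsvHeader(name, "DECIMAL" if numeric else "STRING"))

    for r in raw_rows:
        result.row.append(
            [Decimal(v) if h.type == "DECIMAL" else v
             for h, v in zip(result.header, r)]
        )
    return result
```

Using `decimal.Decimal` rather than `float` keeps all significant digits of the original text, which is the property the question asks for.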

Well, protobuf-net (my version) is based on regular .NET types, so no (since it won't cope with a schema that keeps changing from file to file). But Jon's version might allow dynamic types. Personally, I'd just use CSV and run it through GZipStream; I expect that will be fine for the purpose.


Edit: actually, I forgot: protobuf-net does support extensible objects, but you need to be a bit careful... it would depend on the full context, I expect.

Plus, Jon's approach of nested data would probably work too.
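The "just compress the CSV" suggestion above names .NET's GZipStream; as a rough equivalent, the sketch below uses Python's standard `gzip` and `csv` modules instead. The function names `write_csv_gz` and `read_csv_gz` are hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
import gzip


def write_csv_gz(path, header, rows):
    """Write a header line and data rows as gzip-compressed CSV."""
    # "wt" opens the gzip stream in text mode; newline="" lets the csv
    # module control line endings itself.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv_gz(path):
    """Read a gzip-compressed CSV back as (header, list of rows)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)
```

All values come back as strings, so decimals keep their original digits; any extra per-file information would need its own convention (for example, a separate sidecar file), which is where a structured format has an advantage.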
