

What is the correct CSV format for tuples when loading data with DSBulk?

I recently started using Cassandra for my new project and am doing some load testing.

I have a scenario where I'm doing a dsbulk load using a CSV file like this:

$ dsbulk load -url <csv path> -k <keyspace> -t <table> -h <host> -u <user> -p <password> -header true -cl LOCAL_QUORUM

My CSV file entries look like this:

userid birth_year created_at                  freq
1234   1990       2023-01-13T23:27:15.563Z    {1234:{"(1, 2)": 1}}

Column types:

userid bigint PRIMARY KEY,
birth_year int,
created_at timestamp,
freq map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>
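
For reference, a table definition matching these column types would look roughly like this (the keyspace and table names are placeholders, not the actual ones from the project):

CREATE TABLE my_keyspace.user_freq (
    userid     bigint PRIMARY KEY,
    birth_year int,
    created_at timestamp,
    freq       map<bigint, frozen<map<frozen<tuple<tinyint, smallint>>, smallint>>>
);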

The issue is, for column freq, I have tried different ways of setting the value in the CSV as shown below, but I'm not able to insert the row using dsbulk:

  1. If I set freq as {1234:{[1, 2]: 1}}, I get:

     com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field freq to variable freq; conversion from Java type java.lang.String to CQL type Map(BIGINT => Map(Tuple(TINYINT, SMALLINT) => SMALLINT, not frozen), not frozen) failed for raw value: {1234:{[1, 2]: 1}}
     Caused by: java.lang.IllegalArgumentException: Could not parse '{1234:{[1, 2]: 1}}' as Json
     Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('[' (code 91)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name at [Source: (String)"{1234:{[1, 2]: 1}}"; line: 1, column: 9]

  2. If I set freq as {\"1234\":{\"[1, 2]\":1}}, I get:

     java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.

  3. If I set freq as {1234:{"[1, 2]": 1}} or {1234:{"(1, 2)": 1}}, I get:

     Source: 1234,80,2023-01-13T23:27:15.563Z,"{1234:{""[1, 2]"": 1}}"
     java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 5.

But with the COPY ... FROM command, the value {1234:{[1, 2]:1}} for freq inserts into the DB without any error, and the value in the DB looks like this: {1234: {(1, 2): 1}}
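
For comparison, the cqlsh COPY command that accepted this format was something like the following (table and file names are placeholders):

cqlsh> COPY my_keyspace.user_freq (userid, birth_year, created_at, freq)
       FROM '/path/to/data.csv' WITH HEADER = true;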

I guess the JSON parser doesn't accept an array (tuple) as a key when I try with dsbulk? Can someone advise me what the issue is and how to fix it? Appreciate your help.

When loading data using the DataStax Bulk Loader (DSBulk), the CSV format for the CQL tuple type is different from the format used by the COPY ... FROM command because DSBulk uses a different parser.

Formatting the CSV data is particularly challenging in your case because the column contains multiple nested CQL collections.

InvalidMappingException

The JSON parser used by DSBulk doesn't accept parentheses () for enclosing tuples. It also expects tuples to be enclosed in double quotes ("), otherwise you'll get errors like:

com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: \
  Could not map field ... to variable ...; \
  conversion from Java type ... to CQL type ... failed for raw value: ...
   ...
Caused by: java.lang.IllegalArgumentException: Could not parse '...' as Json
   ...
Caused by: com.fasterxml.jackson.core.JsonParseException: \
  Unexpected character ('(' (code 40)): was expecting either valid name character \
  (for unquoted name) or double-quote (for quoted) to start field name
   ...
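
For example, a map key written with parentheses or left unquoted is rejected, while the same tuple written as a bracketed string in double quotes parses cleanly (illustrative values only):

{(2,3):4}       rejected: parentheses are not valid JSON
{[2,3]:4}       rejected: the tuple key is not quoted
{"[2,3]":4}     accepted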

IllegalArgumentException

Since tuple values contain a comma ( , ) as a separator, DSBulk incorrectly parses the rows: it thinks each row contains more fields than expected and throws an IllegalArgumentException, for example:

java.lang.IllegalArgumentException: Expecting record to contain 2 fields but found 3.
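
To illustrate with a made-up two-column row and the default comma delimiter:

1,{"[2,3]":4}

The comma inside the map key splits the record, so DSBulk sees three fields (1, {"[2 and 3]":4}) instead of the expected two.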

Solution

Just to make it easier, here is the schema for the table I'm using as an example:

CREATE TABLE inttuples (
    id int PRIMARY KEY,
    inttuple map<frozen<tuple<tinyint, smallint>>, smallint>
);

In this example CSV file, I've used the pipe character ( | ) as a delimiter:

id|inttuple
1|{"[2,3]":4}

Here's another example that uses tabs as the delimiter:

id\t      inttuple
1\t       {"[2,3]":4}
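
Putting it together, a load command for the pipe-delimited file above would look something like this (keyspace, host, credentials, and path are placeholders):

$ dsbulk load -url /path/to/inttuples.csv -k my_keyspace -t inttuples \
    -h <host> -u <user> -p <password> -header true -delim '|'

Applying the same idea to the freq column in your table, the value should presumably be written with the tuple key quoted, for example {1234:{"[1,2]":1}}, together with a non-comma delimiter.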

Note that you will need to specify the delimiter with either -delim '|' or -delim '\t' when running DSBulk. Cheers!


Please support the Apache Cassandra community by hovering over the tag then click on the Watch tag button. Thanks!
