Put values from Spark RDD to the same HBase column with default timestamp

I'm using Spark and trying to write an RDD to an HBase table.

Here's the sample code:

public static void main(String[] args) {
// ... code omitted
    // Transform each Row into (rowkey, Put) pairs for HBase.
    JavaPairRDD<ImmutableBytesWritable, Put> hBasePutsRDD = rdd
            .javaRDD()
            .flatMapToPair(new MyFunction());

    hBasePutsRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
}
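For reference, the omitted setup is presumably a standard TableOutputFormat job configuration. The sketch below is only an assumption about what that code might look like, not the original; the table name "my_table" is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical job setup for saveAsNewAPIHadoopDataset; names are placeholders.
Configuration conf = HBaseConfiguration.create();
conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table"); // target table (placeholder)
Job job = Job.getInstance(conf);
job.setOutputFormatClass(TableOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);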

// static: lets the static main() instantiate it and keeps Spark from
// trying to serialize the enclosing class along with it
private static class MyFunction implements
            PairFlatMapFunction<Row, ImmutableBytesWritable, Put> {

    public Iterable<Tuple2<ImmutableBytesWritable, Put>> call(final Row row)
            throws Exception {

        // Build one Put per input Row. No explicit timestamp is set,
        // so HBase assigns one when the mutation is applied.
        byte[] rowKey = getRowKey(row);
        Put put = new Put(rowKey);
        String value = row.getAs("rddFieldName");

        put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
                      "COLUMN".getBytes(Charset.forName("UTF-8")),
                      value.getBytes(Charset.forName("UTF-8")));

        return Collections.singletonList(
            new Tuple2<>(new ImmutableBytesWritable(rowKey), put));
    }
}

If I manually set the timestamp like this:

put.addColumn("CF".getBytes(Charset.forName("UTF-8")), 
              "COLUMN".getBytes(Charset.forName("UTF-8")),
              manualTimestamp,
              value.getBytes(Charset.forName("UTF-8")));

everything works fine and I have as many cell versions in the HBase column "COLUMN" as there are distinct values in the RDD.

But if I do not, there is only one cell version.

In other words, if there are multiple Put objects with the same column family and column but different values and the default timestamp, only one value will be inserted and the others will be omitted (probably overwritten).

Could you please help me understand how this works (saveAsNewAPIHadoopDataset especially), and how I can modify the code so that all values are inserted without setting a timestamp manually?

They are overwritten when you don't set your own timestamp. HBase needs a unique key for every value, so the real key for every value is:

rowkey + column family + column key + timestamp => value
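For example (a hypothetical illustration using the standard HBase client API; the row key, values, and the timestamp 42L are made up), two Puts that share the full coordinate resolve to a single cell:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Identical rowkey, family, qualifier and timestamp:
Put p1 = new Put(Bytes.toBytes("row1"));
p1.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"), 42L, Bytes.toBytes("first"));
Put p2 = new Put(Bytes.toBytes("row1"));
p2.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"), 42L, Bytes.toBytes("second"));
// After writing both, even a versioned read returns a single cell: "second"
// wins, because the full key (row1 + CF + COLUMN + 42) is the same for both.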

When you don't set a timestamp and the Puts are inserted in bulk, many of them get the same timestamp, because HBase can insert multiple rows within the same millisecond. So you need a custom timestamp for every value that goes to the same column key.
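For example, one way to derive a distinct timestamp per record is sketched below. This is only an illustration, not part of the original code: the uniqueTimestamp helper and the per-JVM AtomicLong are my own assumptions, records processed on different executors could in theory still collide, and the resulting values no longer look like epoch milliseconds (which matters if you rely on TTLs):

import java.util.concurrent.atomic.AtomicLong;

// Combine wall-clock time with a per-JVM counter so Puts created in the
// same millisecond still get distinct timestamps (1000 sub-slots per ms).
private static final AtomicLong SEQ = new AtomicLong();

private static long uniqueTimestamp() {
    return System.currentTimeMillis() * 1000L + (SEQ.incrementAndGet() % 1000L);
}

// Then, inside MyFunction.call(...):
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
              "COLUMN".getBytes(Charset.forName("UTF-8")),
              uniqueTimestamp(),
              value.getBytes(Charset.forName("UTF-8")));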

I don't understand why you don't want to use a custom timestamp, since you said it already works. If you are worried about using extra space in the database: HBase already stores a timestamp with every cell, even if you don't pass one in the Put command. So nothing changes when you set the timestamp manually; please use it.
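You can verify this by reading a cell back. A minimal sketch, assuming an open Table handle named table and the "row1" key from the example above:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;

// Even when the Put carried no explicit timestamp, the stored cell has one:
Result result = table.get(new Get(Bytes.toBytes("row1")));
Cell cell = result.getColumnLatestCell(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"));
long serverAssignedTs = cell.getTimestamp(); // assigned by the region server at write time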
