
HBase mapreduce job: all column values are null

I am trying to create a map-reduce job in Java on a table from an HBase database. Using the examples from here and other material from the internet, I managed to successfully write a simple row counter. However, trying to write one that actually does something with the data from a column was unsuccessful, since the received bytes are always null.

Part of my job's Driver is this:

/* Set main, map and reduce classes */
job.setJarByClass(Driver.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);

/* Get data only from the last 24h */
Timestamp timestamp = new Timestamp(System.currentTimeMillis());
try {
    long now = timestamp.getTime();
    scan.setTimeRange(now - 24 * 60 * 60 * 1000, now);
} catch (IOException e) {
    e.printStackTrace();
}

/* Initialize the initTableMapperJob */
TableMapReduceUtil.initTableMapperJob(
        "dnsr",
        scan,
        Map.class,
        Text.class,
        Text.class,
        job);

/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);

As you can see, the table is called dnsr. My mapper looks like this:

@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue("d".getBytes(), "fqdn".getBytes());
    if (columnValue == null)
        return;

    byte[] firstSeen = value.getValue("d".getBytes(), "fs".getBytes());
    // if (firstSeen == null)
    //     return;

    String fqdn = new String(columnValue).toLowerCase();
    String fs = (firstSeen == null) ? "empty" : new String(firstSeen);

    context.write(new Text(fqdn), new Text(fs));
}

Some notes:

  • the column family of the dnsr table is just d. There are multiple columns, some of them called fqdn and fs (firstSeen);
  • even though the fqdn values appear correctly, fs is always the "empty" string (I added this check after getting errors saying you can't construct a String from null);
  • if I change the fs column name to something else, for example ls (lastSeen), it works;
  • the reducer doesn't do anything, it just outputs everything it receives.

I created a simple table scanner in javascript that queries the exact same table and columns, and I can clearly see the values are there. Using the command line and doing queries manually, I can clearly see the fs values are not null; they are bytes that can later be converted into a string (representing a date).
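For reference, this is roughly how such a manual check can be done in the HBase shell (the row key below is a placeholder; RAW and VERSIONS expose every stored cell together with its timestamp, which turns out to be relevant here):

```shell
# Show all stored cell versions of d:fs and d:ls, with their timestamps
scan 'dnsr', {COLUMNS => ['d:fs', 'd:ls'], RAW => true, VERSIONS => 5, LIMIT => 10}

# Inspect a single row ('some-row-key' is a placeholder)
get 'dnsr', 'some-row-key', {COLUMN => 'd:fs', VERSIONS => 5}
```

The timestamp printed next to each cell is the key piece of information: it is the cell's write time, not the row's.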

What could be the problem that makes me always get null?

Thanks!

Update: If I get all the columns in a specific column family, I don't receive fs. However, a simple scanner implemented in javascript returns fs as a column of the dnsr table.

@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
        throws InterruptedException, IOException {
    byte[] columnValue = value.getValue(columnFamily, fqdnColumnName);
    if (columnValue == null)
        return;
    String fqdn = new String(columnValue).toLowerCase();

    /* Getting all the columns */
    String[] cns = getColumnsInColumnFamily(value, "d");
    StringBuilder sb = new StringBuilder();
    for (String s : cns) {
        sb.append(s).append(";");
    }

    context.write(new Text(fqdn), new Text(sb.toString()));
}

I used an answer from here to get all the column names.

In the end, I managed to find the 'problem'. HBase is a column-oriented datastore: data is stored and retrieved by column, so only the relevant data is read when just some of it is required. Every column family has one or more column qualifiers (columns), and each column has multiple cells. The interesting part is that every cell has its own timestamp.

Why was this the problem? Well, when you do a ranged search, only the cells whose timestamp falls in that range are returned, so you may end up with a row with "missing cells". In my case, I had a DNS record with fields such as firstSeen and lastSeen. lastSeen is updated every time I see that domain, while firstSeen remains unchanged after the first occurrence. As soon as I changed the ranged map-reduce job to a simple map-reduce job (using all-time data), everything was fine (but the job took longer to finish).
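The behavior above can be illustrated with a minimal, self-contained sketch (plain Java, no HBase client; the qualifier names and timestamps are invented for the example). It models a row as a list of cells, each with its own write timestamp, and applies the same semantics as Scan.setTimeRange: the filter is per cell, not per row, so a cell written outside the range simply disappears from the Result.

```java
import java.util.ArrayList;
import java.util.List;

public class CellTimeRangeDemo {

    /** A minimal stand-in for an HBase cell: qualifier plus write timestamp. */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only cells whose timestamp falls in [min, max), mirroring Scan.setTimeRange. */
    static List<String> visibleQualifiers(List<Cell> row, long min, long max) {
        List<String> out = new ArrayList<>();
        for (Cell c : row) {
            if (c.timestamp >= min && c.timestamp < max) {
                out.add(c.qualifier);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long dayMs = 86_400_000L;      // 24h in milliseconds
        long now = 10 * dayMs;         // pretend "now" is day 10

        List<Cell> row = new ArrayList<>();
        row.add(new Cell("fqdn", now - 5_000)); // rewritten on every update
        row.add(new Cell("fs", dayMs));         // firstSeen: written once, long ago
        row.add(new Cell("ls", now - 1_000));   // lastSeen: updated on every hit

        // Full scan sees every cell; a last-24h scan silently drops the old fs cell.
        System.out.println("Full scan: " + visibleQualifiers(row, 0, Long.MAX_VALUE));
        System.out.println("Last 24h:  " + visibleQualifiers(row, now - dayMs, now));
        // → Full scan: [fqdn, fs, ls]
        // → Last 24h:  [fqdn, ls]
    }
}
```

This is exactly why value.getValue("d".getBytes(), "fs".getBytes()) returned null in the ranged job: the fs cell's timestamp predated the 24h window, while ls, being rewritten constantly, always fell inside it.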

Cheers!
