
HBase: get(...) vs. scan and in-memory table

I'm running a MapReduce (MR) job over HBase.

The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:

1. In the constructor of the reducer class, I do something like this:

HBaseCRUD hbaseCRUD = new HBaseCRUD();

HTableInterface t1= hbaseCRUD.getTable("T1",
                            "CF1", null, "C1", "C2");
HTableInterface t2= hbaseCRUD.getTable("T2",
                            "CF1", null, "C1", "C2");

2. In reduce(...):

 String lowercase = ....;

/* Start : HBase code */
/*
 * TRY using get(...) on the table rather than a
 * Scan!
 */
Scan scan = new Scan();
scan.setStartRow(lowercase.getBytes());
scan.setStopRow(lowercase.getBytes());

/*scan will return a single row*/
ResultScanner resultScanner = t1.getScanner(scan);

for (Result result : resultScanner) {
 /*business logic*/
}

I'm not sure the above code is sensible in the first place, but I have a question: would a get(...) provide any performance benefit over the scan?

Get get = new Get(lowercase.getBytes());
Result getResult = t1.get(get);

Since T1 and T2 will be (mostly) read-only, I think performance will improve if they are kept in memory. As per the HBase documentation, I will have to re-create tables T1 and T2. Please verify the correctness of my understanding:

public void createTables(String tableName, boolean readOnly,
            boolean blockCacheEnabled, boolean inMemory,
            String... columnFamilyNames) throws IOException {
        // TODO Auto-generated method stub

        HTableDescriptor tableDesc = new HTableDescriptor(tableName);
        /* not sure !!! */
        tableDesc.setReadOnly(readOnly);

        HColumnDescriptor columnFamily = null;

        if (!(columnFamilyNames == null || columnFamilyNames.length == 0)) {

            for (String columnFamilyName : columnFamilyNames) {

                columnFamily = new HColumnDescriptor(columnFamilyName);
                /*
                 * Start : Do these steps ensure that the column
                 * family(actually, the column data) is in-memory???
                 */
                columnFamily.setBlockCacheEnabled(blockCacheEnabled);
                columnFamily.setInMemory(inMemory);
                /*
                 * End : Do these steps ensure that the column family(actually,
                 * the column data) is in-memory???
                 */

                tableDesc.addFamily(columnFamily);
            }
        }

        hbaseAdmin.createTable(tableDesc);
        hbaseAdmin.close();
    }

Once done:

  1. How can I verify that the columns are in memory (of course, the describe statement and the browser reflect it) and are accessed from there and not from disk?
  2. Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?

Would a get(...) provide any performance benefit over the scan?

A Get operates directly on the single row identified by the row key passed to the Get instance. A Scan, by contrast, reads over a range of rows, and it covers every row in the table unless you restrict it by setting start and stop row keys on the Scan instance. So if you know beforehand which row you need, a Get is clearly the more efficient choice: you go straight to that row and perform the desired operation.
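To make the difference concrete, here is a minimal sketch in plain Java. It does not use the HBase client API: the class RowLookupSketch and its methods are hypothetical, and a TreeMap merely stands in for a table kept sorted by row key. It only illustrates the shape of the two access patterns, a point lookup versus a range read whose stop bound is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: a sorted in-process map standing in for a table ordered by row key.
// Nothing here is HBase API; all names are made up for illustration.
public class RowLookupSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Point lookup on one known row key, analogous to a Get.
    public String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Range read over [startRow, stopRow), analogous to a Scan.
    // A null bound leaves that side open, so scan(null, null) walks every row.
    public List<String> scan(String startRow, String stopRow) {
        NavigableMap<String, String> view = rows;
        if (startRow != null) {
            view = view.tailMap(startRow, true);   // start bound is inclusive
        }
        if (stopRow != null) {
            view = view.headMap(stopRow, false);   // stop bound is exclusive
        }
        return new ArrayList<>(view.values());
    }

    public static void main(String[] args) {
        RowLookupSketch t1 = new RowLookupSketch();
        t1.put("apple", "row-1");
        t1.put("banana", "row-2");
        t1.put("cherry", "row-3");

        // Known key: ask for exactly that row.
        System.out.println(t1.get("banana"));
        // Same row through a range read: the stop key must sit just past the start key.
        System.out.println(t1.scan("banana", "banana\0"));
        // Unbounded range read: touches the whole table.
        System.out.println(t1.scan(null, null).size());
    }
}
```

In the sketch, the single-row range read needs a stop key just past the start key because the stop bound is exclusive. When the row key is already known, the point lookup states the intent more directly.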

How can I verify that the columns are in memory (of course, the describe statement and the browser reflect it) and are accessed from there and not from disk?

You can use the isInMemory() method provided by HColumnDescriptor to verify whether a particular column family (CF) is marked as in-memory. However, you cannot find out whether the entire table is actually in memory, or whether a given fetch is served from disk or from memory. In-memory blocks have the highest priority, but there is no guarantee that everything stays in memory all the time. One important point: the data is still persisted to disk, even for an in-memory CF.
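The "priority, not guarantee" point can be illustrated with a toy cache in plain Java. This is a deliberately simplified sketch and not HBase's actual block cache algorithm: the class PriorityCacheSketch, its two tiers, and its eviction rule (evict ordinary entries first, in-memory entries only when nothing else is left) are assumptions made for illustration. The "disk" map always holds every row, mirroring the fact that in-memory data is persisted anyway.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: NOT HBase's block cache. All names and the eviction rule are illustrative.
public class PriorityCacheSketch {
    private final int capacity;
    // Backing store: every row lives here regardless of cache contents.
    private final Map<String, String> disk = new HashMap<>();
    // Access-ordered maps: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, String> normal = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<String, String> inMemory = new LinkedHashMap<>(16, 0.75f, true);
    private long hits;
    private long misses;

    public PriorityCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    public void store(String key, String value) {
        disk.put(key, value);
    }

    // The caller gets the same value whether it came from the cache or from "disk".
    public String read(String key, boolean inMemoryFamily) {
        LinkedHashMap<String, String> tier = inMemoryFamily ? inMemory : normal;
        String cached = tier.get(key);
        if (cached != null) {
            hits++;
            return cached;
        }
        misses++;
        String value = disk.get(key);
        if (value != null) {
            tier.put(key, value);
            evictIfNeeded();
        }
        return value;
    }

    // Evict ordinary entries first; in-memory entries go only when no ordinary ones remain.
    private void evictIfNeeded() {
        while (normal.size() + inMemory.size() > capacity) {
            LinkedHashMap<String, String> victimTier = normal.isEmpty() ? inMemory : normal;
            Iterator<String> it = victimTier.keySet().iterator();
            it.next();
            it.remove();
        }
    }

    public long hits() {
        return hits;
    }

    public long misses() {
        return misses;
    }
}
```

Two things carry over from the sketch to the answer above: the read call looks the same whether the value was cached or not, and an in-memory entry can still be evicted once the cache is full of other in-memory entries. Only the hit and miss counters reveal where a read was served from.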

Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?

Yes, it is totally transparent. You don't have to do anything extra.

  1. As far as your implementation is concerned, there is no substantial difference between the two. Both look identical to the client.
