Parallel scan requests to HBase perform differently in Java and Python

Question

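Statement

We have a 10-machine HBase cluster holding billions of rows. Each row consists of a single column family with ~20 columns. We need to run frequent scan requests bounded by a start-row prefix and an end-row prefix. Each scan usually returns between 100 and 10,000 rows.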
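
As an illustration of a scan bounded by a row prefix: HBase orders row keys as raw bytes, so every key that starts with a given prefix lies in the half-open range from the prefix itself up to the prefix with its last byte incremented. The sketch below computes that range; `prefix_scan_range` is a hypothetical helper name, not part of any HBase client library.

```python
from typing import Optional, Tuple


def prefix_scan_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every row key that starts with `prefix`.

    stop_row is exclusive; None means "scan to the end of the table".
    """
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return prefix, None
    # Incrementing the last remaining byte gives the smallest key past the prefix range.
    stop_row = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop_row


print(prefix_scan_range(b"12345"))  # (b'12345', b'12346')
```

The returned pair can be passed as the start and stop rows of a scan in either client.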

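Because requests can arrive fairly often (up to several per minute), performance is a priority. Because of the system's architecture, we want to implement our solution in Python rather than in the current Java code. The problem is that with Python we get performance 5 to 10 times worse than with Java.

What works now

We have Java code that executes scan requests against HBase. It uses the usual HBase Java API: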

public List<String> getNumber(Number key) throws IOException {
    List<String> res = new ArrayList<>();

    // Scan the half-open row range [start_key, next_key).
    String start_key = key.getNumber();
    String next_key = key.getNumber() + "1";
    byte[] prefix_begin = Bytes.toBytes(start_key);
    byte[] prefix_end = Bytes.toBytes(next_key);
    Scan scan = new Scan(prefix_begin, prefix_end);
    // try-with-resources closes the scanner once iteration is done.
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            // Only the row key is collected; cell values are ignored.
            res.add(Bytes.toString(result.getRow()));
        }
    }

    return res;
}

These queries are parallelized with the help of the Callable interface and a ScheduledThreadPoolExecutor. The call() method of each callable simply runs getNumber(Number key).

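public List<String> getNumbers(List<Number> keys) {
    List<String> res = new ArrayList<>();

    // CallingCallable is assumed to implement Callable<List<String>>.
    List<Callables.CallingCallable> callables = new ArrayList<>();
    for (Number source : keys) {
        callables.add(new Callables.CallingCallable(this, source));
    }

    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(24);
    try {
        // invokeAll blocks until every scan has finished.
        List<Future<List<String>>> futures = executor.invokeAll(callables);
        for (Future<List<String>> future : futures) {
            res.addAll(future.get());
        }
    } catch (InterruptedException ex) {
        // Restore the interrupt flag instead of swallowing it.
        Thread.currentThread().interrupt();
    } catch (ExecutionException ex) {
        throw new RuntimeException(ex.getCause());
    } finally {
        executor.shutdown();
    }

    return res;
}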

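This works quite well and achieves the following performance:

1.5 - 2.0 sec per single scan and
5.0 - 8.0 sec per 100 parallel scans

What we tried

We tried to implement a similar solution in Python with the help of the HappyBase library: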

import happybase
from functools import partial
from multiprocessing.pool import ThreadPool

@staticmethod
def execute_query(key, table_name, con_pool):
    items = []
    # Borrow a connection from the pool for the duration of the scan.
    with con_pool.connection() as connection:
        table = happybase.Table(table_name, connection)
        row_start, row_end = get_start_and_end_row(key)
        selected_rows = table.scan(row_start=row_start, row_stop=row_end)
        # row_key avoids shadowing the `key` argument.
        for row_key, data in selected_rows:
            items.append(Item(data))
    return items

@staticmethod
def execute_in_parallel(table_name, hbase_host, hbase_port, keys):
    pool = ThreadPool(24)
    con_pool = happybase.ConnectionPool(size=24, host=hbase_host, port=hbase_port)
    execute_query_partial = partial(execute_query, table_name=table_name, con_pool=con_pool)
    # chunksize=1 hands each worker one key at a time.
    result_info = pool.map_async(execute_query_partial, keys, chunksize=1)
    result = result_info.get()
    pool.close()
    pool.join()
    return result
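
For a like-for-like comparison with the Java version, which collects only row keys, a keys-only scan could look like the sketch below. It assumes the scan is a pure prefix match, so that the `row_prefix` argument of HappyBase's `Table.scan` can replace explicit start and stop rows, and it assumes the Thrift server accepts the filter-language string `KeyOnlyFilter()` to avoid shipping cell values. `scan_row_keys` is a hypothetical helper name, and `table` is expected to be a `happybase.Table`.

```python
def scan_row_keys(table, prefix, batch_size=1000):
    """Return the row keys of all rows whose key starts with `prefix`.

    `table` is assumed to behave like happybase.Table: scan() yields
    (row_key, data) tuples and accepts row_prefix, filter and batch_size.
    """
    keys = []
    rows = table.scan(
        row_prefix=prefix,
        # Assumption: the server-side filter strips cell values from the results.
        filter="KeyOnlyFilter()",
        batch_size=batch_size,
    )
    for row_key, _data in rows:
        keys.append(row_key)
    return keys
```

If timings with this variant move much closer to the Java numbers, the gap is more likely in data transfer and row decoding than in the thread pool itself.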

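Performance achieved:

2.0 - 3.0 sec per single scan
30 - 55 sec per 100 parallel scans

As we can see, the performance of a single scan is very similar. But the parallelized tasks are much slower in Python.

Any ideas why this happens? Maybe there is a problem with our Python/HappyBase code? Or with the performance of the HBase Thrift server (which HappyBase uses to connect to HBase)?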
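
One way to start separating these possibilities is to time the same task serially and in the thread pool. The sketch below is only a measuring harness under stated assumptions: `compare_serial_and_parallel` and `fake_scan` are hypothetical names, and `fake_scan` merely sleeps to imitate a blocking network call. In practice the real scan function and a small sample of keys would be passed instead.

```python
import time
from multiprocessing.pool import ThreadPool


def compare_serial_and_parallel(task, args, workers=24):
    """Run `task` over `args` serially, then in a thread pool.

    Returns (serial_seconds, parallel_seconds).
    """
    started = time.perf_counter()
    for arg in args:
        task(arg)
    serial = time.perf_counter() - started

    started = time.perf_counter()
    with ThreadPool(workers) as pool:
        pool.map(task, args, chunksize=1)
    parallel = time.perf_counter() - started
    return serial, parallel


def fake_scan(key):
    # Stand-in for a real scan: sleeping releases the GIL, like blocking I/O.
    time.sleep(0.05)
    return key


if __name__ == "__main__":
    serial, parallel = compare_serial_and_parallel(fake_scan, list(range(24)))
    print("serial: %.2fs, parallel: %.2fs" % (serial, parallel))
```

If the stand-in speeds up in the pool but the real scan does not, the time is probably going either to Python-side work that holds the GIL (for example decoding rows) or to a bottleneck outside the client, such as the Thrift server.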

Answer 1

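There is a way to use Jython to get access to the Java JVM and Java libraries. With it, you can use Java classes directly from Python source code, which is then compiled into Java bytecode for the JVM. This should give performance comparable to code written in Java, so you don't have to write in pure Java.

Java scores much higher than Python in benchmarks. Here is a site that compares the performance of Java and Python: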

http://benchmarksgame.alioth.debian.org/u64q/python.html

And here is the Jython website: http://www.jython.org/

Solution 1
1 Accepted 2016-01-13 01:17:31
