Parallel scan requests to HBase in Java and Python have different performance

Statement

We have a 10-machine HBase cluster holding billions of rows. Every row consists of one column family and ~20 columns. We need to perform frequent scan requests, each specified by a start row prefix and an end row prefix. A scan usually returns about 100 to 10,000 rows.
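Since HBase compares row keys as raw bytes and treats the stop row of a scan as exclusive, a prefix has to be turned into a (start, stop) pair before scanning. The following is a minimal sketch of that conversion for the case where a scan should cover every key sharing a single prefix; the function names `prefix_stop_row` and `prefix_range` are made up for illustration and are not part of HBase or HappyBase.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Return the smallest row key that sorts after every key starting with `prefix`.

    An empty result means "no upper bound" (the prefix was empty or all 0xFF bytes).
    """
    # Trailing 0xFF bytes cannot be incremented without overflow, so drop them.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    # Incrementing the last remaining byte gives an exclusive upper bound.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_range(prefix: bytes):
    """Return (start_row, stop_row) covering exactly the keys with this prefix."""
    return prefix, prefix_stop_row(prefix)
```

For example, `prefix_range(b"123")` yields `(b"123", b"124")`, which could be passed as the start and stop rows of a scan.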

Because requests can come very often (up to several requests per minute), performance is a priority. Because of the system's architecture, we want to implement our solution in Python instead of the current Java code. The problem is that with Python we get 5x-10x worse performance than with Java.

What works now

We have Java code that performs scan requests to HBase. It uses the usual HBase Java API:

public List<String> getNumber(Number key) throws IOException {
    List<String> res = new ArrayList<>();

    // Scan the half-open key range [start_key, next_key).
    String start_key = key.getNumber();
    String next_key = key.getNumber() + "1";
    byte[] prefix_begin = Bytes.toBytes(start_key);
    byte[] prefix_end = Bytes.toBytes(next_key);
    Scan scan = new Scan(prefix_begin, prefix_end);
    // try-with-resources closes the scanner once iteration is done.
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            byte[] row = result.getRow();
            res.add(Bytes.toString(row));
        }
    }

    return res;
}

These queries are parallelized with the help of the Callable interface and ScheduledThreadPoolExecutor. The call() method of every callable just runs getNumber(Number key).

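public List<String> getNumbers(List<Number> keys) {
    List<String> res = new ArrayList<String>();

    List<Callables.CallingCallable> callables = new ArrayList<>();
    for (Number source : keys) {
        callables.add(new Callables.CallingCallable(this, source));
    }

    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(24);

    try {
        // Assumes CallingCallable implements Callable<List<String>>.
        List<Future<List<String>>> futures = executor.invokeAll(callables);
        for (Future<List<String>> future : futures) {
            res.addAll(future.get());
        }
    } catch (InterruptedException ex) {
        // Restore the interrupt flag instead of silently swallowing it.
        Thread.currentThread().interrupt();
    } catch (ExecutionException ex) {
        throw new RuntimeException(ex.getCause());
    } finally {
        executor.shutdown();
    }

    return res;
}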

This works pretty well and achieves the following performance:

  • 1.5 - 2.0 sec per single scan
  • 5.0 - 8.0 sec per 100 parallelized scans

What we tried

We tried to implement a similar solution in Python with the help of the HappyBase library:

@staticmethod
def execute_query(key, table_name, con_pool):
    items = []
    with con_pool.connection() as connection:
        table = happybase.Table(table_name, connection)
        row_start, row_end = get_start_and_end_row(key)
        selected_rows = table.scan(row_start=row_start, row_stop=row_end)
        # Use a separate loop variable so the `key` argument is not shadowed.
        for row_key, data in selected_rows:
            items.append(Item(data))
    return items

@staticmethod
def execute_in_parallel(table_name, hbase_host, hbase_port, keys):
    pool = ThreadPool(24)
    con_pool = happybase.ConnectionPool(size=24, host=hbase_host, port=hbase_port)
    execute_query_partial = partial(execute_query, table_name=table_name, con_pool=con_pool)
    result_info = pool.map_async(execute_query_partial, keys, chunksize=1)
    result = result_info.get()
    return result

Achieved performance:

  • 2.0 - 3.0 sec per single scan
  • 30 - 55 sec per 100 parallelized scans

As we can see, the performance of a single scan is very similar. But the parallelized tasks in Python are much slower.
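To put the two sets of timings on a common scale, one rough approach is to compare each batch time with the time 100 scans would take if run one after another. The sketch below does that arithmetic with the ranges reported in the question; it is only an estimate, since a single scan measured in isolation may include one-off costs, and `effective_speedup` is a name invented here.

```python
def effective_speedup(single_scan_sec: float, batch_sec: float, n_scans: int = 100) -> float:
    """Ratio of the estimated serial time (n_scans * single scan) to the measured batch time."""
    return (n_scans * single_scan_sec) / batch_sec


# Timings reported in the question: seconds per single scan, seconds per 100 parallel scans.
java_low = effective_speedup(1.5, 8.0)     # 18.75
java_high = effective_speedup(2.0, 5.0)    # 40.0
python_low = effective_speedup(2.0, 55.0)  # about 3.64
python_high = effective_speedup(3.0, 30.0) # 10.0

print(f"Java:   {java_low:.1f}x to {java_high:.1f}x")
print(f"Python: {python_low:.1f}x to {python_high:.1f}x")
```

By this estimate the Java version gains roughly 19x to 40x from parallelization, while the Python version gains roughly 4x to 10x with the same pool size of 24.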

Any ideas why this happens? Maybe there are some issues with our Python/HappyBase code? Or is it the performance of the HBase Thrift server (which HappyBase uses to connect to HBase)?
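One way to narrow this down might be to record the latency of each task as well as the total wall time of the batch. The sketch below shows such a harness; `timed`, `profile_parallel`, and `fake_scan` are hypothetical names, and `fake_scan` only sleeps instead of calling HBase, so it would need to be replaced with the real query function.

```python
import time
from multiprocessing.pool import ThreadPool


def timed(fn):
    """Wrap fn so that each call returns (result, seconds_elapsed)."""
    def wrapper(arg):
        start = time.perf_counter()
        result = fn(arg)
        return result, time.perf_counter() - start
    return wrapper


def profile_parallel(fn, args, workers=24):
    """Run fn over args in a thread pool; return (wall_seconds, per_task_seconds)."""
    start = time.perf_counter()
    with ThreadPool(workers) as pool:
        timed_results = pool.map(timed(fn), args, chunksize=1)
    wall = time.perf_counter() - start
    return wall, [elapsed for _, elapsed in timed_results]


def fake_scan(key):
    """Hypothetical stand-in for the real scan; sleeps instead of calling HBase."""
    time.sleep(0.05)
    return [key]


if __name__ == "__main__":
    wall, latencies = profile_parallel(fake_scan, list(range(48)), workers=24)
    print(f"wall: {wall:.2f}s, slowest task: {max(latencies):.2f}s, "
          f"mean task: {sum(latencies) / len(latencies):.2f}s")
```

If per-task latency under load stays close to the single-scan latency, the tasks are probably waiting in a queue somewhere on the client side. If per-task latency grows with concurrency, the contention is more likely in something the threads share, such as the Thrift server or the interpreter itself.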

One option is Jython, which runs on the JVM and gives you access to Java libraries. With it you can use Python code and Java classes in the same source file. The code is then compiled into Java bytecode for the JVM. Since Jython itself is implemented in Java, this should give performance similar to the Java version, and you won't have to write pure Java.

Java scores much higher than Python in benchmarks. Here is a website that compares the performance of Java and Python:

http://benchmarksgame.alioth.debian.org/u64q/python.html

And here is the website for Jython: http://www.jython.org/
