Parallel scan requests to HBase perform differently in Java and Python

Question

Statement

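We have an HBase cluster of 10 machines holding billions of rows. Each row consists of a single column family with ~20 columns. We need to perform frequent scan requests, each specified by a start-row prefix and an end-row prefix. A scan usually returns between 100 and 10,000 rows.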
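A scan bounded by a row-key prefix is normally expressed as a half-open range [start, stop). The following is a minimal sketch of how such a range could be derived from a prefix; the helper name `prefix_to_range` is hypothetical and is not part of any HBase client library.

```python
from typing import Optional, Tuple


def prefix_to_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every row key that starts with prefix.

    stop_row is exclusive; None means "scan to the end of the table".
    """
    # Trailing 0xFF bytes cannot be incremented without overflowing, so drop them.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Empty or all-0xFF prefix: there is no finite upper bound.
        return prefix, None
    # Increment the last remaining byte to get the smallest key after the prefix range.
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop
```

HappyBase's `Table.scan` also accepts a `row_prefix` argument, which may make a helper like this unnecessary when the range is a plain prefix.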

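Because requests can be frequent (up to several per minute), performance is a priority. Because of the system's architecture, we want to implement our solution in Python instead of the current Java code. The problem is that with Python we get performance 5 to 10 times worse than with Java.

What works now

We have Java code that performs scan requests to HBase. It uses the usual HBase Java API: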

public List<String> getNumber(Number key) throws IOException {
    List<String> res = new ArrayList<>();

    // Scan the half-open row range [start_key, next_key).
    String start_key = key.getNumber();
    String next_key = key.getNumber() + "1";
    byte[] prefix_begin = Bytes.toBytes(start_key);
    byte[] prefix_end = Bytes.toBytes(next_key);
    Scan scan = new Scan(prefix_begin, prefix_end);
    // try-with-resources closes the scanner and releases its server-side resources.
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            byte[] row = result.getRow();
            res.add(Bytes.toString(row));
        }
    }

    return res;
}

These queries are parallelized with the help of the Callable interface and a ScheduledThreadPoolExecutor. The call() method of each callable simply runs getNumber(Number key).

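public List<String> getNumbers(List<Number> keys) {
    List<String> res = new ArrayList<>();

    // One callable per key.
    // Assumption: CallingCallable implements Callable<List<String>>.
    List<Callables.CallingCallable> callables = new ArrayList<>();
    for (Number source : keys) {
        callables.add(new Callables.CallingCallable(this, source));
    }

    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(24);
    try {
        // invokeAll blocks until every scan has completed.
        List<Future<List<String>>> futures = executor.invokeAll(callables);
        for (Future<List<String>> future : futures) {
            res.addAll(future.get());
        }
    } catch (InterruptedException ex) {
        // Restore the interrupt flag instead of swallowing the exception.
        Thread.currentThread().interrupt();
    } catch (ExecutionException ex) {
        throw new RuntimeException(ex.getCause());
    } finally {
        executor.shutdown();
    }

    return res;
}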

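This works quite well and achieves the following performance:

1.5 - 2.0 seconds per single scan, and
5.0 - 8.0 seconds per 100 parallel scans

What we tried

We tried to implement a similar solution in Python with the help of the HappyBase library: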

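# Imports needed by both methods below.
from functools import partial
from multiprocessing.pool import ThreadPool

import happybase

@staticmethod
def execute_query(key, table_name, con_pool):
    items = []
    # Borrow one pooled connection for the duration of the scan.
    with con_pool.connection() as connection:
        table = happybase.Table(table_name, connection)
        row_start, row_end = get_start_and_end_row(key)
        selected_rows = table.scan(row_start=row_start, row_stop=row_end)
        # Distinct loop variable so the `key` argument is not shadowed.
        for row_key, data in selected_rows:
            items.append(Item(data))
    return items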

@staticmethod
def execute_in_parallel(table_name, hbase_host, hbase_port, keys):
    pool = ThreadPool(24)
    # One pooled connection per worker thread.
    con_pool = happybase.ConnectionPool(size=24, host=hbase_host, port=hbase_port)
    execute_query_partial = partial(execute_query, table_name=table_name, con_pool=con_pool)
    # chunksize=1 hands out one key per task.
    result_info = pool.map_async(execute_query_partial, keys, chunksize=1)
    result = result_info.get()

Performance achieved:

2.0 - 3.0 seconds per single scan
30 - 55 seconds per 100 parallel scans

As we can see, the performance of a single scan is very similar, but the parallelized tasks are much slower in Python.
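To put the two sets of timings on a common scale, the figures quoted in the question can be turned into a rough parallel speedup: the estimated time for 100 sequential scans divided by the measured time for 100 parallel scans. The sketch below uses only the numbers stated above; the function name `speedup_range` is illustrative, and the result is a coarse estimate because the single-scan and batch timings were not necessarily measured under the same conditions.

```python
def speedup_range(single_scan_s, batch_s, n_scans=100):
    """Bounds on (estimated sequential time) / (measured parallel time)."""
    seq_lo, seq_hi = (n_scans * t for t in single_scan_s)
    par_lo, par_hi = batch_s
    # Worst case: fastest sequential estimate against slowest parallel run, and vice versa.
    return seq_lo / par_hi, seq_hi / par_lo


java = speedup_range((1.5, 2.0), (5.0, 8.0))
python = speedup_range((2.0, 3.0), (30.0, 55.0))
print("Java:   %.1fx to %.1fx" % java)
print("Python: %.1fx to %.1fx" % python)
```

By this estimate the Java version gains roughly 19x to 40x from running in parallel, while the Python version gains roughly 3.6x to 10x, even though both use 24 worker threads.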

Any ideas why this happens? Is there a problem with our Python / HappyBase code? Or is it the performance of the HBase Thrift server (which HappyBase uses to connect to HBase)?
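One way to narrow this down is to compare the summed per-task latency with the wall-clock time of the whole batch. The following is a sketch under stated assumptions, not a diagnosis: `fake_scan` is a hypothetical stand-in that only sleeps (pure I/O wait), and `timed` and `measure` are illustrative helper names. With a real scan function substituted for `fake_scan`, a wall time close to the summed task time would suggest that the scans are effectively being serialized somewhere; this measurement alone cannot tell whether that happens in the Python client or on the Thrift server.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_scan(key):
    """Hypothetical stand-in for a real scan: waits, then returns the key."""
    time.sleep(0.05)
    return key


def timed(task):
    """Wrap task so that it returns (result, elapsed_seconds)."""
    def wrapper(arg):
        started = time.perf_counter()
        result = task(arg)
        return result, time.perf_counter() - started
    return wrapper


def measure(task, keys, workers=24):
    """Run task over keys in a thread pool; return (wall_s, summed_task_s)."""
    started = time.perf_counter()
    with ThreadPool(workers) as pool:
        outcomes = pool.map(timed(task), keys, chunksize=1)
    wall = time.perf_counter() - started
    return wall, sum(elapsed for _, elapsed in outcomes)


if __name__ == "__main__":
    wall, total = measure(fake_scan, range(48))
    print("wall %.2fs, summed task time %.2fs, overlap %.1fx"
          % (wall, total, total / wall))
```

With the sleeping stand-in, the threads overlap almost perfectly, so the wall time is far below the summed task time. That is the behavior to compare the real scans against.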

Answer 1

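There is a way to use Jython to access the Java JVM and Java libraries. With it, you can write Python and Java in the same source file and then compile the code to Java bytecode for the JVM. This should give the same performance as Java, since Jython itself is written in Java, so you would not have to write in pure Java.

Java scores much higher than Python in benchmarks. Here is a site comparing the performance of Java and Python.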

http://benchmarksgame.alioth.debian.org/u64q/python.html

Here is the Jython website: http://www.jython.org/

Solution 1
1 · Accepted · 2016-01-13 01:17:31