
Apache Ignite inserts extremely slow

I'm attempting to load a large matrix into an Apache Ignite master node running in AWS. The EC2 instance has 128 GB of memory and 512 GB of disk space.

The matrix is a CSV file with 50,000 columns and 15,000 rows.

Loading is extremely slow: the first 150 inserts take over 30 minutes to complete. I am using the Python thin client:

import pandas as pd
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)

print('deleting records...')
client.sql('DELETE FROM full_test_table')

df = pd.read_csv('exon.csv')

# Every column except the first, quoted as SQL identifiers
cols = list(df)[1:]
names = 'name, ' + ', '.join('"' + item + '"' for item in cols)

for index, row in df.iterrows():
    print('inserting for {0}'.format(row[0]))
    # Quote the name value as a SQL string literal
    row[0] = "'{0}'".format(row[0])

    values = ', '.join(str(item) for item in row)
    sql = 'INSERT INTO full_test_table ({0}) VALUES ({1})'.format(names, values)
    client.sql(sql)

I would like to use Python to load the data, as I'm more familiar with it than Java. This seems unreasonably slow to me: even PostgreSQL can handle these inserts in seconds. What's the issue?
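Part of the cost in the loop above is one full network round trip per row. One way to reduce that, while staying with the thin client, is to batch many rows into a single multi-row INSERT statement, assuming the table accepts a standard multi-row VALUES list. A minimal sketch of the statement builder (the table and column names follow the code above; the quoting here does no escaping and is for illustration only):

```python
def build_batch_insert(table, columns, rows):
    """Build one multi-row INSERT covering every tuple in `rows`.

    `columns` is a list of column names; `rows` is a list of value tuples.
    String values are wrapped as SQL string literals, everything else is
    passed through str(). Embedded quotes are not escaped - sketch only.
    """
    names = ', '.join('"{0}"'.format(c) for c in columns)

    def literal(v):
        return "'{0}'".format(v) if isinstance(v, str) else str(v)

    values = ', '.join(
        '({0})'.format(', '.join(literal(v) for v in row)) for row in rows
    )
    return 'INSERT INTO {0} ({1}) VALUES {2}'.format(table, names, values)

# Three rows sent as one statement instead of three round trips
sql = build_batch_insert('full_test_table', ['name', 'c1'],
                         [('a', 1), ('b', 2), ('c', 3)])
```

The resulting string can be passed to `client.sql(sql)` once per batch, e.g. for every 500 rows of the DataFrame, instead of once per row.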

I've also tried the COPY command from CSV; that doesn't seem to work any faster.

As of Ignite 2.7, the Python thin client, like the other thin clients, uses one of the server nodes as a proxy - usually the one you set in the connection string. The proxy receives all requests from the client and forwards them to the rest of the servers as needed; it also sends result sets back to the client. So the proxy, as well as overall network throughput, might be a bottleneck in your case. Check that the proxy server isn't overutilizing its CPUs and doesn't have issues related to garbage collection or memory utilization. The proxy will no longer be needed in Ignite 2.8.

Anyway, the fastest way to preload data into Ignite is with the IgniteDataStreamer API. It is not available for Python yet, but a Java application is pretty straightforward: you can use this example as a reference, putting your records into the streamer through the key-value APIs.
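Although the streaming API isn't exposed to Python, the thin client's key-value `put_all` sends many cache entries in a single request, which avoids the per-row round trip if SQL access to the data isn't required. A sketch under that assumption (the cache name `full_test_cache` and the choice of the name column as key are illustrative, not from the original code):

```python
def chunked(pairs, size):
    """Yield dicts of at most `size` entries built from (key, value) tuples."""
    batch = {}
    for key, value in pairs:
        batch[key] = value
        if len(batch) >= size:
            yield batch
            batch = {}
    if batch:
        yield batch

def load(cache, rows, batch_size=1000):
    """`cache` is a pyignite Cache, e.g. client.get_or_create_cache('full_test_cache').

    Each put_all call is one network round trip covering batch_size rows.
    """
    for batch in chunked(rows, batch_size):
        cache.put_all(batch)

# Usage (requires a running Ignite node):
# from pyignite import Client
# client = Client()
# client.connect('127.0.0.1', 10800)
# load(client.get_or_create_cache('full_test_cache'),
#      ((row[0], list(row[1:])) for _, row in df.iterrows()))
```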

If you'd like to continue using SQL INSERTs, use either the JDBC or the ODBC driver together with the SET STREAMING command.

I have just tried it from Java, and I see around 25 inserts per second over JDBC. That is not a terribly high number, but it is much better than the 30 minutes you are describing. Maybe it is a Python client issue.
