
Neo4J Very Large Admin Import with limited RAM

I am importing several TB of CSV data into Neo4J for a project I have been working on. I have enough fast storage for the estimated 6.6TiB, but the machine has only 32GB of memory, and the import tool is suggesting 203GB to complete the import.

When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way I can import this large dataset with the limited amount of memory I have? Or, failing that, with the ~128GB maximum that this machine's motherboard can support?

Available resources:
  Total machine memory: 30.73GiB
  Free machine memory: 14.92GiB
  Max heap memory : 6.828GiB
  Processors: 16
  Configured max memory: 21.51GiB
  High-IO: true

WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
  Estimated number of nodes: 15.14 G
  Estimated number of node properties: 97.72 G
  Estimated number of relationships: 37.58 G
  Estimated number of relationship properties: 0.00 
  Estimated disk space usage: 6.598TiB
  Estimated required memory usage: 202.6GiB

(1/4) Node import 2022-10-08 19:01:43.953+0000
  Estimated number of nodes: 15.14 G
  Estimated disk space usage: 5.436TiB
  Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... ..........   5% ∆1h 38m 2s 867ms
neo4j@79d2b0538617:~/import$

TL;DR: Use Periodic Commit, or Transaction Batching.

If you're trying to follow the Operations Manual: Neo4j Admin Import, and your csv matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV...:

  1. Stop the db.
  2. Put your csv at neo4j/import/myfile.csv.
    • If you're using Desktop: Project > DB > click the ... on the right > Open Folder
  3. Add the APOC plugin (a quick check that it loaded is shown below).
  4. Start the DB.
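
Once the DB is back up, one quick way to confirm the APOC plugin actually loaded is:

RETURN apoc.version();
// returns the installed APOC version string; fails with an unknown-function error if APOC is not loaded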

Next, open a browser instance, run the following (adjust for your data), and leave it until tomorrow:

USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
WITH line[3] AS nodeLabels, {
  id: line[0],
  title: line[1],
  year: toInteger(line[2])
} AS nodeProps
// create each node, taking its labels from the semicolon-separated column
CALL apoc.create.node(SPLIT(nodeLabels, ';'), nodeProps) YIELD node
RETURN count(*)

Note: There are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV, you don't care about what labels the nodes get initially, and you have headers, you can skip the complex APOC call and probably just do something like the following:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CREATE (a :ImportedNode)
SET a = line
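
If you're on a newer Neo4j (USING PERIODIC COMMIT is deprecated in 4.4 and removed in 5.x), the "Transaction Batching" route from the TL;DR above looks roughly like this; the batch size is just a placeholder to tune:

:auto LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CALL {
  WITH line
  CREATE (a :ImportedNode)
  SET a = line
} IN TRANSACTIONS OF 10000 ROWS

The :auto prefix tells Browser to run the statement in an auto-commit transaction, which CALL { ... } IN TRANSACTIONS requires.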

File for Each Label

Original Asker mentioned having a separate csv for each label. In such instances it may be helpful to have one great-big single command that can handle all of it, rather than needing to manually step through each part of the operation.

Assuming two label types, each with a unique 'id' property, and one with a 'parent_id' referencing the other label...

// NOTE: USING PERIODIC COMMIT must be the very first clause of a query, so it
// cannot follow UNWIND; batch the writes with CALL { ... } IN TRANSACTIONS instead.
:auto UNWIND [
  { file: 'country.csv', label: 'Country'},
  { file: 'city.csv', label: 'City'}
] AS importFile
LOAD CSV WITH HEADERS FROM 'file:///' + importFile.file AS line
CALL {
  WITH importFile, line
  // merge on the id column, then copy every csv column onto the node
  CALL apoc.merge.node([importFile.label], {id: line.id}) YIELD node
  SET node = line
} IN TRANSACTIONS OF 10000 ROWS
;

// then build the relationships
MATCH (city :City)
WHERE city.parent_id IS NOT NULL
MATCH (country :Country {id: city.parent_id})
MERGE (city)-[:IN]->(country)
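
A final hedged note: at this scale (~15 G nodes per the import estimate), the id lookups need to be index-backed and the relationship pass should be batched too, or it will try to build everything in a single transaction. A sketch, assuming Neo4j 4.4+ syntax and the same City/Country 'id' model; ideally create the constraints before running the loads above:

// unique constraints give MERGE/MATCH-by-id an index to use
CREATE CONSTRAINT country_id IF NOT EXISTS FOR (c:Country) REQUIRE c.id IS UNIQUE;
CREATE CONSTRAINT city_id IF NOT EXISTS FOR (c:City) REQUIRE c.id IS UNIQUE;

// batched version of the relationship pass, committing every 10k cities
:auto MATCH (city :City)
WHERE city.parent_id IS NOT NULL
CALL {
  WITH city
  MATCH (country :Country {id: city.parent_id})
  MERGE (city)-[:IN]->(country)
} IN TRANSACTIONS OF 10000 ROWS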
