
How to import Edges from CSV with ETL into OrientDB graph?

I'm trying to import edges from a CSV file into OrientDB. The vertices are stored in a separate file and have already been imported into OrientDB via ETL. So my situation is similar to "OrientDB import edges only using ETL tool" and "OrientDB ETL loading CSV with vertices in one file and edges in another".


Update

Friend.csv

"id","client_id","first_name","last_name"
"0","0","John-0","Doe"
"1","1","John-1","Doe"
"2","2","John-2","Doe"
...

The "id" field is removed by the Friend importer, but the "client_id" is stored. The idea is to have a known, client-side generated id for searching etc.

PeindingFriendship.csv

"friendship_id","client_id","from","to"
"0","0-1","1","0"
"2","0-15","15","0"
"3","0-16","16","0"
...

The "friendship_id" and "client_id" should be imported as attributes of the "PendingFriendship" edge. "from" is the "client_id" of a Friend. "to" is the "client_id" of another Friend. A unique index on "client_id" exists on both Friend and PendingFriendship.


My ETL configuration looks like this:

...
"extractor": {
  "csv": {
  }
},
"transformers": [
  {
    "command": {
      "command": "CREATE EDGE PendingFriendship FROM (SELECT FROM Friend WHERE client_id = '${input.from}') TO (SELECT FROM Friend WHERE client_id = '${input.to}') SET client_id = '${input.client_id}'",
      "output": "edge"
    }
  },
  {
    "field": {
      "fieldName": "from",
      "operation": "remove"
    }
  },
  {
    "field": {
      "fieldName": "to",
      "operation": "remove"
    }
  },
  {
    "field": {
      "fieldName": "friendship_id",
      "operation": "remove"
    }
  },
  {
    "field": {
      "fieldName": "client_id",
      "operation": "remove"
    }
  },
  {
    "field": {
      "fieldName": "@class",
      "value": "PendingFriendship"
    }
  }
],
... 

The issue with this configuration is that it creates two edge entries. One is the expected "PendingFriendship" edge. The second is an empty "PendingFriendship" edge that carries all the fields I removed as attributes with empty values. The import fails at the second row/document, because another empty "PendingFriendship" cannot be inserted without violating the uniqueness constraint. How can I avoid the creation of the unnecessary empty "PendingFriendship"? What is the best way to import edges into OrientDB? All the examples in the documentation use CSV files where vertices and edges are in one file, but that is not the case for me.

I also had a look at the Edge transformer, but it returns a Vertex, not an Edge!

Created PendingFriendships

After some time I found a way (a workaround) to import the above data into OrientDB. Instead of using the ETL tool, I wrote simple Ruby scripts that call the HTTP API of OrientDB using the Batch endpoint.

Steps:

  1. Import the Friends.
  2. Use the response to create a mapping of client_ids to @rids.
  3. Parse PeindingFriendship.csv and build batch requests.
  4. Each friendship is created by its own command.
  5. The mapping from step 2 is used to insert the @rids into the commands from step 4.
  6. Send the batch requests in chunks of 1000 commands.
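
Steps 2 to 6 could be sketched in Ruby roughly as follows. This is not the original script, only a minimal illustration. The method names (`rid_mapping`, `edge_commands`, `batch_bodies`) are hypothetical, and the shape of the Friend records (hashes with "@rid" and "client_id" keys) is an assumption about what the import response provides.

```ruby
require 'csv'
require 'json'

BATCH_SIZE = 1000 # chunk size mentioned in step 6

# Step 2: map client_id => @rid.
# Assumption: each Friend record is a Hash with "@rid" and "client_id" keys.
def rid_mapping(friend_records)
  friend_records.each_with_object({}) do |record, mapping|
    mapping[record['client_id'].to_s] = record['@rid']
  end
end

# Steps 3-5: one CREATE EDGE command per CSV row, with @rids taken from the mapping.
# Values are interpolated unescaped; this assumes ids contain no quote characters.
def edge_commands(csv_text, mapping)
  CSV.parse(csv_text, headers: true).map do |row|
    from_rid = mapping.fetch(row['from']) # raises KeyError for an unknown Friend
    to_rid   = mapping.fetch(row['to'])
    {
      'type' => 'cmd',
      'language' => 'sql',
      'command' => "create edge PendingFriendship from #{from_rid} to #{to_rid} " \
                   "set client_id='#{row['client_id']}'"
    }
  end
end

# Step 6: split the commands into request bodies of at most batch_size operations.
def batch_bodies(commands, batch_size = BATCH_SIZE)
  commands.each_slice(batch_size).map do |chunk|
    JSON.generate('transaction' => true, 'operations' => chunk)
  end
end

# Usage with made-up @rids and one row from the edge CSV.
friends = [
  { '@rid' => '#27:0', 'client_id' => '0' },
  { '@rid' => '#27:1', 'client_id' => '1' }
]
rows = <<~ROWS
  "friendship_id","client_id","from","to"
  "0","0-1","1","0"
ROWS
bodies = batch_bodies(edge_commands(rows, rid_mapping(friends)))
puts bodies.first
```

Using `fetch` instead of `[]` makes a dangling "from" or "to" fail immediately rather than producing a command with an empty @rid.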

Example batch request body:

{
  "transaction" : true,
  "operations" : [
    {
      "type" : "cmd",
      "language" : "sql",
      "command" : "create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'"
    }
  ]
}
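
A body like this could be sent with Ruby's standard `Net::HTTP`, sketched below. The base URL, port 2480, database name "friends", and the admin credentials are placeholders, and the `/batch/<database>` path is my understanding of OrientDB's HTTP API rather than something shown above, so check it against your server version. `build_batch_request` and `send_batch` are hypothetical helper names.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Builds (but does not send) a POST request for the batch endpoint.
# Assumption: the endpoint path is /batch/<database> and uses HTTP basic auth.
def build_batch_request(body, database:, user:, password:, base_url: 'http://localhost:2480')
  uri = URI.join(base_url, "/batch/#{database}")
  request = Net::HTTP::Post.new(uri)
  request.basic_auth(user, password)
  request['Content-Type'] = 'application/json'
  request.body = body
  [uri, request]
end

# Sends a prepared request and returns the Net::HTTPResponse.
def send_batch(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

body = JSON.generate(
  'transaction' => true,
  'operations' => [
    { 'type' => 'cmd', 'language' => 'sql',
      'command' => "create edge PendingFriendship from #27:178 to #27:179 set client_id='4711'" }
  ]
)
uri, request = build_batch_request(body, database: 'friends', user: 'admin', password: 'admin')

# Uncomment to run against a live server:
# response = send_batch(uri, request)
# raise "batch failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
```

Keeping request construction separate from sending makes it easy to inspect or log each batch before it touches the database.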

This isn't an answer to the question I asked, but for me it solves the higher-level goal of importing the data into OrientDB. I therefore leave it to the community to decide whether this question counts as solved.
