
Gremlin load data format

I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).

Say I have a CSV with the following columns:

  • date_order_created
  • customer_no
  • order_no
  • zip_code
  • item_id
  • item_short_description

  • The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
  • The edge file must have the following columns: id, label, from and to.
  • The vertex file must have: id and label columns.


My questions:

  • Which columns need to be renamed to id, label, from and to? Or should I add new columns?
  • Do I only need one vertex file or multiple?

You can have one or more of each CSV file (nodes, edges), but it is recommended to use a few large files rather than many small ones. This allows the bulk loader to split each file up and load it in parallel.

As to the column headers, let's say you had a node (vertex) file of the form:

~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12

The edge file (for dogs that are friends) might look like this:

~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2

In Amazon Neptune, any user-provided string can be used as a node or edge ID, so long as it is unique. So in your example, if customer_no is guaranteed to be unique, rather than storing it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.

So in summary, you always need to provide the required fields such as ~id and ~label. Once the data is loaded, these are accessed with dedicated Gremlin steps such as hasLabel and hasId. Columns with names from your domain, such as order_no, become properties on the node or edge they are defined with and are accessed using Gremlin steps such as has('order_no', 'ABC-123').
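
To make the split between required columns and property columns concrete, here is a minimal Python sketch that classifies the header of the dog file above. The helper name classify_header is hypothetical, and the sketch assumes that an untyped column is treated as a String and that no cardinality or array syntax is used in the header.

```python
# Hypothetical helper: split a Gremlin CSV header into system and property columns.
SYSTEM_COLUMNS = {"~id", "~label", "~from", "~to"}


def classify_header(header_line):
    """Return (system_columns, properties); properties maps name -> declared type."""
    system, properties = [], {}
    for column in header_line.strip().split(","):
        if column in SYSTEM_COLUMNS:
            system.append(column)
        else:
            # "age:Int" declares a type. A bare name such as "name" is
            # assumed here to default to String.
            name, _, declared = column.partition(":")
            properties[name] = declared or "String"
    return system, properties


system, properties = classify_header("~id,~label,name,breed,age:Int")
print(system)      # ['~id', '~label']
print(properties)  # {'name': 'String', 'breed': 'String', 'age': 'Int'}
```

Anything that lands in properties is what you would later query with has(...), while the system columns are reached with hasId and hasLabel.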

To follow on from Kelvin's response and provide some further detail around data modeling...

Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected, and then validating this approach against the questions (which will turn into queries) that you want to ask of the data.

By way of example, I notice that your dataset has information related to customers, orders, and items. It also has some relevant attributes related to each. Knowing nothing about your use case, I may derive a "naive" model that looks like:

(image: diagram of the naive graph model)

What you have with your original dataset appears similar to what you might see in a relational database as a join table. This is a table that contains multiple foreign keys (the ID and number fields) and maybe some related properties for those relationships. In a graph, relationships are materialized as edges. So in this case, you are expanding this join table into the original set of entities and the relationships between them.

To validate that we have the correct model, we then want to look at the model and see if we can answer the relevant questions that we would want to ask of this data. For example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
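
The "trace our finger" check can be rehearsed on a toy in-memory graph before anything is loaded. The Python sketch below is only an illustration: the IDs, the edge labels has_ordered and contains, and the helper out are assumptions chosen to match the sample rows in this answer. Against a loaded graph, the equivalent Gremlin traversal would be along the lines of g.V('customer001').out('has_ordered').out('contains').

```python
# Toy edge list (from, label, to) using illustrative IDs and labels.
edges = [
    ("customer001", "has_ordered", "order001"),
    ("order001", "contains", "item001"),
]


def out(vertex_ids, label):
    """Follow outgoing edges with the given label from each vertex ID."""
    return [
        to
        for vertex_id in vertex_ids
        for (frm, edge_label, to) in edges
        if frm == vertex_id and edge_label == label
    ]


# Customer -> orders -> items, one hop per edge label.
items = out(out(["customer001"], "has_ordered"), "contains")
print(items)  # ['item001']
```

If a question cannot be answered by chaining hops like this, that is a hint the model is missing a vertex type or an edge.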

After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:

For your vertices:

~id,~label,zip_code,date_order_created,item_short_description
customer001,Customer,90210,,
order001,Order,,2023-01-10,
item001,Item,,,"A small, non-descript black box"

Note that I'm reusing the numbers/IDs for the customer, item, and order as the IDs of their related vertices. This is always good practice, as you can then easily look up a customer, order, or item by that ID. Also note that the CSV becomes a sparse two-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. Properties that are left blank are not created.
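
As a rough sketch of that conversion, the Python below turns source rows into sparse vertex rows using only the standard csv module. The sample row is made up, the labels Customer, Order, and Item follow the example above, and the sketch assumes that customer, order, and item numbers never collide with each other (if they could, you would want to prefix them).

```python
import csv
import io

# Made-up source row using the column names from the question.
SOURCE = """date_order_created,customer_no,order_no,zip_code,item_id,item_short_description
2023-01-10,customer001,order001,90210,item001,"A small, non-descript black box"
"""

HEADER = ["~id", "~label", "zip_code", "date_order_created", "item_short_description"]


def vertex_rows(source_rows):
    """Yield one sparse vertex row per distinct customer, order, and item."""
    seen = set()
    for row in source_rows:
        candidates = [
            {"~id": row["customer_no"], "~label": "Customer",
             "zip_code": row["zip_code"]},
            {"~id": row["order_no"], "~label": "Order",
             "date_order_created": row["date_order_created"]},
            {"~id": row["item_id"], "~label": "Item",
             "item_short_description": row["item_short_description"]},
        ]
        for vertex in candidates:
            # The same customer or item can appear in many source rows.
            if vertex["~id"] not in seen:
                seen.add(vertex["~id"])
                yield vertex


buffer = io.StringIO()
# restval="" leaves the columns that do not apply to a vertex type blank.
writer = csv.DictWriter(buffer, fieldnames=HEADER, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(vertex_rows(csv.DictReader(io.StringIO(SOURCE))))
vertex_csv = buffer.getvalue()
print(vertex_csv)
```

The csv module quotes the description automatically because it contains a comma, which is what keeps the row aligned with the header.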

For your edges, you then need to materialize the relationships between the entities, based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we include the vertex the relationship starts from and the vertex it points to, respectively:

~id,~label,~from,~to
customer001-has_ordered-order001,has_ordered,customer001,order001
order001-contains-item001,contains,order001,item001
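
The edge file can be derived from the same source rows. The Python sketch below builds each edge ID from the two vertex IDs and the label, as described above. The sample row is made up, the helper name edge_rows is hypothetical, and the sketch assumes that one edge per distinct (from, label, to) combination is what you want.

```python
import csv
import io

# Made-up source row using the column names from the question.
SOURCE = """date_order_created,customer_no,order_no,zip_code,item_id,item_short_description
2023-01-10,customer001,order001,90210,item001,"A small, non-descript black box"
"""

HEADER = ["~id", "~label", "~from", "~to"]


def edge_rows(source_rows):
    """Yield one edge row per distinct (from, label, to) combination."""
    seen = set()
    for row in source_rows:
        pairs = [
            (row["customer_no"], "has_ordered", row["order_no"]),
            (row["order_no"], "contains", row["item_id"]),
        ]
        for frm, label, to in pairs:
            # Composite ID: unique as long as the combination is unique.
            edge_id = f"{frm}-{label}-{to}"
            if edge_id not in seen:
                seen.add(edge_id)
                yield {"~id": edge_id, "~label": label, "~from": frm, "~to": to}


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=HEADER, lineterminator="\n")
writer.writeheader()
writer.writerows(edge_rows(csv.DictReader(io.StringIO(SOURCE))))
edge_csv = buffer.getvalue()
print(edge_csv)
```

An order with several items produces the customer-to-order edge only once, because the composite ID is deduplicated before it is written.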

I hope that adds some further color and reasoning around how to get from your source data to the format that Kelvin shows above.
