Use ETL to load CSV data into OrientDB containing a SPATIAL index

I'm interested in loading data into OrientDB from CSV files that contain spatial coordinates in WGS84 lat/long.

I'm using OrientDB 2.2.8 and have the Lucene spatial module added to my $ORIENTDB_HOME/lib directory.

I'm loading my data into a database using ETL and would like to add the spatial index, but I'm not sure how to do this.

Say my CSV file has the following columns:

  • Label (string)
  • Latitude (float)
  • Longitude (float)

I've tried this in my ETL:

"loader": {
    "orientdb": {
        "dbURL": "plocal:myDatabase.orientdb",
        "dbType": "graph",
        "batchCommit": 1000,
        "classes": [ { "name": "vertex", "extends", "V" } ],
        "indexes": [ { "class": "vertex", "fields":["Label:string"], "type":"UNIQUE" },
                     { "class": "Label", "fields":["Latitude:float","Longitude:float"], "type":"SPATIAL" }
                   ]
    }
}

but it's not working. I get the following error message:

ETL process has problem: com.orientechnologies.orient.core.index.OIndexException: Index with type SPATIAL and algorithm null does not exist.

Has anyone looked into creating spatial indices via ETL? Most of the examples I'm seeing use either Java or a direct query.
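
For reference, the direct-query examples I've seen look roughly like this in OrientDB 2.2 (just a sketch, not something I've run against my own data; the Place class and the coordinates are placeholders):

CREATE CLASS Place EXTENDS V
CREATE PROPERTY Place.Name     STRING
CREATE PROPERTY Place.Location EMBEDDED OPoint
CREATE INDEX Place.Location ON Place(Location) SPATIAL ENGINE LUCENE
INSERT INTO Place SET Name = 'Test', Location = ST_GeomFromText('POINT(-75.148892 -14.692131)')

Note that WKT POINT takes longitude first, then latitude.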

Thanks in advance for any advice.

I was able to get it to load using the legacy spatial capabilities.

I put together a cheezy dataset that has some coordinates for a few of the Nazca line geoglyphs:

Name,Latitude,Longitude
Hummingbird,-14.692131,-75.148892
Monkey,-14.7067274,-75.1475391
Condor,-14.6983457,-75.1283374
Spider,-14.694363,-75.1235815
Spiral,-14.688309,-75.122757
Hands,-14.694459,-75.113881
Tree,-14.693897,-75.114467
Astronaut,-14.745222,-75.079755
Dog,-14.706401,-75.130788

I used a script, createVertexGeoGlyph.osql, to create my GeoGlyph class:

set echo true
connect PLOCAL:./nazca.orientdb admin admin
CREATE CLASS GeoGlyph EXTENDS V CLUSTERS 1
CREATE PROPERTY GeoGlyph.Name      STRING
CREATE PROPERTY GeoGlyph.Latitude  FLOAT
CREATE PROPERTY GeoGlyph.Longitude FLOAT
CREATE PROPERTY GeoGlyph.Tag       EMBEDDEDSET STRING
CREATE INDEX GeoGlyph.index.Location ON GeoGlyph(Latitude,Longitude) SPATIAL ENGINE LUCENE

which I load into my database using:

$ console.sh createVertexGeoGlyph.osql

I do it this way because it seems to work more consistently for me. I've had some difficulty getting the ETL engine to create defined properties off CSV imports when I've wanted it to; sometimes it cooperates and creates my properties, and other times it has trouble.

So, the next step to get the data in is to create my .json files for the ETL process. I like to make two: one that is file-specific, and another that is common, since I often have datasets that span multiple files.

First, I have my nazca_lines.json file:

{
    "config": {
        "log": "info",
        "fileDirectory": "./",
        "fileName": "nazca_lines.csv"
    }
}

Next is the commonGeoGlyph.json file:

{
    "begin": [
               { "let": { "name": "$filePath",  "expression": "$fileDirectory.append($fileName )" } },
             ],
    "config": { "log": "debug" },
    "source": { "file": { "path": "$filePath" } },
    "extractor":
        {
        "csv": { "ignoreEmptyLines": true,
                 "nullValue": "N/A",
                 "separator": ",",
                 "columnsOnFirstLine": true,
                 "dateFormat": "yyyy-MM-dd"
               }
        },
    "transformers": [
            { "vertex": { "class": "GeoGlyph" } },
            { "code":   { "language":"Javascript",
                          "code": "print('>>> Current record: ' + record); record;" }
            }
        ],
    "loader": {
        "orientdb": {
            "dbURL": "plocal:nazca.orientdb",
            "dbType": "graph",
            "batchCommit": 1000,
            "classes": [],
            "indexes": []
        }
    }
}

There's more stuff in the file than is strictly necessary; I use it as a template for a lot of things. In this case, I don't have to create the index in the ETL file itself because I already created it in the createVertexGeoGlyph.osql file.
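
If you want to confirm the index is there before loading, one quick check (a sketch; I believe this metadata query works in 2.2, but double-check on your version) is from the console:

connect PLOCAL:./nazca.orientdb admin admin
SELECT expand(indexes) FROM metadata:indexmanager

GeoGlyph.index.Location should show up in the list.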

To load the data, I just use the oetl.sh script:

$ oetl.sh commonGeoGlyph.json nazca_lines.json

This is what's working for me... I'm sure there are better ways to do it, but this works. I'm posting it here to tie off the question; hopefully someone will find it useful.
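
As a quick sanity check after the load, a proximity query against the legacy index looks something like this (maxDistance should be in kilometers; the target point is just somewhere near the geoglyphs):

SELECT Name, $distance FROM GeoGlyph
WHERE [Latitude,Longitude,$spatial] NEAR [-14.69,-75.12,{"maxDistance": 5}]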
