

Loading data into Titan database

I have a set of log data in the form of flat files from which I want to form a graph (based on information in the log) and load it into the Titan database. This data is a few gigabytes in size. I am exploring the bulk loading options Faunus and BatchGraph (which I read about in https://github.com/thinkaurelius/titan/wiki/Bulk-Loading). The tab-separated log data I have needs a bit of processing on each line of the file to form the graph nodes and edges I have in mind. Will Faunus/BatchGraph serve this use case? If yes, what format should my input file be in for these tools to work? If not, is using the Blueprints API the way to go? Any resources you can share on your suggestions are very much appreciated, since I'm a novice. Thanks!

To answer your question simply: I think you will want to use Faunus to load your data. I would recommend cleaning and transforming your data with external tools first if possible. Tab-delimited is a fine format, but how you prepare these files can have an impact on loading performance (e.g. sometimes simply sorting the data the right way can provide a big speed boost).
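As an illustration of that pre-processing idea, here is a minimal sketch in Java that sorts a tab-delimited log file by its first column before loading. The file names and the assumption that column 0 holds the source-vertex identifier are hypothetical; it also reads the whole file into memory, so it is only suitable for a small slice of the data, not the full multi-gigabyte set:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortLogByVertex {
    public static void main(String[] args) throws IOException {
        // Read the raw tab-delimited log (column 0 assumed to be the source-vertex id).
        List<String> lines = Files.readAllLines(Paths.get("raw-log.tsv"), StandardCharsets.UTF_8);

        // Sort by the source-vertex id so edges touching the same vertex are adjacent,
        // which tends to reduce random access during bulk loading.
        List<String> sorted = lines.stream()
                .filter(l -> !l.trim().isEmpty())
                .sorted(Comparator.comparing(l -> l.split("\t", -1)[0]))
                .collect(Collectors.toList());

        Files.write(Paths.get("sorted-log.tsv"), sorted, StandardCharsets.UTF_8);
    }
}
```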

The more complete answer lies in these two resources. They should help you decide on an approach:

http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/
http://thinkaurelius.com/2014/06/02/powers-of-ten-part-ii/

I would offer this additional advice: if you are truly a novice, I recommend that you find some slice of your data that produces somewhere between 100K and 1M edges. Focus on simply loading that with BatchGraph or just the Blueprints API, as described in Part I of those blog posts (a rough sketch of what that can look like follows below). Get used to Gremlin a bit by querying the data in this small case. Use this time to develop methods for validating what you've loaded. Once you feel comfortable with all of that, then work on scaling it up to the full size.
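As a concrete starting point, here is a minimal sketch of loading such a slice through the Blueprints BatchGraph wrapper around Titan (assuming Titan 0.x with Blueprints 2.x). The configuration path, input file, column layout, edge label, and property key are all assumptions for illustration, not something from the original question:

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph;
import com.tinkerpop.blueprints.util.wrappers.batch.VertexIDType;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadSlice {
    public static void main(String[] args) throws IOException {
        // Hypothetical Titan configuration file.
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");

        // BatchGraph buffers mutations and commits every 10,000 operations,
        // using the external string ids from the log as vertex ids.
        BatchGraph<TitanGraph> bgraph =
                new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 10000);

        try (BufferedReader reader =
                     Files.newBufferedReader(Paths.get("sorted-log.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Hypothetical layout: sourceId <TAB> targetId <TAB> timestamp
                String[] cols = line.split("\t", -1);
                Vertex out = getOrCreate(bgraph, cols[0]);
                Vertex in = getOrCreate(bgraph, cols[1]);
                // Set the property immediately: BatchGraph only allows mutating
                // the most recently created element.
                out.addEdge("logged", in).setProperty("timestamp", cols[2]);
            }
        }

        // Flushes any remaining buffered mutations and shuts down the wrapped Titan graph.
        bgraph.shutdown();
    }

    private static Vertex getOrCreate(BatchGraph<TitanGraph> g, String id) {
        Vertex v = g.getVertex(id);
        return (v != null) ? v : g.addVertex(id);
    }
}
```

Once a slice like this is loaded, you can open the graph in the Gremlin console and run simple traversals against it to spot-check counts and properties before attempting the full data set.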
