
Migrate data from Schema-free database to Relational Database: MongoDB to Snowflake

We had a use case that led me to write this, and I am sure many of you will have faced this situation. The task was to migrate multiple collections from MongoDB into a Snowflake database through a single Talend job, while retaining the top-level nodes of each collection as individual fields in the Snowflake table.

Now, as we know, Talend does not support a dynamic schema for MongoDB sources, because MongoDB collections do not enforce a schema. This means we would have to create separate jobs/sub-jobs for every existing or new collection we want to ingest, and redesign those jobs for any future alterations in the documents while ensuring they keep working. So we had to look into an alternative solution.

Here is the approach:

Step One : Get all the top-level keys and their types from the MongoDB collection. We used an aggregation with $objectToArray to convert all top-level key/value pairs into document arrays, followed by $unwind and $group with $addToSet to get the distinct keys and value types across the entire collection.

 {
    "_id" : "1",
    "keys" : [
        "field1~string",
        "field2~object",
        "field3~date",
        "_id~objectId"
    ]
 }
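The aggregation in this step can be sketched in the mongo shell as follows (a sketch assuming a collection named `mycollection`; the `~` separator is just the convention used by this job, and the `$type` aggregation operator requires MongoDB 3.4+):

```javascript
db.mycollection.aggregate([
  // Turn each document's top-level key/value pairs into an array of {k, v}
  { $project: { kv: { $objectToArray: "$$ROOT" } } },
  // Emit one document per key/value pair
  { $unwind: "$kv" },
  // Build "key~type" strings and collect the distinct ones
  { $group: {
      _id: "1",
      keys: { $addToSet: { $concat: [ "$kv.k", "~", { $type: "$kv.v" } ] } }
  } }
])
```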

Step Two : Create a one-to-one map between MongoDB data types and Snowflake data types. We created a hash map called "dataTypes" to store this information. Alternatively, this information could be stored in a table or a file, etc.

 java.util.Map<String,String> dataTypes = new java.util.HashMap<String,String>();
 dataTypes.put("string","VARCHAR");
 dataTypes.put("int","NUMBER");
 dataTypes.put("objectId","VARCHAR");
 dataTypes.put("object","VARIANT");
 dataTypes.put("date","TIMESTAMP_LTZ");
 dataTypes.put("array","VARCHAR");
 dataTypes.put("bool","BOOLEAN");

Step Three : Compare the keys against Snowflake. First we query the Snowflake INFORMATION_SCHEMA to check whether the table exists. If it does not exist, we create it; if it does exist, we check for changes in the documents' fields and add or modify the corresponding columns in the Snowflake table. The DDL script is generated by using the data type mapping from Step Two and iterating over the keys from Step One.
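The CREATE TABLE case of that DDL generation can be sketched in Java like this (a minimal sketch; `my_collection` and the sample keys are placeholders, and the map is a subset of the Step Two mapping):

```java
import java.util.*;

public class DdlGenerator {

    // Build a CREATE TABLE statement from "name~mongoType" keys
    static String buildDdl(String tableName, List<String> keys,
                           Map<String, String> dataTypes) {
        StringJoiner cols = new StringJoiner(", ");
        for (String key : keys) {
            String[] parts = key.split("~");
            cols.add(parts[0] + " " + dataTypes.get(parts[1]));
        }
        return "CREATE TABLE IF NOT EXISTS " + tableName + " (" + cols + ")";
    }

    public static void main(String[] args) {
        Map<String, String> dataTypes = new HashMap<>();
        dataTypes.put("string", "VARCHAR");
        dataTypes.put("object", "VARIANT");
        dataTypes.put("date", "TIMESTAMP_LTZ");
        dataTypes.put("objectId", "VARCHAR");

        List<String> keys = Arrays.asList("field1~string", "field2~object",
                                          "field3~date", "_id~objectId");
        System.out.println(buildDdl("my_collection", keys, dataTypes));
    }
}
```

Altering an existing table follows the same pattern, emitting ALTER TABLE ... ADD/MODIFY COLUMN statements instead.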

Step Four : Unload data from MongoDB to the local filesystem using the mongoexport command:

mongoexport --db <databaseName> --collection <collectionName> --type=csv --fields=<fieldList> --out <filename>

The <fieldList> is prepared from the keys in Step One.
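Stripping the type suffix from those keys yields the comma-separated field list, for example (a sketch; the sample keys are placeholders):

```java
import java.util.*;
import java.util.stream.Collectors;

public class FieldListBuilder {

    // Drop the "~mongoType" suffix and join the key names for --fields
    static String buildFieldList(List<String> keys) {
        return keys.stream()
                   .map(k -> k.split("~")[0])
                   .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("field1~string", "field2~object",
                                          "field3~date", "_id~objectId");
        System.out.println(buildFieldList(keys)); // field1,field2,field3,_id
    }
}
```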

Step Five : Stage the .csv file from the local filesystem to a Snowflake staging location with the PUT command, using SnowSQL:

snowsql -d <database> -s <schema> -o exit_on_error=true -o log_level=DEBUG  -q  'put <fileName> @<internalStage> OVERWRITE=TRUE';

Step Six : Load the data from the staging location into the Snowflake table:

COPY INTO <tableName> FROM @<internalStage> 
[file_format=<fileFormat>] [pattern=<regex_pattern>]

Specifying the file_format and pattern is optional here. We used a regular expression because we stage multiple files for each collection in a single Snowflake stage.
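For example, if every staged file for a collection carries the collection name as a prefix, the command might look like this (a sketch; the table, stage, and format options are placeholders):

```
COPY INTO my_collection FROM @my_internal_stage
file_format = (type = csv skip_header = 1 field_optionally_enclosed_by = '"')
pattern = '.*my_collection.*[.]csv'
```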

Step Seven : Maintain a list of collections. The list can be placed in a file on the local filesystem or in a database table. In the Talend job, iterate over the list of collections and process each collection through the steps above by parametrizing the collection names, table names, file names, staging names, etc. in the job.

One solution is to load the records of your MongoDB collection into a Snowflake field of variant type. Then, create a Snowflake view that extracts the specific keys using Snowflake's dot notation.

Export your data as JSON:

mongoexport --type=json --out <filename>

Load that export into a table with a structure like the following:

create table collection_name_exports (
  data variant,  -- This column will contain your export
  inserted_at datetime default current_timestamp()
);

Extract the keys into the columns of a view as needed:

create view collection_name_view as
select
  data:key1 as field1,
  data:key2 as field2
from collection_name_exports;

