
Loading data from SQL Server to S3 as parquet - AWS EMR

We have our data in SQL Server at the moment, and we are trying to move it to our S3 bucket as Parquet files. The intention is to analyse this S3 data in AWS EMR (mainly Spark, Hive & Presto). We don't want to store our data in HDFS.

  1. What are the choices here? As far as we know, it seems we can use either Spark or Sqoop for this import. Though Sqoop is faster than Spark in this case due to parallelism (parallel DB connections), it seems writing Parquet files from Sqoop to S3 is not possible - Sqoop + S3 + Parquet results in Wrong FS error. The workaround is to move the data to HDFS first and then to S3, but this seems inefficient. How about using SparkSQL to pull this data from SQL Server and write it as Parquet to S3?

  2. Once we load this data as parquet in this format

     s3://mybucket/table_a/day_1/(parquet files 1 ... n)
     s3://mybucket/table_a/day_2/(parquet files 1 ... n)
     s3://mybucket/table_a/day_3/(parquet files 1 ... n)

How can I combine them into a single table and query it using Hive? I understand that we can create a Hive external table pointing to S3, but can we point it to multiple files?

Thanks.

EDIT: Adding this as requested.

org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Spark's JDBC read can pull the data using multiple connections. Here is the relevant method (links to the docs are below):

def jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties): DataFrame

Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function.

Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.

url
JDBC database url of the form jdbc:subprotocol:subname.

table
Name of the table in the external database.

columnName
the name of a column of integral type that will be used for partitioning.

lowerBound
the minimum value of columnName used to decide partition stride.

upperBound
the maximum value of columnName used to decide partition stride.

numPartitions
the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly. When the input is less than 1, the number is set to 1.

connectionProperties
JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "fetchsize" can be used to control the number of rows per fetch.

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
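A minimal sketch of that call, for illustration only - the JDBC URL, credentials, table name, id column and bounds below are placeholders, and the id column is assumed to be an integral key:

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sqlserver-to-s3").getOrCreate()

val props = new Properties()
props.setProperty("user", "my_user")         // placeholder credentials
props.setProperty("password", "my_password")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Read the table in parallel: 8 partitions, each fetching one slice of the id range
// over its own JDBC connection.
val df = spark.read.jdbc(
  "jdbc:sqlserver://myhost:1433;databaseName=mydb", // placeholder JDBC URL
  "dbo.table_a",                                    // placeholder table name
  "id",       // integral column used to split the read
  1L,         // lowerBound: assumed minimum id
  1000000L,   // upperBound: assumed maximum id
  8,          // numPartitions = number of parallel DB connections
  props)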

Create a Hive table partitioned by a date column, and specify the following location:

create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
partitioned by (`date` string)
stored as parquet
location 's3://mybucket/table_name/';

Add a column called date to your data and populate it with sysdate. You don't need to add the column if it is not required; you can just populate the location directly. But it can also serve as an audit column for your analytics. Use Spark: dataframe.write.partitionBy("date").parquet("s3://mybucket/table_name/"), as sketched below.
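A minimal sketch of that write, assuming the DataFrame df from the JDBC read sketched above and a placeholder bucket:

import org.apache.spark.sql.functions.current_date

// Add the audit/partition column and write one Parquet directory per date value,
// e.g. s3://mybucket/table_name/date=2020-01-01/part-*.parquet
val withDate = df.withColumn("date", current_date().cast("string"))

withDate.write
  .mode("append")
  .partitionBy("date")
  .parquet("s3://mybucket/table_name/")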

Perform an MSCK REPAIR on the Hive table daily, so that the new partitions get added to the table, for example:
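As a daily step after the load (assuming the same SparkSession with Hive support enabled, so it talks to the same metastore):

// Register any newly written date=... partition directories with the metastore.
spark.sql("MSCK REPAIR TABLE table_name")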

To apply numPartitions to a non-numeric column, hash that column into the number of connections you want (on the database side) and use the derived numeric column for partitioning, as sketched below.
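A sketch of that idea, reusing the placeholder spark, props and connection details from the sketch above and using T-SQL's CHECKSUM to bucket a hypothetical string key some_string_key on the database side:

val numPartitions = 8
// Wrap the source table in a subquery that adds a numeric bucket column,
// then let Spark split the read on that bucket.
val src =
  s"""(SELECT t.*, ABS(CHECKSUM(t.some_string_key)) % $numPartitions AS part_bucket
      FROM dbo.table_a t) AS src"""

val hashedDf = spark.read.jdbc(
  "jdbc:sqlserver://myhost:1433;databaseName=mydb", // placeholder JDBC URL
  src,
  "part_bucket",          // the derived numeric column
  0L,                     // buckets range from 0 ...
  numPartitions.toLong,   // ... up to numPartitions - 1
  numPartitions,
  props)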

Though I am a little late, this is for future reference. In our project we are doing exactly this, and I would prefer Sqoop over Spark.

Reason: I used Glue to read data from MySQL into S3 and the reads are not parallel (I had AWS Support look at it, and that's how Glue, which uses PySpark, works; the write to S3 once the read is complete is parallel). This is not efficient and it is slow: reading 100 GB of data and writing it to S3 takes 1.5 hours.

So I used Sqoop on EMR with the Glue Catalog turned on (so the Hive metastore is on AWS), and I am able to write to S3 directly from Sqoop, which is way faster: a 100 GB data read takes 20 minutes.

You will have to set hive.metastore.warehouse.dir=s3:// and you should see your data being written to S3 if you do a hive-import or just a direct write.

Spark is a pretty good utility tool. You can easily connect to a JDBC data source, and you can write to S3 by specifying credentials and an S3 path (e.g. Pyspark Save dataframe to S3).

If you're using AWS, your best bet for Spark, Presto and Hive is to use the AWS Glue Metastore. This is a data catalog that registers your S3 objects as tables within databases, and provides an API for locating those objects.
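For example, on an EMR cluster where the Glue Data Catalog integration is enabled (a cluster-level configuration, not shown here), Spark picks it up through its normal Hive support, and anything registered there is equally visible to Hive and Presto:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("glue-catalog")
  .enableHiveSupport() // on such a cluster this resolves tables via the Glue Data Catalog
  .getOrCreate()

// Tables registered in the catalog (e.g. the external table defined below)
// can be queried directly.
spark.sql("SHOW TABLES").show()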

The answer to your Q2 is yes, you can have a table that refers to multiple files. You'd normally want to do this if you have partitioned data.

You can create the Hive external table as follows:

create external table table_a (
 siteid                    string,
 nodeid                    string,
 aggregation_type          string
 )
 PARTITIONED BY (day string)
 STORED AS PARQUET
 LOCATION 's3://mybucket/table_a';

Then you can run the following command to register the partition files stored under each day's directory with the Hive metastore:

 MSCK REPAIR TABLE table_a;

Now you can access your files through Hive queries. We have used this approach in our project and it works well. After the above command, you can run the query:

 select * from table_a where day='day_1';

Hope this helps.

-Ravi
