
Combining data from different sources in Apache Spark

I am exploring Apache Spark for a project where I want to pull in data from different sources: database tables (Postgres and BigQuery) and text files. The data will be processed and fed into another table for analytics. My language of choice is Java, but I am exploring Python too. Can someone please let me know whether I can read the data directly into Spark for processing? Do I need some kind of connector between the database tables and the Spark cluster?

Thanks in advance.

If, for example, you want to read the content of a BigQuery table, you can do it with these instructions (Python shown):

# 'spark' is an existing SparkSession
words = spark.read.format('bigquery') \
   .option('table', 'bigquery-public-data:samples.shakespeare') \
   .load()

You can refer to this document [1] (it also shows the instructions for Scala).

I recommend trying the wordcount code first to get used to the usage pattern.
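A minimal sketch of that pattern, assuming the BigQuery connector is on the classpath and using the public Shakespeare sample table from the linked tutorial:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()

# Read the public sample table through the BigQuery connector
words = spark.read.format('bigquery') \
   .option('table', 'bigquery-public-data:samples.shakespeare') \
   .load()

# 'word' and 'word_count' are columns of the sample table
word_count = words.groupBy('word').sum('word_count')
word_count.show()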

After that, once your Spark code is ready, you have to create a new cluster in Google Dataproc [2] and run the job there, linking the BigQuery connector (Python example; for --region, pass your cluster's region, e.g. us-central1):

gcloud dataproc jobs submit pyspark wordcount.py \
   --cluster cluster-name \
   --region cluster-region \
   --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

You can find the latest version of the BigQuery connector here [3].

In addition, this GitHub repository has some examples of how to use the BigQuery connector with Spark [4].

With these instructions you should be able to handle both reading from and writing to BigQuery.
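The snippet above only shows a read. A minimal write sketch with the same connector might look like this, where word_count is the frame from the earlier sketch and the output table and temporaryGcsBucket values are hypothetical placeholders (the connector stages the write through a GCS bucket):

# Sketch only: 'my-project.my_dataset.results' and 'my-temp-bucket'
# are placeholder names you would replace with your own
word_count.write.format('bigquery') \
   .option('table', 'my-project.my_dataset.results') \
   .option('temporaryGcsBucket', 'my-temp-bucket') \
   .save()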

[1] https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#running_the_code

[2] https://cloud.google.com/dataproc/docs/guides/create-cluster

[3] gs://spark-lib/bigquery/spark-bigquery-latest.jar

[4] https://github.com/GoogleCloudDataproc/spark-bigquery-connector

You can connect to an RDBMS using JDBC. Spark has a connector for BigQuery as well. Read from all the sources into separate data frames and combine them at the end (assuming they all have the same schema).

Sample pseudo-code for PySpark:

df1 = spark.read.json("s3://test.json")

df2 = spark.read.format("jdbc") \
   .option("url", "jdbc:mysql://xxxx") \
   .option("driver", "com.mysql.jdbc.Driver") \
   .option("dbtable", "name") \
   .option("user", "user") \
   .option("password", "password") \
   .load()

result = df1.union(df2)
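Since the question mentions Postgres specifically, here is a sketch of the same JDBC read against a Postgres table; the host, database, table, and credentials are placeholders, and the org.postgresql driver jar needs to be on the Spark classpath:

# Hypothetical connection details; replace with your own
df_pg = spark.read.format("jdbc") \
   .option("url", "jdbc:postgresql://host:5432/mydb") \
   .option("driver", "org.postgresql.Driver") \
   .option("dbtable", "public.my_table") \
   .option("user", "user") \
   .option("password", "password") \
   .load()

Note that union matches columns by position; if the frames share column names but not column order, unionByName is the safer choice.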
