
Two Spark structured streaming jobs cannot write to same base path

Spark Structured Streaming doesn't allow two structured streaming jobs to write data to the same base directory, which is possible with DStreams.

Since a _spark_metadata directory is created by default for one job, the second job cannot use the same directory as its base path: the _spark_metadata directory has already been created by the other job, so it throws an exception.
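A minimal sketch of the setup that triggers this conflict (the paths, the "rate" placeholder source, and the checkpoint locations are assumptions, not the asker's actual code): both file-sink queries target the same base path, so both try to own the metadata log under that path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-writers").getOrCreate()

// Two independent streaming sources (the "rate" source is only a placeholder).
val stream1 = spark.readStream.format("rate").load()
val stream2 = spark.readStream.format("rate").load()

// Job 1 creates <base-path>/_spark_metadata for its file sink.
stream1.writeStream
  .format("parquet")
  .option("path", "/data/events")                // shared base path
  .option("checkpointLocation", "/tmp/chk/job1")
  .start()

// Job 2 points at the same base path; the metadata log under
// <base-path>/_spark_metadata is already owned by job 1, so this query fails.
stream2.writeStream
  .format("parquet")
  .option("path", "/data/events")                // same base path -> exception
  .option("checkpointLocation", "/tmp/chk/job2")
  .start()
```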

Is there any workaround for this, other than creating separate base paths for the two jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable it without any data loss?

If I had to change the base path for both jobs, my whole framework would be impacted, so I don't want to do that.

No, changing the metadata directory name or location is not possible yet. You can refer to this link for more information.

Could you elaborate on why you would have to change the model of your project in order to change a path? Is the path hardcoded? Or are you reading this data in a particular manner that would be affected?

Edit 1: You can use partitions here. For example, if the data is stored as Parquet you can have partitions in your base path. You can add a column "src" that holds the source of the data, like SW1 for Stream Writer 1 and SW2 for Stream Writer 2.

These will have the following paths in HDFS:

  1. <base-path>/src=SW1
  2. <base-path>/src=SW2

Now each job can write directly to its corresponding partition, and your other jobs can continue reading from the base path.
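A sketch of this workaround under assumed paths and placeholder sources (everything named here is hypothetical): each streaming job writes only under its own partition directory, so each gets its own _spark_metadata log, while readers still see one logical dataset rooted at the base path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-writers").getOrCreate()

// Placeholder sources for the two writers.
val streamDf1 = spark.readStream.format("rate").load()
val streamDf2 = spark.readStream.format("rate").load()

// Stream Writer 1 writes only under <base-path>/src=SW1 and therefore keeps
// its own _spark_metadata log inside that partition directory.
streamDf1.writeStream
  .format("parquet")
  .option("path", "/data/events/src=SW1")
  .option("checkpointLocation", "/tmp/chk/sw1")
  .start()

// Stream Writer 2 writes only under <base-path>/src=SW2.
streamDf2.writeStream
  .format("parquet")
  .option("path", "/data/events/src=SW2")
  .option("checkpointLocation", "/tmp/chk/sw2")
  .start()

// Downstream jobs can still read the whole dataset; the "basePath" option
// lets Spark recover "src" as a partition column from the directory names.
val all = spark.read
  .option("basePath", "/data/events")
  .parquet("/data/events/src=SW1", "/data/events/src=SW2")
```

The design trade-off is that the partition value is encoded in each writer's output path rather than produced by partitionBy, which keeps the two metadata logs fully separate while preserving a single base path for readers.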

