
Two Spark structured streaming jobs cannot write to same base path

Spark Structured Streaming doesn't allow two structured streaming jobs to write data to the same base directory, which is possible with DStreams.

Since a _spark_metadata directory is created by default for one job, the second job cannot use the same directory as its base path: the _spark_metadata directory has already been created by the other job, so it throws an exception.
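A minimal sketch of the setup that triggers this conflict (the paths, the "rate" placeholder source, and the checkpoint locations are assumptions, not the asker's actual code): both file-sink queries target the same base path, so both try to own the metadata log under that path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-writers").getOrCreate()

// Two independent streaming sources (the "rate" source is only a placeholder).
val stream1 = spark.readStream.format("rate").load()
val stream2 = spark.readStream.format("rate").load()

// Job 1 creates <base-path>/_spark_metadata for its file sink.
stream1.writeStream
  .format("parquet")
  .option("path", "/data/events")                // shared base path
  .option("checkpointLocation", "/tmp/chk/job1")
  .start()

// Job 2 points at the same base path; the metadata log under
// <base-path>/_spark_metadata is already owned by job 1, so this query fails.
stream2.writeStream
  .format("parquet")
  .option("path", "/data/events")                // same base path -> exception
  .option("checkpointLocation", "/tmp/chk/job2")
  .start()
```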

Is there any workaround for this, other than creating separate base paths for the two jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable it without any data loss?

If I had to change the base path for both jobs, my whole framework would be impacted, so I don't want to do that.

No, changing the metadata directory name or location is not possible yet. You can refer to this link for more information.

Could you elaborate on why you would have to change the model of your project in order to change a path? Is the path hardcoded? Or are you reading this data in a particular manner that would be affected?

Edit 1: You can use partitions here. For example, if the data is stored as Parquet you can have partitions in your base path. You can add a column "src" that holds the source of the data, like SW1 for Stream Writer 1 and SW2 for Stream Writer 2.

These will have the following paths in HDFS:

  1. <base-path>/src=SW1
  2. <base-path>/src=SW2

Now each job can write directly to its corresponding partition, and your other jobs can continue reading from the base path.
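A sketch of this workaround under assumed paths and placeholder sources (everything named here is hypothetical): each streaming job writes only under its own partition directory, so each gets its own _spark_metadata log, while readers still see one logical dataset rooted at the base path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-writers").getOrCreate()

// Placeholder sources for the two writers.
val streamDf1 = spark.readStream.format("rate").load()
val streamDf2 = spark.readStream.format("rate").load()

// Stream Writer 1 writes only under <base-path>/src=SW1 and therefore keeps
// its own _spark_metadata log inside that partition directory.
streamDf1.writeStream
  .format("parquet")
  .option("path", "/data/events/src=SW1")
  .option("checkpointLocation", "/tmp/chk/sw1")
  .start()

// Stream Writer 2 writes only under <base-path>/src=SW2.
streamDf2.writeStream
  .format("parquet")
  .option("path", "/data/events/src=SW2")
  .option("checkpointLocation", "/tmp/chk/sw2")
  .start()

// Downstream jobs can still read the whole dataset; the "basePath" option
// lets Spark recover "src" as a partition column from the directory names.
val all = spark.read
  .option("basePath", "/data/events")
  .parquet("/data/events/src=SW1", "/data/events/src=SW2")
```

The design trade-off is that the partition value is encoded in each writer's output path rather than produced by partitionBy, which keeps the two metadata logs fully separate while preserving a single base path for readers.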

