
Apache Spark read multiple text files in single run

I can successfully load a text file into a DataFrame with the following Apache Spark Scala code:

val df = spark.read.text("first.txt")
  .withColumn("fileName", input_file_name())
  .withColumn("unique_id", monotonically_increasing_id())

Is there any way to provide multiple files in a single run? Something like this:

val df = spark.read.text("first.txt,second.txt,someother.txt")
  .withColumn("fileName", input_file_name())
  .withColumn("unique_id", monotonically_increasing_id())

Right now this code fails with the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: file:first.txt,second.txt,someother.txt;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)

How do I properly load multiple text files?

The function spark.read.text() has a varargs parameter, as shown in the docs:

def text(paths: String*): DataFrame

This means that to read multiple files you only need to pass them to the function as separate, comma-separated arguments, i.e.

val df = spark.read.text("first.txt", "second.txt", "someother.txt")
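A minimal self-contained sketch of this approach, reusing the file names from the question (which are assumed to exist locally); it also shows the common Scala idiom of expanding an existing collection of paths into the varargs parameter with `: _*`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

object ReadMultipleTextFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-multiple-text-files")
      .master("local[*]")
      .getOrCreate()

    // Pass each path as a separate varargs argument, not one comma-joined string.
    val df = spark.read.text("first.txt", "second.txt", "someother.txt")
      .withColumn("fileName", input_file_name())
      .withColumn("unique_id", monotonically_increasing_id())

    // If the paths are already in a collection, expand it into the varargs:
    val paths = Seq("first.txt", "second.txt", "someother.txt")
    val df2 = spark.read.text(paths: _*)

    df.show()
    spark.stop()
  }
}
```

The original error occurred because `"first.txt,second.txt,someother.txt"` was treated as one literal path; as separate arguments, each path is resolved (and globbed) individually.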
