
Moving files from one directory to another directory in HDFS using Pyspark

I am trying to read all the JSON files from one directory and store them in a Spark DataFrame using the code below (it works fine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("hdfs:///user/temp/backup_data/st_in_*/*/*.json", multiLine=True)

But when I try to save the DataFrame, which was built from multiple files, using the code below:

df.write.json("hdfs:///user/another_dir/to_save_dir/")

it doesn't store the files as expected and throws an error saying that to_save_dir already exists.

I just want to save the files to the destination directory exactly as I read them from the source directory.

Edit:

The question is: when I read multiple files and want to write them to a directory, what is the procedure in Pyspark? I am asking because once Spark loads all the files it creates a single DataFrame, and each file is a row in this DataFrame. How should I proceed to create a new file for each of the rows in the DataFrame?

The error you get is quite clear: the location you're trying to write to already exists. You can choose to overwrite it by specifying the mode:

df.write.mode("overwrite").json("hdfs:///user/another_dir/to_save_dir/")
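If the written output should stay grouped by the file each row came from, one possible approach is to tag each row with its source file name and partition the output on that column. This is only a sketch: the function name, the `source_file` column, and `BASENAME_PATTERN` are hypothetical names introduced here, and it assumes `input_file_name()` is available in your Spark version.

```python
# Regex capturing the final path component (the file name) of a URI.
BASENAME_PATTERN = r"([^/]+)$"


def write_grouped_by_source(spark, src_glob, dest_dir):
    """Read JSON files and write the rows back grouped by originating file name."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    df = spark.read.json(src_glob, multiLine=True).withColumn(
        "source_file",  # hypothetical column name
        F.regexp_extract(F.input_file_name(), BASENAME_PATTERN, 1),
    )
    # partitionBy creates one sub-directory per distinct source_file value,
    # e.g. dest_dir/source_file=<name>/part-*.json
    df.write.partitionBy("source_file").mode("overwrite").json(dest_dir)
```

Note that Spark still chooses the part-file names itself, and source files that share a base name in different directories end up in the same partition directory.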

However, if your intent is only to move files from one location to another in HDFS, you don't need to read the files in Spark and then write them. Instead, try using the Hadoop FS API:

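# Reuse the SparkContext behind the existing SparkSession.
sc = spark.sparkContext
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil

# src_folder and dest_folder are the HDFS paths (strings) to move from and to.
src_path = Path(src_folder)
dest_path = Path(dest_folder)

# deleteSource=True removes the source after copying, which makes this a move.
FileUtil.copy(src_path.getFileSystem(conf),
              src_path,
              dest_path.getFileSystem(conf),
              dest_path,
              True,
              conf)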
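To move only the files that match a glob pattern, such as the one used for reading in the question, while keeping their relative directory layout, a minimal sketch could combine `FileSystem.globStatus` with `FileSystem.rename`. The helper names `target_path` and `move_matching` are hypothetical, and the sketch assumes `src_root` and `dest_root` are plain absolute paths (no `hdfs://` scheme) on the same default filesystem.

```python
import posixpath


def target_path(src_file, src_root, dest_root):
    """Map a source file path to the same relative location under dest_root."""
    rel = posixpath.relpath(src_file, src_root)
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{src_file} is not under {src_root}")
    return posixpath.join(dest_root, rel)


def move_matching(spark, src_root, pattern, dest_root):
    """Move every file matching src_root/pattern under dest_root; return new paths."""
    sc = spark.sparkContext
    conf = sc._jsc.hadoopConfiguration()
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    fs = Path(src_root).getFileSystem(conf)

    moved = []
    # globStatus can return null (None in Python) when nothing matches.
    statuses = fs.globStatus(Path(posixpath.join(src_root, pattern))) or []
    for status in statuses:
        src = status.getPath()
        # toUri().getPath() drops the scheme and authority from the qualified path.
        dst = Path(target_path(src.toUri().getPath(), src_root, dest_root))
        fs.mkdirs(dst.getParent())
        if fs.rename(src, dst):
            moved.append(dst.toString())
    return moved
```

With the paths from the question, a call might look like `move_matching(spark, "/user/temp/backup_data", "st_in_*/*/*.json", "/user/another_dir/to_save_dir")`.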
