
Spark - How to Read Multiple Json Files With Filename From S3

I have a lot of line-delimited JSON files in S3 and want to read all of those files in Spark, then read each line of the JSON and output a Dict/Row for that line with the filename as a column. How would I go about doing this in Python in an efficient manner? Each JSON file is approximately 200 MB.

Here is an example of a file (there would be 200,000 rows like this); call this file class_scores_0219:

{"name": "Maria C", "class":"Math", "score":"80", "student_identification":22}
{"name": "Maria F", "class":"Physics", "score":"90", "student_identification":12}
{"name": "Fink", "class":"English", "score":"75", "student_identification":7}

The output DataFrame would be (for simplicity just showing one row):

+-------------------+---------+-------+-------+------------------------+
|     file_name     |  name   | class | score | student_identification |
+-------------------+---------+-------+-------+------------------------+
| class_scores_0219 | Maria C | Math  |    80 |                     22 |
+-------------------+---------+-------+-------+------------------------+

I have set the S3 secret key / access key using this: sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY) (same thing for the access key), but I can connect in a different way if need be.
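For reference, a minimal sketch of how the credentials could be wired up with the newer s3a connector instead of s3n (the bucket path and the ACCESS_KEY/SECRET_KEY names are placeholders, not values from this question):

# sketch: configure s3a credentials on the active SparkContext
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", SECRET_KEY)

# then read with an s3a:// path
df = spark.read.json("s3a://my-bucket/scores/")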

I am open to whatever option is the most efficient; I can supply the list of files and feed that in, or I can connect with boto3 and supply a prefix. I am new to Spark, so I appreciate all assistance.

You can achieve this by using Spark itself.

Just add a new column using input_file_name() and you will get the required result:

from pyspark.sql.functions import input_file_name
df = spark.read.json(path_to_your_folder_containing_multiple_files)
df = df.withColumn('fileName',input_file_name())
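Note that input_file_name() returns the full URI (e.g. s3a://my-bucket/scores/class_scores_0219.json). If you only want the base name, as in the desired output, one way (a sketch, assuming the files end in .json) is to strip the path and extension with regexp_extract:

from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.json('s3a://my-bucket/scores/')  # hypothetical path
# keep only the base file name, without the directory or the .json extension
df = df.withColumn('file_name', regexp_extract(input_file_name(), r'([^/]+)\.json$', 1))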

If you want to read specific files, you can pass them as a list of paths:

files = [file1, file2, file3]
df = spark.read.json(files)

Or, if your files match a wildcard pattern, you can use it like below:

df = spark.read.json('path/to/file/load2020*.json')

Or you can use boto3 to list all the objects under a prefix, then build a list of the required files and pass that list to spark.read.json.
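For example, a rough sketch using boto3's list_objects_v2 paginator (the bucket name and prefix are placeholders):

import boto3
from pyspark.sql.functions import input_file_name

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# collect the s3a paths of all .json objects under the prefix
files = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='scores/'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.json'):
            files.append('s3a://my-bucket/' + obj['Key'])

df = spark.read.json(files)
df = df.withColumn('file_name', input_file_name())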

Hope it helps.
