Reading PySpark JSON files and concatenating them

Question

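I have 999 gz files in an S3 bucket. I want to read them all and convert the PySpark dataframe to a pandas dataframe, but that is not possible because the files are large. I am trying a different approach: read each .gz file, convert it to a pandas df, reduce the number of columns, and then concatenate everything into one large pandas df.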

spark_df = spark.read.json(f"s3a://my_bucket/part-00000.gz")

part-00000.gz is gzipped JSON; 00000 is the first part and 00999 is the last. Could you help me decompress them all and then concatenate the pandas dfs?

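Logic:

Read all the JSON files:
spark_df = spark.read.json(f"s3a://my_bucket/part-00{}.gz")
Convert to pandas:
pandas_df = spark_df.toPandas()
Reduce the columns (only a few are needed):
pandas_df = pandas_df[["col1","col2","col3"]]
Merge all 999 pandas dfs into one: full_df = pd.concat(...) (a for loop over all the pandas dataframes)

This is the logic I have in my head, but I am struggling to write the code.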

Edit: I started writing the code, but it does not display pandas_df:

for i in range(10,11):
    df_to_predict = spark.read.json(f"s3a://my_bucket/company_v20_dl/part-000{i}.gz")
    df_to_predict = df_to_predict.select('id','summary', 'website')
    df_to_predict = df_to_predict.withColumn('text', lower(col('summary')))
    df_to_predict = df_to_predict.select('id','text', 'website')
    df_to_predict = df_to_predict.withColumn("text_length", length("text"))
    df_to_predict.show()
    pandas_df = df_to_predict.toPandas()
    pandas_df.head()

I also noticed that this approach will not work for part-00001 / part-00100 and so on, because range does not zero-pad the numbers.
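The zero-padding problem mentioned above can be handled with a format specifier. A minimal sketch, assuming the part numbers are always five digits wide as in part-00000.gz; the helper name `part_path` is hypothetical, and the bucket and folder are copied from the question:

```python
def part_path(i: int, base: str = "s3a://my_bucket/company_v20_dl") -> str:
    """Build the path of part number i, zero-padded to five digits."""
    return f"{base}/part-{i:05d}.gz"


# A few indices of different widths, all padded to the same length.
paths = [part_path(i) for i in (1, 10, 100, 999)]
for p in paths:
    print(p)
```

The `:05d` spec pads with leading zeros to a width of five, so the same loop variable works for every part.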

Answer 1

Your final dataframe is 200k rows * 4 columns * 999 files ~= 200M rows * 4 columns, which is still a large dataset for pandas.
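The estimate above can be checked with a few lines of arithmetic. This is a rough sketch: the 200k rows per file and the 4 columns come from this answer, while the 8 bytes per cell is an assumption that only suits numeric columns; text columns such as a summary would take considerably more.

```python
rows_per_file = 200_000   # figure used in this answer
n_files = 999
n_columns = 4
bytes_per_cell = 8        # assumption: one 64-bit value per cell

total_rows = rows_per_file * n_files
total_cells = total_rows * n_columns
approx_gb = total_cells * bytes_per_cell / 1e9

print(f"{total_rows:,} rows, {total_cells:,} cells, about {approx_gb:.1f} GB")
```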

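PySpark can now run pandas code (distributed). Unless there is a specific reason not to, I suggest keeping the data in a PySpark dataframe, or converting it to a pandas-on-Spark dataframe if you need specific pandas operations.

Pseudocode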

df = spark.read.json("s3a://my_bucket")  # Read everything; Spark distributes the data and applies the operations.
df = df.select('col1', 'col2', 'col3')
# df = df.withColumn('something', ...)

# Convert to a pandas-on-Spark dataframe, which supports pandas operations but stays distributed.
# Available since Spark 3.2; from Spark 3.3 on, df.pandas_api() is the preferred spelling.
pdf = df.to_pandas_on_spark()
pdf.groupby('col1').min()

Reference: https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/pandas_pyspark.html

Answer 2

Here is a possible solution:

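import boto3

session = boto3.Session()
s3 = session.resource('s3')
my_bucket = s3.Bucket("bucket-name")
# Prefix is the path to the folder where the parts are located.
for obj in my_bucket.objects.filter(Prefix="/output/"):
    # Name each download after its key so the files do not overwrite each other.
    my_bucket.download_file(obj.key, f"download/{obj.key.split('/')[-1]}")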

Reference: Listing contents of a bucket with boto3

Convert each of them to a dataframe and append it to the root dataframe. I hope this helps.
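One way the "convert and append" step could look once the parts are on local disk. This is only a sketch, assuming each part is gzipped JSON Lines (one JSON object per line, which is Spark's default JSON output); the function name `load_parts` and the `download/` folder are hypothetical, and the column names are taken from the question.

```python
import glob
import gzip
import json


def load_parts(pattern, columns):
    """Read gzipped JSON Lines files matching pattern, keeping only the given columns."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # Missing columns become None rather than raising.
                rows.append({c: record.get(c) for c in columns})
    return rows


if __name__ == "__main__":
    rows = load_parts("download/*.gz", ["id", "summary", "website"])
    print(len(rows), "rows loaded")
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` in one call, which avoids growing a dataframe inside the loop.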


Solution 1
2 2022-12-15 21:07:24

Solution 2
1 2022-12-15 20:37:29
