
How can I iterate over json files in Code Repositories and incrementally append to a dataset

I have imported a dataset with 100,000 raw json files of about 100 GB through Data Connection into Foundry. I want to use the Python Transforms raw file access pattern to read the files and flatten the arrays of structs and structs into a dataframe, as an incremental update to the dataset. I want to use something like the example below from the documentation, adapted for *.json files, and also turn it into an incremental transform using the @incremental() decorator.

>>> import csv
>>> from pyspark.sql import Row
>>> from transforms.api import transform, Input, Output
>>>
>>> @transform(
...     processed=Output('/examples/hair_eye_color_processed'),
...     hair_eye_color=Input('/examples/students_hair_eye_color_csv'),
... )
... def example_computation(hair_eye_color, processed):
...
...    def process_file(file_status):
...        with hair_eye_color.filesystem().open(file_status.path) as f:
...            r = csv.reader(f)
...
...            # Construct a pyspark.Row from our header row
...            header = next(r)
...            MyRow = Row(*header)
...
...            for row in csv.reader(f):
...                yield MyRow(*row)
...
...    files_df = hair_eye_color.filesystem().files('**/*.csv')
...    processed_df = files_df.rdd.flatMap(process_file).toDF()
...    processed.write_dataframe(processed_df)
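
For *.json files the same pattern could be adapted roughly as follows. This is only a sketch: the paths are placeholders, it assumes each file holds either a single flat JSON object or a list of such objects with a consistent set of keys, and @incremental() is expected to make files() return only the files added since the last build while write_dataframe appends to the output. Nested fields would still need a flatten step afterwards.

import json

from pyspark.sql import Row
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    processed=Output('/examples/json_processed'),    # placeholder path
    raw_json=Input('/examples/raw_json_files'),      # placeholder path
)
def example_computation(raw_json, processed):

    def process_file(file_status):
        # Runs on the executors; each task parses only the files it is handed
        with raw_json.filesystem().open(file_status.path) as f:
            data = json.load(f)
        records = data if isinstance(data, list) else [data]
        for record in records:
            yield Row(**record)

    files_df = raw_json.filesystem().files('**/*.json')
    processed_df = files_df.rdd.flatMap(process_file).toDF()
    processed.write_dataframe(processed_df)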

With the help of @Jeremy David Gamet I was able to develop the code to get the dataset I want.

from transforms.api import transform, Input, Output
import json


@transform(
    out=Output('foundry/outputdataset'),
    inpt=Input('foundry/inputdataset'),
)
def update_set(ctx, inpt, out):
    spark = ctx.spark_session
    sc = spark.sparkContext

    # Read and parse every file on the driver, collecting the results into one list
    file_dates = []
    for file_status in list(inpt.filesystem().ls()):
        with inpt.filesystem().open(file_status.path, 'r', encoding='utf-8-sig') as fi:
            data = json.load(fi)
        file_dates.append(data)

    # Re-serialise the whole collection and let Spark parse it back into a dataframe
    json_object = json.dumps(file_dates)
    df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))

    df_2 = df_2.drop_duplicates()
    # flatten the array and struct columns (flatten helper shown below)
    df_2 = flatten(df_2)
    out.write_dataframe(df_2)

Code to flatten the df (not reproduced here):
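
Since that helper is not included in the post, here is a generic sketch of one common way to flatten nested structs and arrays of structs (not the poster's actual code, and assuming no column-name collisions): expand struct fields into prefixed top-level columns, explode array columns, and repeat until no complex columns remain.

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType


def flatten(df: DataFrame) -> DataFrame:
    """Flatten nested structs and arrays of structs into plain columns."""
    def complex_fields(schema):
        return {f.name: f.dataType for f in schema.fields
                if isinstance(f.dataType, (ArrayType, StructType))}

    fields = complex_fields(df.schema)
    while fields:
        name, dtype = next(iter(fields.items()))
        if isinstance(dtype, StructType):
            # Promote each struct field to a top-level column, prefixed with the parent name
            expanded = [F.col(name + '.' + child.name).alias(name + '_' + child.name)
                        for child in dtype.fields]
            df = df.select('*', *expanded).drop(name)
        else:
            # ArrayType: one output row per array element; explode_outer keeps empty arrays
            df = df.withColumn(name, F.explode_outer(name))
        fields = complex_fields(df.schema)
    return df

Note that explode_outer produces one output row per array element, so flattening several independent array columns this way multiplies the row count; that is inherent to the approach rather than anything Foundry-specific.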

The above code works for a few files, but since there are more than 100,000 files I am hitting the following error:

Connection To Driver Lost 

This error indicates that connection to the driver was lost unexpectedly, which is often caused by the driver being terminated due to running out of memory. Common reasons for driver out-of-memory (OOM) errors include functions that materialize data to the driver such as .collect(), broadcasted joins, and using Pandas dataframes.

Any way around this?
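
One possible way around the driver OOM is to keep both the file reading and the JSON parsing on the executors, as the documentation example above already does, instead of calling json.load on all 100,000 files on the driver and re-serialising them into a single string. A minimal sketch, reusing the same input/output paths and assuming each file holds one JSON document:

from transforms.api import transform, Input, Output


@transform(
    out=Output('foundry/outputdataset'),
    inpt=Input('foundry/inputdataset'),
)
def update_set(ctx, inpt, out):
    spark = ctx.spark_session

    def read_file(file_status):
        # Executed on the executors: each task reads only its own slice of files
        with inpt.filesystem().open(file_status.path, 'r', encoding='utf-8-sig') as f:
            yield f.read()   # one raw JSON document per record

    files_df = inpt.filesystem().files('**/*.json')
    raw_rdd = files_df.rdd.flatMap(read_file)

    # Spark parses the JSON strings in parallel; the driver never materialises the data
    df_2 = spark.read.json(raw_rdd)
    df_2 = flatten(df_2)          # e.g. a helper like the flatten sketch above
    df_2 = df_2.drop_duplicates()
    out.write_dataframe(df_2)

With this layout the driver only handles the file listing and the query plan, so its memory use no longer grows with the number of files.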

I have given an example of how this can be done dynamically as an answer to another question.

Here is the link to that code answer: How to union multiple dynamic inputs in Palantir Foundry? And a copy of the same code:

from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)

            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())

            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)
        transforms.append(update_set)
    return transforms


TRANSFORMS = transform_generator()
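
For the generated transforms to be built, the TRANSFORMS list normally has to be registered with the project's Pipeline object. A sketch of that registration, with a hypothetical module name and assuming Pipeline.add_transforms is available in your transforms version:

# pipeline.py (sketch)
from transforms.api import Pipeline

from myproject import datasets   # hypothetical module that defines TRANSFORMS

my_pipeline = Pipeline()
my_pipeline.add_transforms(*datasets.TRANSFORMS)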

Please let me know if there is anything I can clarify.
