
Create a dataframe containing schema data for each file

I am trying to create a dataframe and then run a for loop over a list of files. For each file, the loop should add a row to the dataframe containing the file name and its schema details.

from pyspark.sql.types import StructType, StructField, StringType

# Schema
schema = StructType([
    StructField("filename", StringType(), True),
    StructField("converteddate", StringType(), True),
    StructField("eventdate", StringType(), True)
])


# Create empty dataframe
df = spark.createDataFrame(sc.emptyRDD(), schema)


for files in mvv_list:
    loadName = files
    videoData = spark.read\
                     .format('parquet')\
                     .options(header='true', inferSchema='true')\
                     .load(loadName)
    dataTypeList = videoData.dtypes
    two = dataTypeList[:2]
    print(loadName)
    print(two)

#mnt/master-video/year=2018/month=03/day=24/part-00004-tid-28948428924977-e0fc2-c85b-4296-8a05-94c5af6-2427-c000.snappy.parquet
#[('converteddate', 'timestamp'), ('eventdate', 'timestamp')]

#mnt/master-video/year=2017/month=05/day=12/part-00004-tid-2894842977-e0f21c2-c85b-4296-8a05-94c5af6-2427-c000.snappy.parquet
#[('converteddate', 'timestamp'), ('eventdate', 'date')]

#mnt/master-video/year=2016/month=03/day=24/part-00004-tid-2884924977-e0f2512-c8b-4296-8a05-945a6-2427-c000.snappy.parquet
#[('converteddate', 'timestamp'), ('eventdate', 'string')]

I am struggling to create a row and append it to the dataframe.

Wanted output:

+-----------------------------+-----------------+---------------------+
|filename                     |converteddate    |eventdate            |
+-----------------------------+-----------------+---------------------+
|mnt/master-video/year=2018...|timestamp        |timestamp            |
|mnt/master-video/year=2017...|timestamp        |date                 |
|mnt/master-video/year=2016...|timestamp        |string               |
+-----------------------------+-----------------+---------------------+

One way is to build the desired data as a list and create the DataFrame afterwards, instead of trying to append rows:

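from pyspark.sql.types import StructType, StructField, StringType

# Collect one (filename, converteddate type, eventdate type) tuple per file
data = []
for loadName in mvv_list:
    # Parquet files store their own schema, so the CSV-style
    # header/inferSchema options are not needed here
    videoData = spark.read.parquet(loadName)
    # dtypes is a list of (column name, type name) pairs; look types up by name
    dataTypeDict = dict(videoData.dtypes)
    data.append((loadName, dataTypeDict['converteddate'], dataTypeDict['eventdate']))

schema = StructType([
    StructField("filename", StringType(), True),
    StructField("converteddate", StringType(), True),
    StructField("eventdate", StringType(), True)
])

# Build the DataFrame once, from the complete list of rows
df = spark.createDataFrame(data, schema)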
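
The dictionary lookup above raises a KeyError if a file lacks one of the two columns. Below is a minimal sketch of one way to guard against that. It assumes a hypothetical helper named schema_row, which is not part of the original answer. The helper takes a list shaped like DataFrame.dtypes and returns None for any missing column. The path in the example is made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def schema_row(
    filename: str,
    dtypes: Sequence[Tuple[str, str]],
    columns: Sequence[str] = ("converteddate", "eventdate"),
) -> Tuple[Optional[str], ...]:
    """Build one output row: the file name followed by the type of each column.

    `dtypes` has the same shape as DataFrame.dtypes: (name, type) pairs.
    Columns absent from the file yield None instead of raising KeyError.
    """
    type_by_name = dict(dtypes)
    return (filename,) + tuple(type_by_name.get(col) for col in columns)


# Hand-written dtypes standing in for videoData.dtypes (hypothetical values)
example = schema_row(
    "mnt/master-video/example.parquet",  # hypothetical path
    [("converteddate", "timestamp"), ("eventdate", "date"), ("other", "string")],
)
print(example)
```

Inside the loop, data.append(schema_row(loadName, videoData.dtypes)) could then replace the direct dictionary lookup. Because every field in the schema is declared nullable (the third argument is True), rows containing None should still be accepted by createDataFrame.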
