Read fixed width file using schema from json file in pyspark

I have a fixed-width file as below:

00120181120xyz12341
00220180203abc56792
00320181203pqr25483 

And a corresponding JSON file that specifies the schema:

{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

I read the schema file into a DataFrame using:

SchemaFile = spark.read\
    .format("json")\
    .option("header","true")\
    .json(r'C:\Temp\schemaFile\schema.json')

SchemaFile.show()
#+------+----+---+
#|Column|From| To|
#+------+----+---+
#|    id|   1|  3|
#|  date|   4|  8|
#|  name|  12|  3|
#|salary|  15|  5|
#+------+----+---+

Likewise, I am parsing the fixed-width file into a PySpark DataFrame as below:

File = spark.read\
    .format("csv")\
    .option("header","false")\
    .load(r"C:\Temp\samplefile.txt")

File.show()
#+-------------------+
#|                _c0|
#+-------------------+
#|00120181120xyz12341|
#|00220180203abc56792|
#|00320181203pqr25483|
#+-------------------+

I can obviously hard-code the position and length values for each column to get the desired output:

from pyspark.sql.functions import substring
data = File.select(
    substring(File._c0,1,3).alias('id'),
    substring(File._c0,4,8).alias('date'),
    substring(File._c0,12,3).alias('name'),
    substring(File._c0,15,5).alias('salary')
)

data.show()
#+---+--------+----+------+
#| id|    date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+

But how can I use the SchemaFile DataFrame to specify the widths and column names for the lines, so that the schema can be applied dynamically (without hard coding) at run time?

The easiest thing to do here would be to collect the contents of SchemaFile and loop over its rows to extract the desired data.

First, read the schema file as JSON into a DataFrame. Then call collect and map each row to a dictionary:

sfDict = [row.asDict() for row in SchemaFile.collect()]
print(sfDict)
#[{'Column': 'id', 'From': '1', 'To': '3'},
# {'Column': 'date', 'From': '4', 'To': '8'},
# {'Column': 'name', 'From': '12', 'To': '3'},
# {'Column': 'salary', 'From': '15', 'To': '5'}]

Now you can loop over the rows in sfDict and use the values to substring your column:

from pyspark.sql.functions import substring
File.select(
    *[
        substring(
            str='_c0',
            pos=int(row['From']),
            len=int(row['To'])
        ).alias(row['Column']) 
        for row in sfDict
    ]
).show()
#+---+--------+----+------+
#| id|    date|name|salary|
#+---+--------+----+------+
#|001|20181120| xyz| 12341|
#|002|20180203| abc| 56792|
#|003|20181203| pqr| 25483|
#+---+--------+----+------+

Note that we have to cast To and From to integers, since they are specified as strings in your JSON file.
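If you would rather avoid the int() calls inside the loop, a minimal alternative sketch (assuming the SchemaFile and File DataFrames from above; schema_rows is just an illustrative name) is to cast From and To on the DataFrame itself before collecting:

from pyspark.sql.functions import col, substring

# Cast the string columns to integers once, before collecting (sketch)
schema_rows = SchemaFile\
    .withColumn("From", col("From").cast("int"))\
    .withColumn("To", col("To").cast("int"))\
    .collect()

# Each Row now yields Python ints for From/To, so they can be passed
# straight to substring()
File.select(
    *[
        substring('_c0', row['From'], row['To']).alias(row['Column'])
        for row in schema_rows
    ]
).show()

This produces the same output as the version above.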
