How do I read a text file & apply a schema with PySpark?

The .txt file looks like this:

1234567813572468
1234567813572468
1234567813572468
1234567813572468
1234567813572468

When I read it in and split it into 3 distinct columns, I get this (perfect):

df = spark.read.option("header"     , "false")\
               .option("inferSchema", "true" )\
               .text( "fixed-width-2.txt"    )

sorted_df = df.select(
    df.value.substr(1, 4).alias('col1'),
    df.value.substr(5, 4).alias('col2'),
    df.value.substr(8, 4).alias('col3'),
).show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
|1234|5678|8135|
+----+----+----+

However, if I were to read it again and apply a schema...

from pyspark.sql.types import *
schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])
df_new = spark.read.csv("fixed-width-2.txt", schema=schema)
df_new.printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: integer (nullable = true)

The data from the file is gone:

df_new.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
+----+----+----+

So my question is: how can I read in this text file and apply a schema?

When reading with a schema that types col1 as int, the value 1234567813572468 exceeds the maximum int value (IntegerType is a 32-bit signed integer), which is why the rows come back empty. Read it with LongType instead:

from pyspark.sql.types import LongType, StructField, StructType

schema = StructType([StructField('col1', LongType(), True)])
spark.read.csv("path", schema=schema).show()
#+----------------+
#|            col1|
#+----------------+
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#|1234567813572468|
#+----------------+
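
If you still need the three 4-digit columns after this read, one option (a sketch, reusing the same substr positions as the question) is to cast the long back to a string and slice it:

from pyspark.sql.functions import col

# read the full 16-digit value as a long, then slice it back into 4-digit ints
df_long = spark.read.csv("path", schema=schema)
split_df = df_long.select(
    col("col1").cast("string").substr(1, 4).cast("int").alias("col1"),
    col("col1").cast("string").substr(5, 4).cast("int").alias("col2"),
    col("col1").cast("string").substr(8, 4).cast("int").alias("col3"),
)
split_df.show()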

Using the RDD API:

An easier way is to read the fixed-width file with .textFile (which returns an RDD), apply the transformations with .map, and then convert to a DataFrame using the schema.


from pyspark.sql.types import *
schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])
df = spark.createDataFrame(
    spark.sparkContext.textFile("fixed_width.csv")
         .map(lambda x: (int(x[0:4]), int(x[4:8]), int(x[8:12]))),
    schema)

df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#|1234|5678|1357|
#+----+----+----+

df.printSchema()
#root
# |-- col1: integer (nullable = true)
# |-- col2: integer (nullable = true)
# |-- col3: integer (nullable = true)
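
One caveat with the .map lambda: a line that is too short or non-numeric raises a ValueError and fails the whole job. A minimal sketch that maps bad lines to nulls instead (parse_line is a hypothetical helper, not part of the original answer):

def parse_line(line):
    # hypothetical helper: return nulls for short or non-numeric lines
    try:
        return (int(line[0:4]), int(line[4:8]), int(line[8:12]))
    except ValueError:
        return (None, None, None)

safe_df = spark.createDataFrame(
    spark.sparkContext.textFile("fixed_width.csv").map(parse_line), schema)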

Using the DataFrame API:

df = spark.read.option("header"     , "false")\
               .option("inferSchema", "true" )\
               .text( "path")

sorted_df = df.select(
    df.value.substr(1, 4).alias('col1'),
    df.value.substr(5, 4).alias('col2'),
    df.value.substr(8, 4).alias('col3'),
)
# dynamic cast expression (col must be imported from pyspark.sql.functions)
from pyspark.sql.functions import col

casting = [col(col_name).cast("int").name(col_name) for col_name in sorted_df.columns]
sorted_df = sorted_df.select(casting)

#required dataframe
sorted_df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#+----+----+----+

# just in case you want to change the types
schema = StructType([StructField('col1', IntegerType(), True),
                     StructField('col2', IntegerType(), True),
                     StructField('col3', IntegerType(), True)])

df = spark.createDataFrame(sorted_df.rdd, schema)
df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#|1234|5678|8135|
#+----+----+----+
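
For reference, the substring and cast steps can also be fused into a single select, avoiding the extra pass through sorted_df.rdd (a sketch, assuming the same fixed positions as above):

df = spark.read.text("path")
parsed = df.select(
    df.value.substr(1, 4).cast("int").alias("col1"),
    df.value.substr(5, 4).cast("int").alias("col2"),
    df.value.substr(8, 4).cast("int").alias("col3"),
)
parsed.printSchema()  # columns are already integers; no second schema pass needed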
