[EN] How to load a csv file in a pyspark DataFrame
How do I convert a CSV file into a DataFrame?
CSV values -
country,2015,2016,2017,2018,2019
Norway,4.141,4.152,4.157,4.166,4.168
Australia,4.077,4.086,4.093,4.110,4.115
Switzerland,4.009,4.036,4.032,4.041,4.046
Netherlands,3.977,3.994,4.043,4.045,4.045
UnitedStates,4.017,4.027,4.039,4.045,4.050
Germany,3.988,3.999,4.017,4.026,4.028
NewZealand,3.982,3.997,3.993,3.999,4.018
I want it in a DataFrame/table format like this -
+----------+-----+-----+-----+-----+-----+
|   country| 2015| 2016| 2017| 2018| 2019|
+----------+-----+-----+-----+-----+-----+
|    Norway|4.141|4.152|4.157|4.166|4.168|
| Australia|4.077|  ...|  ...|  ...|  ...|
|       ...|  ...|  ...|  ...|  ...|  ...|
|NewZealand|  ...|  ...|  ...|  ...|4.018|
+----------+-----+-----+-----+-----+-----+
Read the documentation here. Assuming your file filename.csv is stored at path, you can import it with a very basic configuration.
# Specify a schema
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('country', StringType()),
    StructField('2015', StringType()),
    StructField('2016', StringType()),
    StructField('2017', StringType()),
    StructField('2018', StringType()),
    StructField('2019', StringType()),
])

# Start the import
df = spark.read.schema(schema)\
          .format("csv")\
          .option("header", "true")\
          .option("sep", ",")\
          .load("path/filename.csv")
Note that your numbers will be imported as String, because PySpark does not recognize the dot `.` as a thousands separator. You have to convert them to numerics, like this -
# Convert them to numerics
from pyspark.sql.functions import regexp_replace, col

cols_with_thousands_separator = ['2015', '2016', '2017', '2018', '2019']
for c in cols_with_thousands_separator:
    df = df.withColumn(c, regexp_replace(col(c), '\\.', ''))\
           .withColumn(c, col(c).cast("int"))