Pyspark Dataframe TypeError a: DoubleType can not accept object 'a' in type <class 'str'>

Question

I have this function:

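import csv
from pyspark.sql.types import StructType, StructField, DoubleType

# All four columns are declared as nullable doubles
customSchema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True)])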


n_1= sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition],delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1 )\
        .toDF(customSchema)

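This should create a DataFrame. The problem is that '.mapPartitions' yields values of type <class 'str'> by default, and I need to convert them to DoubleType before turning the result into a DataFrame. Any ideas?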

Sample data

[['0,01', '344,01', '0,00', '0,00']]

Or just with:

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition],delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1 )

Answer 1

First, collect all the elements and build a matrix (a list of lists) using the second option.

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition],delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1 )

matrix = n_1.collect()

Once we have this, we need to know which data type ends up in the sublists (in my case, 'str').
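
As a quick check of that point, here is a minimal sketch in plain Python (no Spark needed) showing that csv.reader returns every field as a string. The `lines` list is a hypothetical stand-in for the file contents, built from the sample data in the question; the reader settings are the ones used above.

```python
import csv

# Hypothetical sample line standing in for the contents of the text files
lines = ['0,01;344,01;0,00;0,00']

# Same reader settings as in the question
rows = list(csv.reader(lines, delimiter=';', quotechar='"'))

# Collect the distinct Python types found in the parsed rows
types_found = {type(value) for row in rows for value in row}

print(rows)
print(types_found)
```

Since every value is a str, a schema made of DoubleType fields rejects the rows until they are converted.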

matrix = [[x.replace(',', '.') for x in i] for i in matrix]  # replace ',' with '.' so the values can be converted

matrix = [[float(x) for x in i] for i in matrix]  # convert every sublist element to float

df = sc.parallelize(matrix).toDF()
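
If the data is too large to collect on the driver, the same conversion could probably be done row by row inside the RDD instead. The following is only a sketch of that idea, not part of the original answer: `to_floats` is a hypothetical helper name, and it assumes every field uses a decimal comma and that each row has as many values as the schema has fields.

```python
def to_floats(row):
    """Convert a row of decimal-comma strings such as '344,01' to floats."""
    return [float(value.replace(',', '.')) for value in row]


# Plain-Python check using the sample data from the question
sample = [['0,01', '344,01', '0,00', '0,00']]
converted = [to_floats(row) for row in sample]
print(converted)

# With Spark, the helper could be applied per row before building the
# DataFrame (assumes n_1 and customSchema are defined as in the question):
# df = n_1.map(to_floats).toDF(customSchema)
```

This keeps the work distributed and lets the DataFrame be built with customSchema directly, since the values are already floats when toDF is called.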
