
Pyspark DataFrame TypeError: field a: DoubleType can not accept object 'a' in type <class 'str'>

I have this schema definition:

from pyspark.sql.types import StructType, StructField, DoubleType

customSchema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True)])


import csv

n_1 = sc.textFile("/path/*.txt") \
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')) \
    .filter(lambda line: len(line) > 1) \
    .toDF(customSchema)

This would create a DataFrame, but the problem is that '.mapPartitions' yields every field as <class 'str'> by default, and I need to cast the values to DoubleType before converting them into a DataFrame. Any idea?

Sample data:

[['0,01', '344,01', '0,00', '0,00']]
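For context, the values use a comma as the decimal separator, which is why a direct float cast fails. A minimal plain-Python check (no Spark involved) that illustrates the problem:

value = '344,01'
try:
    float(value)  # raises ValueError: could not convert string to float
except ValueError:
    pass
print(float(value.replace(',', '.')))  # prints 344.01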

or just work with:

n_1 = sc.textFile("/path/*.txt") \
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')) \
    .filter(lambda line: len(line) > 1)

First, it was necessary to collect all the elements and create a matrix (a list of lists) using the second option.

n_1 = sc.textFile("/path/*.txt") \
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')) \
    .filter(lambda line: len(line) > 1)

# Pull everything to the driver; `matrix` is now a plain Python list of lists
# of strings, e.g. [['0,01', '344,01', '0,00', '0,00'], ...]
matrix = n_1.collect()

Once we have this, it is necessary to know which type of data the sublists contain (in my case it was 'str').
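A quick way to confirm the element type on the driver (a minimal check, assuming matrix is non-empty):

print(type(matrix[0][0]))  # <class 'str'>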

matrix = [[x.replace(',', '.') for x in i] for i in matrix]  # replace ',' with '.' so the strings can be parsed as floats

matrix = [[float(x) for x in i] for i in matrix]  # convert every sublist element into float

df = sc.parallelize(matrix).toDF()
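As a side note, the same conversion can stay distributed (no collect()) by casting inside the RDD and reusing customSchema. A minimal sketch, assuming the same semicolon-delimited, comma-decimal input as above:

n_2 = sc.textFile("/path/*.txt") \
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda row: [float(x.replace(',', '.')) for x in row])  # cast before toDF

df = n_2.toDF(customSchema)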
