
PySpark DataFrame type error: a: DoubleType can not accept object 'a' in type <class 'str'>

I have this code:

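from pyspark.sql.types import StructType, StructField, DoubleType

# All four columns are nullable doubles
customSchema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True)])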


import csv

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
        .filter(lambda line: len(line) > 1)\
        .toDF(customSchema)

which should create a DataFrame. The problem is that the rows produced by '.mapPartitions' contain values of type <class 'str'> (csv.reader yields strings), and I need to cast them to DoubleType before converting the RDD into a DataFrame. Any ideas?

Sample data

[['0,01', '344,01', '0,00', '0,00']]

Alternatively, I could just work with the RDD, without the '.toDF' call:

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
        .filter(lambda line: len(line) > 1)

First, it was necessary to collect all the elements and create a matrix (a list of lists), using the second option.

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
        .filter(lambda line: len(line) > 1)

matrix  = n_1.collect()

Once we have this, we need to know what type of data the sublists contain (in my case it was 'str').
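One quick way to check this is to gather the distinct Python types found in the collected rows. This is only a small sketch: the `matrix` literal below is a stand-in built from the sample data in the question, not the real output of `collect()`, and `types_found` is a name made up for the example.

```python
# Hypothetical stand-in for the result of n_1.collect(), taken from the sample data.
matrix = [['0,01', '344,01', '0,00', '0,00']]

# Gather the distinct Python types found across all sublists.
types_found = {type(x) for row in matrix for x in row}
print(types_found)
```

If the set contains only `str`, every value still has to be converted before Spark will accept it for a DoubleType column.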

# Replace ',' with '.' so that the strings can be parsed as floats
matrix = [[x.replace(',', '.') for x in i] for i in matrix]

# Convert every sublist element to float
matrix = [[float(x) for x in i] for i in matrix]

df = sc.parallelize(matrix).toDF()
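The same conversion can probably also be done row by row, which would avoid collecting everything on the driver. Below is a minimal sketch of such a per-row helper, exercised on a plain Python list instead of an RDD so that it runs without Spark. The name `to_floats` and the `lines` sample are assumptions made for illustration, not part of the original code.

```python
import csv

def to_floats(row):
    """Convert a row of decimal-comma strings such as '344,01' to floats."""
    return [float(value.replace(',', '.')) for value in row]

# Hypothetical stand-in for the lines of one partition: one data line and one
# line that is empty once the NUL character is stripped.
lines = ['0,01;344,01;0,00;0,00', '\0']

# Same parsing and filtering as in the RDD pipeline, done locally.
rows = csv.reader([line.replace('\0', '') for line in lines], delimiter=';', quotechar='"')
converted = [to_floats(row) for row in rows if len(row) > 1]
print(converted)
```

On the RDD from the question, the helper would presumably slot in as `.map(to_floats)` between `.filter(...)` and `.toDF(customSchema)`, so that the rows already hold floats when the schema is applied. This is untested against the original files.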
