
How to remove utf format in a dataframe in Pyspark and convert column from string to Integer

I need to remove utf formatting and convert a column to Integer type.

Below is what I have done to remove the utf format:

>>auction_data = auction_raw_data.map(lambda line: line.encode("ascii","ignore").split(","))
>>auction_data.take(2)
>>[['8211480551', '52.99', '1.201505', 'hanna1104', '94', '49.99', '311.6'], ['8211480551', '50.99', '1.203843', 'wrufai1', '90', '49.99', '311.6']]

But when I create a DataFrame from the same data using the schema and try to retrieve particular rows, the values come back prefixed with " u' ".

>>schema = StructType([ StructField("auctionid", StringType(), True),
StructField("bid", StringType(), True),
StructField("bidtime", StringType(), True),
StructField("bidder", StringType(), True),
StructField("bidderrate", StringType(), True),
StructField("openbid", StringType(), True),
StructField("price", StringType(), True)])

>>xbox_df = sqlContext.createDataFrame(auction_data,schema)
>>xbox_df.registerTempTable("auction")
>>first_line = sqlContext.sql("select * from auction where auctionid=8211480551").collect()
>>for i in first_line:
>>   print i

>>Row(auctionid=u'8211480551', bid=u'52.99', bidtime=u'1.201505', bidder=u'hanna1104', bidderrate=u'94', openbid=u'49.99', price=u'311.6')
>>Row(auctionid=u'8211480551', bid=u'50.99', bidtime=u'1.203843', bidder=u'wrufai1', bidderrate=u'90', openbid=u'49.99', price=u'311.6')

How can I remove the u' in front of the values? I also want to convert the bid value to an Integer. When I change the type directly in the schema definition, I get an error saying " TypeError: IntegerType can not accept object in type ".

I am loading JSON without a schema, so I don't know if that makes a difference. I have no issues converting fields to int when using select. This is what I do:

from pyspark.sql.functions import col
...
# cast the column to an integer type while selecting it
df = df.select(col('intField').cast('int'))
df.show()
# shows intField as an integer column, e.g. 123
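
If you want to keep your explicit schema instead of casting afterwards, another option is to convert the values in plain Python before calling createDataFrame. The TypeError appears because the rows still contain strings while the schema declares IntegerType, and createDataFrame does not convert them for you. The u' prefix is only how Python 2 displays unicode strings inside a Row; it is not part of the data. Below is a rough sketch of the conversion step, with no Spark needed to try it. The helper name parse_auction_line and the target type chosen for each column are my assumptions, not something from your code:

```python
# Hypothetical helper: the name and the chosen target types are
# assumptions for illustration, not taken from the question.
def parse_auction_line(line):
    """Split one CSV line and convert each field to a Python type
    matching the type intended for that column in the schema."""
    auctionid, bid, bidtime, bidder, bidderrate, openbid, price = line.split(",")
    return (
        auctionid,        # stays a string (StringType)
        int(float(bid)),  # '52.99' -> 52, the fraction is truncated (IntegerType)
        float(bidtime),   # DoubleType
        bidder,           # stays a string (StringType)
        int(bidderrate),  # IntegerType
        float(openbid),   # DoubleType
        float(price),     # DoubleType
    )

row = parse_auction_line("8211480551,52.99,1.201505,hanna1104,94,49.99,311.6")
print(row)
print(row[3])  # printing a single value shows the text itself, without quotes
```

With rows parsed like this, auction_raw_data.map(parse_auction_line) could be passed to sqlContext.createDataFrame together with a schema that uses IntegerType() for bid and bidderrate and DoubleType() for bidtime, openbid and price. Note that int(float(bid)) drops the cents, so DoubleType may be the better fit for bid if you need them.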
