I'm using a DataFrame in Spark to split and store data in a tabular format. The data in my file looks like this:
{"click_id": 123, "created_at": "2016-10-03T10:50:33", "product_id": 98373, "product_price": 220.50, "user_id": 1, "ip": "10.10.10.10"}
{"click_id": 124, "created_at": "2017-02-03T10:51:33", "product_id": 97373, "product_price": 320.50, "user_id": 1, "ip": "10.13.10.10"}
{"click_id": 125, "created_at": "2017-10-03T10:52:33", "product_id": 96373, "product_price": 20.50, "user_id": 1, "ip": "192.168.2.1"}
I've written this code to split the data:
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf
spark = SparkSession \
.builder \
.appName("Hello") \
.config("World") \
.getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
ratings = spark.createDataFrame(
sc.textFile("transactions.json").map(lambda l: l.split(',')),
["Col1","Col2","Col3","Col4","Col5","Col6"]
)
ratings.registerTempTable("ratings")
final_df = sqlContext.sql("select * from ratings");
final_df.show(20,False)
The code above runs, but in its output each column contains both the key and the value: the first column shows "click_id" together with its number, the second shows "created_at" together with its timestamp, and so on.
I want only the values in the table: click_id, created_at, product_id and so on.
How do I get only those values into my table?
In your map function, parse each line as a JSON object instead of splitting it on commas.
map(lambda l: l.split(','))
should become
map(lambda l: json.loads(l))
after you have imported json:
import json
Also, if you remove the column definition
["Col1","Col2","Col3","Col4","Col5","Col6"]
you will get the column names from the JSON keys.
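To see why the split version shows keys next to values, here is a small plain-Python sketch (no Spark needed) that runs both approaches on the first sample line from the question. The variable names are only for illustration.

```python
import json

# First sample line from the question's transactions.json.
line = ('{"click_id": 123, "created_at": "2016-10-03T10:50:33", '
        '"product_id": 98373, "product_price": 220.50, '
        '"user_id": 1, "ip": "10.10.10.10"}')

# Splitting on commas leaves the key, quotes and braces in every piece,
# which is what ended up in Col1..Col6.
pieces = line.split(',')
print(pieces[0])

# Parsing gives a dict: the keys can become column names and the
# values are already typed (int, float, str).
record = json.loads(line)
print(record["click_id"], record["product_price"])
```

The same thing happens per line inside the RDD map, so swapping split for json.loads is what removes the keys from the cells.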
If you want to use only the DataFrame API, you can read the file directly:
ratings = spark.read.json("transactions.json")
This loads the JSON into a DataFrame and maps the JSON keys to column names. You can then select and rename the columns with the code below.
from pyspark.sql.functions import col

# Pick each JSON key and rename it to the generic column name.
ratings = ratings.select(col('click_id').alias('Col1'),
                         col('created_at').alias('Col2'),
                         col('product_id').alias('Col3'),
                         col('product_price').alias('Col4'),
                         col('user_id').alias('Col5'),
                         col('ip').alias('Col6'))
This way you can also cast columns to the relevant data types, e.g. col('product_price').cast('double').alias('Col4'), so that they are saved to the database with proper types.
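Putting the read, rename and cast steps together, a minimal sketch could look like the following. COLUMN_SPEC and load_clicks are hypothetical names, and the target types (long, timestamp, double, string) are my assumption based on the sample rows, so adjust them to whatever your database expects.

```python
# Hypothetical mapping: JSON key -> (new column name, Spark SQL type).
# The types are assumptions inferred from the sample data.
COLUMN_SPEC = {
    "click_id": ("Col1", "long"),
    "created_at": ("Col2", "timestamp"),
    "product_id": ("Col3", "long"),
    "product_price": ("Col4", "double"),
    "user_id": ("Col5", "long"),
    "ip": ("Col6", "string"),
}


def load_clicks(spark, path="transactions.json"):
    """Read line-delimited JSON, then cast and rename every column."""
    # Imported here so the mapping above can be inspected without Spark.
    from pyspark.sql.functions import col

    raw = spark.read.json(path)  # JSON keys become column names
    return raw.select(*[
        col(key).cast(dtype).alias(name)
        for key, (name, dtype) in COLUMN_SPEC.items()
    ])
```

With the session from the question you would call it as ratings = load_clicks(spark) and then ratings.show(20, False).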