
How can I create a nested list in PySpark?

I need to create a nested list. My txt data looks like this:

(telephone number, time, delta time, lat, long)

... 
0544144,23,86,40.761650,29.940929
0544147,23,104,40.768749,29.968599
0545525,20,86,40.761650,29.940929
0538333,21,184,40.764679,29.929543
05477900,21,204,40.773071,29.975010
0561554,23,47,40.764694,29.927397
...

and my code is:

from pyspark import SparkContext


sc = SparkContext()
rdd_data = sc.textFile("data2.txt")

# split each line on the commas into a list of fields
rdd_data_1 = rdd_data.map(lambda line: line.split(","))

# one RDD per column, cast to the appropriate type
tel0 = rdd_data_1.map(lambda line: int(line[0]))
time1 = rdd_data_1.map(lambda line: int(line[1]))
deltaTime2 = rdd_data_1.map(lambda line: int(line[2]))
lat3 = rdd_data_1.map(lambda line: float(line[3]))
lon4 = rdd_data_1.map(lambda line: float(line[4]))

tel0_list = tel0.collect()
time1_list = time1.collect()
deltaTime2_list = deltaTime2.collect()
lat3_list = lat3.collect()
lon4_list = lon4.collect()

As you can see, each column has a meaning: telephone, time, delta time, and so on. But each line should also be usable as a list. If I want to see the first telephone number:

print(tel0_list[0])

output:

544144

That works fine. But I also need a list for each line.

For example, a data[] list could hold one entry per line. If I look at data[1], the output should be:

(0544147,23,104,40.768749,29.968599)

How can I do that?

Thanks

Since your text file is in CSV format, you can easily load it into a DataFrame if you use Spark 2.x:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
            StructField("tel", IntegerType(), True),
            StructField("time", IntegerType(), True),
            StructField("deltatime", IntegerType(), True),
            StructField("lat", DoubleType(), True),
            StructField("long", DoubleType(), True)
        ])

data = spark.read.csv("data2.txt", header=False, schema=schema)
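
If you prefer not to spell the schema out by hand, Spark can also infer the column types from the data; the columns then get default names (_c0 through _c4), so this alternative sketch renames them:

# alternative: infer the types, then rename the default _c0.._c4 columns
data = (spark.read.csv("data2.txt", header=False, inferSchema=True)
             .toDF("tel", "time", "deltatime", "lat", "long"))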

Then you can access the data with:

>>> data.take(1)
[Row(tel=544144, time=23, deltatime=86, lat=40.76165, long=29.940929)]
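
Each Row behaves like a tuple with named fields, so once a row is on the driver you can read its values by name or by position:

>>> row = data.first()
>>> row.tel
544144
>>> row[0]
544144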

Note: indexing into the DataFrame itself with data[1] does not make sense in Spark, since the rows are distributed across the cluster; collect them to the driver first (e.g. with collect() or take()) and index the resulting Python list.
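
If you would rather stay on the RDD API from your original code, here is a minimal sketch (reusing the spark session from above and the same data2.txt) that builds one tuple per line, which is the per-line list you asked for. Be aware that int() drops the leading zero of the telephone number; keep that field a string if the zero matters:

sc = spark.sparkContext  # reuse the existing session's SparkContext

# parse each line into one typed tuple instead of five parallel column RDDs
rows = (sc.textFile("data2.txt")
          .map(lambda line: line.split(","))
          .map(lambda f: (int(f[0]), int(f[1]), int(f[2]), float(f[3]), float(f[4]))))

data = rows.collect()  # a plain Python list of tuples on the driver
print(data[1])         # (544147, 23, 104, 40.768749, 29.968599)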
