
operation inside map function in pyspark

I want to take the data from the file name (as it contains some info) and write it to a csvfile_info file, without using a loop. I am new to PySpark. Can someone please help me with the code and let me know how I can proceed? This is what I tried...

c = os.path.join("-------")

input_file = sc.textFile(fileDir)
file1= input_file.split('_')
csvfile_info= open(c,'a')
details= file1.map(lambda p:
    name=p[0], 
    id=p[1],
    from_date=p[2],
    to_date=p[3],
    TimestampWithExtension=p[4]\
    file_timestamp=TimestampWithExtension.split('.')[0]\
    info = '{0},{1},{2},{3},{4},{5} \n'.\
    format(name,id,from_date,to_date,file_timestamp,input_file)\
    csvfile_info.write(info)
    )

Don't try to write the data inside of the map() function. You should instead map each record to the appropriate string, and then dump the resulting RDD to a file. Try this:

input_file = sc.textFile(fileDir)  # returns an RDD of strings, one element per line

def map_record_to_string(x):
    p = x.split('_')
    name = p[0]
    id = p[1]
    from_date = p[2]
    to_date = p[3]
    TimestampWithExtension = p[4]

    file_timestamp = TimestampWithExtension.split('.')[0]
    # Use fileDir (the path string) here rather than input_file:
    # referencing an RDD from inside a transformation raises an error in Spark.
    info = '{0},{1},{2},{3},{4},{5}'.format(
        name,
        id,
        from_date,
        to_date,
        file_timestamp,
        fileDir
    )
    # No trailing '\n' needed: saveAsTextFile() writes one element per line.
    return info

details = input_file.map(map_record_to_string)  # returns a different RDD
details.saveAsTextFile("path/to/output")

Note: I haven't tested this code, but this is one approach you could take.
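If you want to sanity-check the mapped records before writing them out, one option (a minimal sketch, not part of the original answer) is to pull a few elements back to the driver with take():

# take() is an action: it triggers evaluation and returns a plain Python list
for line in details.take(5):
    print(line)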


Explanation

From the docs, input_file = sc.textFile(fileDir) will return an RDD of strings with the file contents.

All of the operations you want to do are on the contents of the RDD, i.e. the elements of the file. Calling split() on the RDD doesn't make sense, because split() is a string method, not an RDD method. What you want to do instead is call split() and the other operations on each record (each line in the file) of the RDD. This is exactly what map() does.
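To make that concrete, here is a small sketch (the record value is made up, assuming the underscore-separated pattern from the question):

record = 'somename_123_20170101_20170131_20170201120000.csv'
parts = record.split('_')   # fine: split() is a str method
# input_file.split('_')     # fails: an RDD object has no split() method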

An RDD is like an iterable, but you don't operate on it with a traditional loop. It's an abstraction that allows for parallelization. From the user's perspective, the map(f) function applies the function f to each element in the RDD, as would be done in a loop. Functionally, calling input_file.map(f) is equivalent to the following:

# let rdd_as_list be a list of strings containing the contents of the file
map_output = []
for record in rdd_as_list:
    map_output.append(f(record))

Or equivalently:

# let rdd_as_list be a list of strings containing the contents of the file
map_output = map(f, rdd_as_list)
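(One caveat with that last snippet: in Python 3 the built-in map() returns a lazy iterator rather than a list, which mirrors how RDD transformations are lazily evaluated. Materialize it to see the results:)

map_output = list(map(f, rdd_as_list))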

Calling map() on an RDD returns a new RDD, whose contents are the results of applying the function. In this case, details is a new RDD and it contains the rows of input_file after they have been processed by map_record_to_string.

You could have also written the map() step as details = input_file.map(lambda x: map_record_to_string(x)) if that makes it easier to understand.
