I have a dataset (for example). The print statement returns [(1, [2, 3, 4, 5])]. I now need to multiply everything in the sub-array by 2 across the ...
How do I add the values from the dictionary in the RDD, respectively? I have the following: I want the final output to be the following in a numpy array ...
I have a folder that contains n files. I am creating an RDD that contains all the filenames of the above folder with the code below: I want ...
I am working on a very large text document, almost 2 GB. Something like this - I want to read them in Spark and split th ...
I want all the values in my_dict that are in the list [1, 2, 3, 4, 5] to become 1's, and all the values that are not in the list to get a 0. How do I do t ...
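This is plain Python rather than Spark: a dict comprehension with a set for fast membership tests does it. A minimal sketch with hypothetical example data:

```python
my_dict = {"a": 3, "b": 7, "c": 1}  # hypothetical example data
allowed = {1, 2, 3, 4, 5}           # a set makes membership tests O(1)

# 1 where the value is in the allowed set, 0 otherwise
result = {k: 1 if v in allowed else 0 for k, v in my_dict.items()}
print(result)  # {'a': 1, 'b': 0, 'c': 1}
```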
I initiated pyspark in cmd and ran the steps below to sharpen my skills. When I execute a.take(1), I get "_pickle.PicklingError: Could not serialize ob ...
I am new to Spark; we have a project which reads data from HBase and saves it to an RDD. The dataframe count is 5280000. Here is the code: val df = spark ...
After groupByKey I want to filter where the second element is not equal to 1, and get ("b", (1, "m")), ("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m")). grou ...
I have a JavaPairRDD as follows. As groupByKey() doesn't maintain order, orderBy doesn't work here. I want to order the Iterable<Row> using some of the f ...
I'm working with a fairly big dataframe (around 100 thousand rows, with the intent to reach 10 Mil) and it has the following structure: I'd like to ...
I have multiple Spark jobs which share a part of the dataflow graph including an expensive shuffle operation. If I persist that RDD, I see huge improv ...
I'm trying to execute a reduceByKey on a reduceByKey result. The goal is to see if we have a long-tail effect in each year - long tail here means that i ...
Trying to code a Python script that takes a JSON file and a number of CSV files from a Google Drive file, and analyzes and manipulates its data using ...
I am using PySpark to see how many times each timestamp appears in this very large data set using count(). My data set is from a 684 GB .txt file. How ...
I am currently regressing GDP on multiple factors (7 different variables, to be exact). My x variable is quarterly dates (2006-Q1 to 2020-Q4). I need n ...
I am stuck on a problem where I have a dataframe with 2 columns having the schema shown; the actions column actually contains an array of objects, but it' ...
I have CSV data coming as DStreams from traffic counters. A sample is as follows. I want to calculate average speed (for each location) by vehicle cat ...
I created a Docker container from this BigDL image. When I tried to collect the predictions using collect(), this error occurred: Py4JJavaError: An error ...
I have an HTML file which I want to parse in pySpark. Example: but in my Notebook output I have a problem with list elements. They are parsed incorre ...
I'm trying to collect the values of a pyspark dataframe column in Databricks as a list. When I use the collect function, I get a list with extra va ...