I have a dataset (for example). The print statement returns [(1, [2, 3, 4, 5])]. I now need to multiply everything in the sub-array by 2 across the ...
How do I add the values from the dictionary in the RDD, respectively? I have the following: I want the final output to be the following in a numpy array ...
I have a folder that contains n files. I am creating an RDD that contains all the filenames of the above folder with the code below: I want ...
I am working on a very large text document, almost 2 GB. Something like this - I want to read them in Spark and split th ...
I want all the values in my_dict that are in the list [1, 2, 3, 4, 5] to become 1's, and all the values that are not in the list to get a 0. How do I do t ...
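This is plain Python rather than Spark: a dict comprehension with a set for fast membership tests does it. A minimal sketch with hypothetical example data:

```python
my_dict = {"a": 3, "b": 7, "c": 1}  # hypothetical example data
allowed = {1, 2, 3, 4, 5}           # a set makes membership tests O(1)

# 1 where the value is in the allowed set, 0 otherwise
result = {k: 1 if v in allowed else 0 for k, v in my_dict.items()}
print(result)  # {'a': 1, 'b': 0, 'c': 1}
```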
I initiated pyspark in cmd and ran the steps below to sharpen my skills. When I execute a.take(1), I get "_pickle.PicklingError: Could not serialize ob ...
I am new to Spark; we have a project which reads data from HBase and saves it to an RDD. The dataframe count is 5280000. Here is the code: val df = spark ...
After groupByKey I want to filter where the second element is not equal to 1, and get ("b", (1, "m")), ("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m")). grou ...
I have a JavaPairRDD as follows. As groupByKey() doesn't maintain order, orderBy doesn't work here. I want to order the Iterable<Row> using some of the f ...
I'm working with a fairly big dataframe (around 100 thousand rows, with the intent to reach 10 Mil) and it has the following structure: I'd like to ...
I have multiple Spark jobs which share a part of the dataflow graph including an expensive shuffle operation. If I persist that RDD, I see huge improv ...
I'm trying to execute a reduceByKey on a reduceByKey result. The goal is to see if we have a long-tail effect in each year - long tail here means that i ...
Trying to code a Python script that takes a JSON file and a number of CSV files from a Google Drive file, and analyzes and manipulates its data using ...
I am using PySpark to see how many times each timestamp appears in this very large data set using count(). My data set is from a 684 GB .txt file. How ...
I am currently regressing GDP on multiple factors (7 different variables, to be exact). My x variable is quarterly dates (2006-Q1 to 2020-Q4). I need n ...
I am stuck on a problem where I have a dataframe with 2 columns having the schema shown; the actions column actually contains an array of objects, but it' ...
I have CSV data coming as DStreams from traffic counters. A sample is as follows. I want to calculate average speed (for each location) by vehicle cat ...
I created a Docker container from this BigDL image. When I tried to collect the predictions using collect(), this error occurred: Py4JJavaError: An error ...
I have an HTML file which I want to parse in pySpark. Example: but in my Notebook output I have a problem with list elements. They are parsed incorre ...
I'm trying to collect the values of a pyspark dataframe column in Databricks as a list. When I use the collect function, I get a list with extra va ...