I have this Dataset and I'd like a more flexible way of grouping and editing the grouped data. As an example I wanted to remove the second Random_Text ...
I have a Spark Dataset with known columns that can therefore be "cast" as a Dataset of a case class, e.g. case class Record(id: String, occurredAt: ...
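The usual approach for this kind of question is `Dataset.as[T]` with the implicits in scope. A minimal sketch; the `Record` fields shown are assumptions, since the original case class is truncated:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed shape of the case class; the original snippet is truncated.
case class Record(id: String, occurredAt: Timestamp)

object TypedDatasetExample {
  // "Casts" an untyped DataFrame into a Dataset[Record]; this fails at
  // analysis time if column names or types do not line up with Record.
  def asRecords(spark: SparkSession): Dataset[Record] = {
    import spark.implicits._
    // An untyped DataFrame whose columns happen to match Record
    val df = Seq(
      ("a", Timestamp.valueOf("2024-01-01 00:00:00")),
      ("b", Timestamp.valueOf("2024-01-02 00:00:00"))
    ).toDF("id", "occurredAt")
    df.as[Record]
  }
}
```

After the cast, lambdas get the typed fields directly, e.g. `ds.filter(_.id == "a")` instead of `df.filter(col("id") === "a")`.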
I'm working with Apache Spark and I have the following code: Dataset<Row> tradesDataset = sparkSession .sql("select * from a_table") ...
A small question regarding an integration between Splunk and Apache Spark. Currently, I am running a search query in Splunk, and the result is quite big. And ...
I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that are entirely made up of missing values, and I want to d ...
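The question is in PySpark, but the same idea translates directly; a sketch in Scala (kept in one language with the other snippets here): count non-null values per column in a single pass, then select only the columns with a non-zero count.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object DropAllNullColumns {
  // Returns a copy of df without the columns that contain only nulls.
  def dropAllNull(df: DataFrame): DataFrame = {
    // One aggregation pass: count(when(...)) counts non-null values per column.
    val counts = df.select(df.columns.map(c =>
      count(when(col(c).isNotNull, 1)).alias(c)): _*
    ).head()
    val keep = df.columns.filter(c => counts.getAs[Long](c) > 0)
    df.select(keep.map(col): _*)
  }
}
```

Doing all the counts in one `select` matters here: a naive loop with one `count` per column would scan the dataset once per column.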
How do wide transformations actually work based on the shuffle-partitions configuration? If I have the following program: Does it mean sort would output ...
I am trying to get Scala and Spark to work with Datasets and aggregation functions. Based on the mapFunctionToTheSchema (which returns multiple re ...
When we have to convert a Spark DataFrame to a Dataset, we generally use a case class; that is, we are converting untyped Rows to a type. Example: Le ...
How can I select only the 2nd and 5th columns from a CSV file (no column names in the file) in Java Spark? Code as below: ...
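When a CSV is read without a header, Spark auto-names the columns `_c0`, `_c1`, ... in file order, so the 2nd and 5th columns are `_c1` and `_c4`. The question asks for Java, but the column-naming trick is identical; a sketch in Scala for consistency with the other snippets here:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectCsvColumns {
  // With header=false, Spark names columns _c0, _c1, ... in file order,
  // so the 2nd and 5th columns are _c1 and _c4.
  def select2ndAnd5th(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "false")
      .csv(path)
      .select("_c1", "_c4")
}
```

In Java the body is the same calls on `sparkSession.read()`, ending in `.select("_c1", "_c4")`.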
I have two big files: an email file and an attachment file. For simplicity say: NOTE: a broadcast-variable join has already been performed with the email file wit ...
My input looks like below: Required output: The error_codes column in the output dataset is a Seq of strings. I need to make an array; I can change S ...
In a Spark session: val spark = SparkSession .builder() .appName("Spark SQL basic example") .config("spark.some.config.option", "some-value") ...
I have a dataset which has the following schema: I want to access the data type of each StructField. E.g. if the data type of col_name_1 is NullTyp ...
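A Dataset's schema is a `StructType`, whose `fields` are `StructField` case classes, so pattern matching works directly. A minimal sketch that picks out the columns declared as `NullType` (the check the question seems to be heading toward):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{NullType, StructField}

object InspectSchema {
  // Names of the columns whose declared data type is NullType.
  // StructField is a case class (name, dataType, nullable, metadata),
  // so it can be destructured in a pattern match.
  def nullTypedColumns(df: DataFrame): Seq[String] =
    df.schema.fields.collect {
      case StructField(name, NullType, _, _) => name
    }
}
```

The same pattern generalizes: match on `IntegerType`, `StringType`, etc., or inspect `field.dataType` directly in a loop.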
I want to select the median value of one of a dataset's columns (the median being the value located at the middle of a set of values ranked in an ascendi ...
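The common answer to this is `percentile_approx` at the 0.5 quantile; a sketch (the cast to double is there so the result type is uniform regardless of the input column's numeric type):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

object MedianOfColumn {
  // Approximate median of a numeric column via percentile_approx(col, 0.5).
  // On small data it returns the exact middle value; for an exact median on
  // big data you would have to sort and index, which is far more expensive.
  def median(df: DataFrame, column: String): Double =
    df.select(expr(s"percentile_approx(CAST($column AS DOUBLE), 0.5)"))
      .head().getDouble(0)
}
```

An alternative with a tunable error bound is `df.stat.approxQuantile(column, Array(0.5), relativeError)`, where a relative error of 0.0 forces an exact computation.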
I'm facing a problem in Spark where two skewed datasets take too long to join. One (or both) of the datasets has skewed data in it, and it's used as the j ...
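The classic workaround for a skewed join is key salting: spread each hot key over N buckets on the large side and replicate the other side once per bucket. A sketch, assuming the join key column is named "key" (the real column names are not in the excerpt):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SaltedJoin {
  // Key salting: assumes `large` is the skewed side and both frames
  // join on a column named "key" (hypothetical name).
  def saltedJoin(large: DataFrame, small: DataFrame, numSalts: Int): DataFrame = {
    // Each row of the large side lands in a random salt bucket 0..numSalts-1.
    val saltedLarge = large.withColumn("salt", (rand() * numSalts).cast("int"))
    // Replicate every row of the small side once per salt value,
    // so each bucket of a hot key still finds its match.
    val saltedSmall = small.withColumn("salt",
      explode(array((0 until numSalts).map(lit): _*)))
    saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")
  }
}
```

On Spark 3.x it is worth trying adaptive skew-join handling first (`spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`), which splits skewed partitions automatically without manual salting.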
I want to write a dataset/dataframe to a CSV after performing several transformations (union) on the original dataset/dataframe. The dataset/dataframe ...
I want to parse the address column from the given table structure using the addressParser function to get number, street, city, and country. Sample Input: ...
I have a use case where I need to read a JSON file or JSON string using Spark as a Dataset[T] in Scala. The JSON file has nested elements, and some of th ...
I have one imputation method that performs mean, median, and mode operations, but it fails if the column's data type is not Double/Float. My Java code: ...
I have been searching for links, documents, or articles that will help me understand when to choose Datasets over DataFrames and vice versa. ...