I can make a Spark DataFrame with a vector column with the toDF method. I'm not sure how to create a vector column with the createDataFrame method. ...
I have read other related questions but have not found an answer. I want to create a DataFrame from a case class in Spark 2.3 (Scala 2.11.8). Code ...
Imagine a CSV as follows: I want to automatically obtain a DF with 4 columns a, b, c, d. A manual technique could be: The problem with this techni ...
I have a Pandas dataframe with one column containing string IDs. I am using idxmax() to return the index of the found IDs but since the data is over a ...
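A small pandas sketch of the `idxmax()` pattern the question above describes: a boolean mask's `idxmax()` returns the label of the first `True`, i.e. the first matching ID. The sample frame and target value are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the large string-ID column described above
df = pd.DataFrame({"id": ["x1", "x2", "x3", "x2"]})

# A boolean mask's idxmax() returns the index label of the FIRST True value,
# which here is the first row whose id matches the target.
mask = df["id"] == "x2"
first_match = mask.idxmax()          # -> 1

# Caveat: on an all-False mask idxmax() still returns the first label,
# so check mask.any() before trusting the result.
found = mask.any()
```

For very large frames this stays vectorized, which is usually faster than iterating rows to find the first match.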
I am trying to use a Scala type class on Spark types; here is a small code snippet I wrote. When I run this in my local IntelliJ, the following error is ...
I am trying to compute an 80% trimmed mean for every group in Scala to get rid of the outliers. But this has to be applied only if the number of records is ...
How can I obtain the keys of a grouped Spark DataFrame? And another question: what does a pyspark.sql.group.GroupedData object include? ...
Input Data: Code After reading the data into a DF with columns key, data, value, I am trying to order by the column key and drop the same ...
I want to partition data using ID, and within each partition I want to apply a set of operations and take distinct values. Doing distinct within each parti ...
I have a problem with a for-loop program, like below: but "new_df_name" is just a variable of String type. How can I realize this? ...
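The usual answer to the question above is not to synthesize variable names from strings at runtime, but to keep the frames in a dict keyed by the generated name. A minimal pure-Python sketch (the suffixes and the list stand-in for a DataFrame are invented for illustration):

```python
# Instead of trying to create a variable literally named new_df_name,
# store each frame in a dict under the generated name.
frames = {}
for suffix in ["2021", "2022"]:
    name = f"new_df_{suffix}"
    frames[name] = [suffix]          # stand-in for a real DataFrame

# Later lookups use the same generated string:
result = frames["new_df_2021"]
```

This keeps the loop purely data-driven and avoids fragile tricks like `globals()` assignment.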
I have a dataframe df1 with a column col1 that has the structure: and another dataframe df2 with col1 that has the structure: In order to union df1.unio ...
I'm trying to read Kafka topics through Apache Spark Streaming and am not able to figure out how to transform the data in a DStream to a DataFrame and the ...
I have a dataframe which I am writing to Hive table using partitionBy - If I create another dataframe and want to append the content of this data f ...
I get log4j-format logs, process them, and store them in Spark. I am not in a clustered or multi-node environment; I am using Spark as a single-node applicati ...
To give some background, I am trying to run TPCDS benchmark on Spark with and without Spark's catalyst optimizer. For complicated queries on smaller d ...
I have two data frames. Data frame one: Data frame two: Now I want to add all columns of data frame one to data frame two, except for the records ...
I'm doing some kind of aggregation on the dataframe I have created. Here are the steps. However, when I do a printSchema on my newly created DataFra ...
I have a Spark structured streaming application (listening to Kafka) that is also reading from a persistent table in S3. I am trying to have each micro ...
How can I replace empty values in a column Field1 of DataFrame df? This command does not produce the expected result: The expected result: ...
I'm having trouble trying to filter rows in a column based on multiple conditions. Basically I'm storing my multiple conditions in an array and I want t ...