I am new to Apache Spark SQL in Scala.
How can I find the size of each Row in an Apache Spark SQL DataFrame and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.
This is actually a tricky problem. Spark SQL uses columnar data storage, so thinking in terms of individual row sizes isn't very natural. You can of course call .rdd on the DataFrame. From there you can filter the resulting RDD, using the techniques from "Calculate size of Object in Java" to determine each object's size, and then convert the filtered RDD of Rows back to a DataFrame using your SQLContext.
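As a rough illustration of the filtering step described above, here is a minimal Java sketch (Java to match the other code on this page). It does not use Spark and does not measure JVM object size. Instead it assumes each row is a list of string cells, as you would get from a CSV file, and approximates the row size as the total UTF-8 byte length of those cells. The class RowSizeFilter and its methods estimateBytes and keepSmallRows are hypothetical names made up for this example, and 1 KB is taken to be 1024 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark.
class RowSizeFilter {

    // Approximate payload size of one row: the sum of the UTF-8 byte
    // lengths of its cells. Null cells count as 0 bytes. JVM object
    // overhead is deliberately ignored (assumption of this sketch).
    static long estimateBytes(List<String> row) {
        long total = 0;
        for (String cell : row) {
            if (cell != null) {
                total += cell.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return total;
    }

    // Keep only the rows whose estimated size is at most thresholdKb
    // kilobytes (1 KB = 1024 bytes in this sketch).
    static List<List<String>> keepSmallRows(List<List<String>> rows, double thresholdKb) {
        long limitBytes = (long) (thresholdKb * 1024);
        List<List<String>> kept = new ArrayList<>();
        for (List<String> row : rows) {
            if (estimateBytes(row) <= limitBytes) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A 2048-character cell pushes the second row over a 1 KB threshold.
        String longCell = new String(new char[2048]).replace('\0', 'x');
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("1", "alice", "short note"));
        rows.add(Arrays.asList("2", "bob", longCell));

        List<List<String>> kept = keepSmallRows(rows, 1.0);
        System.out.println("kept " + kept.size() + " of " + rows.size() + " rows");
    }
}
```

In a Spark job the same kind of predicate would run inside the filter over the Rows. Swapping the byte count for an object-size measurement changes the numbers but not the structure of the filter.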
The solution I propose uses Java, Maven, and Spark 2.4.0, but it can be adapted to Scala easily.
Your project must have the following structure; otherwise you will have to adapt pom.xml to your own layout:
src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>src/main/resources/META-INF/MANIFEST.MF</manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>spark.SparkJavaTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>
Sizeof
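package size;

import java.lang.instrument.Instrumentation;

public final class Sizeof {
    private static Instrumentation instrumentation;

    // Agent entry point, called by the JVM before main() (see Premain-Class in MANIFEST.MF)
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Approximate size of o itself in bytes; objects it references are not counted
    public static long sizeof(Object o) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent not loaded; run with -javaagent");
        }
        return instrumentation.getObjectSize(o);
    }
}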
SparkJavaTest
package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import size.Sizeof;

public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {
        // Read a CSV file with a header row into a Dataset
        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");
        ds.show(false);
        // Get the size of the Dataset object itself (not of the data it refers to)
        System.out.println("size of ds " + Sizeof.sizeof(ds));
        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();
        // Get the size of the JavaRDD object itself
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));
    }
}
MANIFEST.MF
Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest
After that, clean and package the project:
mvn clean package
Then you can run and get the size of your objects:
java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar