How can I find the size of each row in an Apache Spark SQL DataFrame and discard rows larger than a threshold size in kilobytes?

I am new to Apache Spark SQL in Scala.

How can I find the size of each row in an Apache Spark SQL DataFrame and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.

This is actually kind of a tricky problem. Spark SQL uses columnar data storage, so thinking in terms of individual row sizes isn't very natural. We can of course call .rdd on the DataFrame; from there you can filter the resulting RDD using techniques like those in Calculate size of Object in Java to determine each object's size, and then take your RDD of Rows and convert it back to a DataFrame using your SQLContext.
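For example, here is a minimal sketch of that approach in Scala (which is what the question asks for). Instead of the instrumentation technique, it uses Spark's built-in org.apache.spark.util.SizeEstimator, which estimates the in-memory JVM size of an object, so the sizes are approximations rather than exact serialized sizes; the dropLargeRows helper name is just for illustration:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.util.SizeEstimator

// Drop every row whose estimated in-memory size exceeds thresholdKb kilobytes.
def dropLargeRows(df: DataFrame, sqlContext: SQLContext, thresholdKb: Long): DataFrame = {
  val thresholdBytes = thresholdKb * 1024
  // Fall back to the RDD of Rows and filter on the estimated object size...
  val filtered = df.rdd.filter(row => SizeEstimator.estimate(row) <= thresholdBytes)
  // ...then rebuild a DataFrame with the original schema.
  sqlContext.createDataFrame(filtered, df.schema)
}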

The solution I propose uses Java, Maven, and Spark 2.4.0, but it can easily be adapted to Scala.

You must have the following structure, otherwise you will have to adapt your pom.xml to your project structure:

src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>
                            src/main/resources/META-INF/MANIFEST.MF
                        </manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>
                                spark.SparkJavaTest
                            </mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>

Sizeof

package size;

import java.lang.instrument.Instrumentation;

// Java agent that captures the Instrumentation instance at JVM startup.
// The Premain-Class entry in MANIFEST.MF points here, and the JVM calls
// premain before main when started with -javaagent.
final public class Sizeof {
    private static Instrumentation instrumentation;

    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the (shallow) size in bytes of the given object, as reported
    // by the JVM.
    public static long sizeof(Object o) {
        return instrumentation.getObjectSize(o);
    }
}

SparkJavaTest

package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import size.Sizeof;

public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {

        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");

        ds.show(false);
        // Get the size of a Dataset
        System.out.println("size of ds " + Sizeof.sizeof(ds));

        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();
        // Get the size of a JavaRDD
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));
    }
}

MANIFEST.MF

Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest

After that, you clean and package:

mvn clean package

Then you can run and get the size of your objects:

java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar 
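Since the answer notes that this adapts easily to Scala, the following is a hedged sketch of what a Scala counterpart might look like, including the row-level filtering the original question actually asked for. It assumes the agent jar built above is on the classpath and the JVM is launched with the same -javaagent flag so that premain has run; it also relies on running in local mode, where driver and executors share one JVM, so Sizeof.sizeof is usable inside the filter. Keep in mind that getObjectSize reports the shallow size of each Row object, not the deep size of its contents:

package spark

import org.apache.spark.sql.SparkSession
import size.Sizeof

// Hypothetical Scala counterpart of SparkJavaTest: discards rows whose
// instrumented (shallow) size exceeds a threshold given in kilobytes.
object SparkScalaTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalaSparkTest")
      .master("local")
      .getOrCreate()

    val ds = spark.read.option("header", "true").csv("sample.csv")

    val thresholdKb = 1L // example threshold: 1 KB
    val thresholdBytes = thresholdKb * 1024

    // Keep only the rows whose measured size is at or under the threshold.
    val kept = ds.rdd.filter(row => Sizeof.sizeof(row) <= thresholdBytes)

    spark.createDataFrame(kept, ds.schema).show(false)
  }
}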
