How can I find the size of each row in an Apache Spark SQL DataFrame and discard rows larger than a threshold size in kilobytes?

I am new to Apache Spark SQL in Scala.

How can I find the size of each row in an Apache Spark SQL DataFrame and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.

This is actually kind of a tricky problem. Spark SQL uses columnar data storage, so thinking in terms of individual row sizes isn't very natural. We can of course call .rdd on the DataFrame; from there you can filter the resulting RDD using techniques like those in Calculate size of Object in Java to determine each object's size, and then take your RDD of Rows and convert it back to a DataFrame using your SQLContext.
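For example, here is a minimal sketch of that approach in Scala (which is what the question asks for). Instead of the instrumentation technique, it uses Spark's built-in org.apache.spark.util.SizeEstimator, which estimates the in-memory JVM size of an object, so the sizes are approximations rather than exact serialized sizes; the dropLargeRows helper name is just for illustration:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.util.SizeEstimator

// Drop every row whose estimated in-memory size exceeds thresholdKb kilobytes.
def dropLargeRows(df: DataFrame, sqlContext: SQLContext, thresholdKb: Long): DataFrame = {
  val thresholdBytes = thresholdKb * 1024
  // Fall back to the RDD of Rows and filter on the estimated object size...
  val filtered = df.rdd.filter(row => SizeEstimator.estimate(row) <= thresholdBytes)
  // ...then rebuild a DataFrame with the original schema.
  sqlContext.createDataFrame(filtered, df.schema)
}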

The solution I propose uses Java, Maven, and Spark 2.4.0, but it can easily be adapted to Scala.

You must have the following structure, otherwise you will have to adapt your pom.xml to your project structure:

src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>
                            src/main/resources/META-INF/MANIFEST.MF
                        </manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>
                                spark.SparkJavaTest
                            </mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>

Sizeof

package size;

import java.lang.instrument.Instrumentation;

// Java agent that captures the Instrumentation instance at JVM startup.
// The Premain-Class entry in MANIFEST.MF points here, and the JVM calls
// premain before main when started with -javaagent.
final public class Sizeof {
    private static Instrumentation instrumentation;

    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the (shallow) size in bytes of the given object, as reported
    // by the JVM.
    public static long sizeof(Object o) {
        return instrumentation.getObjectSize(o);
    }
}

SparkJavaTest

package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import size.Sizeof;

public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {

        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");

        ds.show(false);
        // Get the size of a Dataset
        System.out.println("size of ds " + Sizeof.sizeof(ds));

        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();
        // Get the size of a JavaRDD
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));
    }
}

MANIFEST.MF

Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest

After that, you clean and package:

mvn clean package

Then you can run and get the size of your objects:

java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar 
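Since the answer notes that this adapts easily to Scala, the following is a hedged sketch of what a Scala counterpart might look like, including the row-level filtering the original question actually asked for. It assumes the agent jar built above is on the classpath and the JVM is launched with the same -javaagent flag so that premain has run; it also relies on running in local mode, where driver and executors share one JVM, so Sizeof.sizeof is usable inside the filter. Keep in mind that getObjectSize reports the shallow size of each Row object, not the deep size of its contents:

package spark

import org.apache.spark.sql.SparkSession
import size.Sizeof

// Hypothetical Scala counterpart of SparkJavaTest: discards rows whose
// instrumented (shallow) size exceeds a threshold given in kilobytes.
object SparkScalaTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalaSparkTest")
      .master("local")
      .getOrCreate()

    val ds = spark.read.option("header", "true").csv("sample.csv")

    val thresholdKb = 1L // example threshold: 1 KB
    val thresholdBytes = thresholdKb * 1024

    // Keep only the rows whose measured size is at or under the threshold.
    val kept = ds.rdd.filter(row => Sizeof.sizeof(row) <= thresholdBytes)

    spark.createDataFrame(kept, ds.schema).show(false)
  }
}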
