"Malformed data. Length is negative" when trying to use Spark Structured Streaming from Kafka with an Avro data source

Question

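I have been trying out Angel Conde's example of Structured Streaming with Kafka and Avro data (Structured Streaming Avro).

However, my data is a bit more complex because it contains nested records. Here is my code: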

private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;
private static final String SNOQTT_SCHEMA = "{"
        +"\"type\": \"record\","
        +"\"name\": \"snoqttv2\","
        +"\"fields\": ["
        +"    { \"name\": \"src_ip\", \"type\": \"string\" },"
        +"    { \"name\": \"classification\", \"type\": \"long\" },"
        +"    { \"name\": \"device_id\", \"type\": \"string\" },"
        +"    { \"name\": \"alert_msg\", \"type\": \"string\" },"
        +"    { \"name\": \"src_mac\", \"type\": \"string\" },"
        +"    { \"name\": \"sig_rev\", \"type\": \"long\" },"
        +"    { \"name\": \"sig_gen\", \"type\": \"long\" },"
        +"    { \"name\": \"dest_mac\", \"type\": \"string\" },"
        +"    { \"name\": \"packet_info\", \"type\": {"
        +"        \"type\": \"record\","
        +"        \"name\": \"packet_info\","
        +"        \"fields\": ["
        +"              { \"name\": \"DF\", \"type\": \"boolean\" },"
        +"              { \"name\": \"MF\", \"type\": \"boolean\" },"
        +"              { \"name\": \"ttl\", \"type\": \"long\" },"
        +"              { \"name\": \"len\", \"type\": \"long\" },"
        +"              { \"name\": \"offset\", \"type\": \"long\" }"
        +"          ],"
        +"        \"connect.name\": \"packet_info\" }},"
        +"    { \"name\": \"timestamp\", \"type\": \"string\" },"
        +"    { \"name\": \"sig_id\", \"type\": \"long\" },"
        +"    { \"name\": \"ip_type\", \"type\": \"string\" },"
        +"    { \"name\": \"dest_ip\", \"type\": \"string\" },"
        +"    { \"name\": \"priority\", \"type\": \"long\" }"
        +"],"
        +"\"connect.name\": \"snoqttv2\" }";

private static Schema.Parser parser = new Schema.Parser();
private static Schema schema = parser.parse(SNOQTT_SCHEMA);

static {
    recordInjection = GenericAvroCodecs.toBinary(schema);
    type = (StructType) SchemaConverters.toSqlType(schema).dataType();
}

public static void main(String[] args) throws StreamingQueryException{
    // Configure log4j for development directly from Java
    LogManager.getLogger("org.apache.spark").setLevel(Level.WARN);
    LogManager.getLogger("akka").setLevel(Level.ERROR);

    // Set the configuration for the streaming context and Spark context
    SparkConf conf = new SparkConf()
            .setAppName("Snoqtt-Avro-Structured")
            .setMaster("local[*]");

    // Initialize the Spark session
    SparkSession sparkSession = SparkSession
            .builder()
            .config(conf)
            .getOrCreate();

    // Reduce task number
    sparkSession.sqlContext().setConf("spark.sql.shuffle.partitions", "3");

    // Start the data stream from Kafka
    Dataset<Row> ds1 = sparkSession
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "snoqttv2")
            .option("startingOffsets", "latest")
            .load();

    // Start the streaming query

    sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        return RowFactory.create(
                record.get("timestamp").toString(),
                record.get("device_id").toString(),
                record.get("ip_type").toString(),
                record.get("src_ip").toString(),
                record.get("dest_ip").toString(),
                record.get("src_mac").toString(),
                record.get("dest_mac").toString(),
                record.get("alert_msg").toString(),
                record.get("sig_rev").toString(),
                record.get("sig_gen").toString(),
                record.get("sig_id").toString(),
                record.get("classification").toString(),
                record.get("priority").toString());
    }, DataTypes.createStructType(type.fields()));

    ds1.printSchema();
    Dataset<Row> ds2 = ds1
            .select("value").as(Encoders.BINARY())
            .selectExpr("deserialize(value) as rows")
            .select("rows.*");

    ds2.printSchema();

    StreamingQuery query1 = ds2
            .groupBy("sig_id")
            .count()
            .writeStream()
            .queryName("Signature ID Count Query")
            .outputMode("complete")
            .format("console")
            .start();

    query1.awaitTermination();
}

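It was all fun and games until the first batch of messages arrived, at which point I hit this error:

18/01/22 14:29:00 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 8) org.apache.spark.SparkException: Failed to execute user defined function($anonfun$27: (binary) => struct<...,timestamp:string,sig_id:bigint,ip_type:string,dest_ip:string,priority:bigint>) at ...

Caused by: com.twitter.bijection.InversionFailure: Failed to invert: [B@232f8415 at ...

Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -25 at ...

Am I doing something wrong? Or is my nested schema the root of all evil in my code? Any help is appreciated.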

Answer 1

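I just updated the repo with an example that uses a nested schema and the new Avro data source.

Before switching to the new data source, I tried the bijection library and ran into the same error you posted. I fixed it by deleting the Kafka temp folder, which reset the old queued data.

Best regards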
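For anyone wondering how stale data can produce a number like -25: Avro's binary encoding stores a string as a zigzag-encoded varint length followed by the UTF-8 bytes, and the first field in the schema above (src_ip) is a string. If an old message on the topic was not written with this schema, the decoder still interprets its first byte as that length. The sketch below uses only the JDK and made-up payloads (they are assumptions for illustration, not bytes taken from the actual topic); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class AvroLengthDemo {

    // Decodes the zigzag varint at the start of payload, which is how Avro's
    // binary encoding stores a long (and therefore a string's length prefix).
    static long decodeLength(byte[] payload) {
        long raw = 0;
        int shift = 0;
        for (byte b : payload) {
            raw |= (long) (b & 0x7F) << shift;       // low 7 bits carry data
            if ((b & 0x80) == 0) {                   // high bit clear: last byte
                return (raw >>> 1) ^ -(raw & 1);     // undo zigzag encoding
            }
            shift += 7;
            if (shift > 63) break;                   // more than 10 bytes: invalid
        }
        throw new IllegalArgumentException("truncated or oversized varint");
    }

    public static void main(String[] args) {
        // Hypothetical well-formed value: prefix 0x0E (zigzag for 7) + "1.2.3.4".
        byte[] avroString = new byte[8];
        avroString[0] = 0x0E;
        System.arraycopy("1.2.3.4".getBytes(StandardCharsets.US_ASCII), 0, avroString, 1, 7);

        // Hypothetical stale message written as plain text, with no length prefix.
        byte[] plainText = "1.2.3.4".getBytes(StandardCharsets.US_ASCII);

        System.out.println(decodeLength(avroString)); // 7
        System.out.println(decodeLength(plainText));  // -25, because '1' is 0x31
    }
}
```

This does not prove what was actually sitting in the topic, but it shows why leftover messages in a different format tend to surface as "Length is negative" rather than as a clearer schema error, and why clearing the old data (or reading from a fresh topic) can make the problem go away.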
