java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
I am new to programming with Spark Structured Streaming. I am getting this error after using F.approx_count_distinct; this is my code. My problem is that I want to get a dataframe that detects fraud, but first of all I need to check whether there are people sharing the same card_number. Can anyone help me? Thanks in advance.
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import from_json, col
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.types import *
conf = SparkConf().setAppName("Pruebas").setMaster("local")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sparkSQL = SparkSession \
.builder \
.appName("SparkSQL") \
.master("local") \
.getOrCreate()
broker="localhost:9092"
topic = "transacts"
# Build the streaming dataframe
df = sparkSQL \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", broker) \
.option("failOnDataLoss", "false") \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("includeTImestamp", "true") \
.load()
# Define the schema we will use for the JSON
schema = StructType([ StructField("card_owner", StringType(), True),
StructField("card_number", StringType(), True),
StructField("geography", StringType(), True),
StructField("target", StringType(), True),
StructField("amount", StringType(), True),
StructField("currency", StringType(), True)])
# decode the JSON
# decoding the JSON generates a set of sub-columns inside the value field
df = df.withColumn("value", from_json(df["value"].cast("string"), schema))
df.printSchema()
# select the message timestamp and the JSON columns
df = df.select("timestamp","value.*")
df1 = df \
    .groupBy(df.card_number) \
    .agg(F.approx_count_distinct(df.card_owner).alias('titulares')) \
    .filter((F.col('titulares')>1))
df1 = df1.selectExpr("'a' as key", "to_json(struct(*)) as value")
query= df1.writeStream\
.outputMode("complete")\
.format("kafka")\
.option("topic","aux_topic1")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("checkpointLocation","hdfs://localhost:9000/checkpoints")\
.start()
#query.awaitTermination(200)
# Convert the JSON back into a dataframe
topic1= "aux_topic1"
df1 = sparkSQL \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", broker) \
.option("failOnDataLoss", "false") \
.option("subscribe", topic1) \
.option("startingOffsets", "latest") \
.option("includeTImestamp", "true") \
.load()
# Define the schema we will use for the JSON
schema = StructType([ StructField("card_number", StringType(), True),
StructField("titulares", StringType(), True)])
# decode the JSON
df1 = df1.withColumn("value", from_json(df1["value"].cast("string"), schema))
df1.printSchema()
df1 = df1.select("timestamp","value.*")
df2 = df.join(df1, on="card_number")
# Print to the console
query1= df2.writeStream\
.outputMode("append")\
.format("console")\
.queryName("test")\
.start()
query1.awaitTermination()
The problem seems to be this line:
df1 = df \
.groupBy(df.card_number) \
.agg(F.approx_count_distinct(df.card_owner).alias('titulares')) \
.filter((F.col('titulares')>1))
and more precisely your filter .filter((F.col('titulares')>1))
If you want to get all the card numbers that appear more than once, the following will do the trick.
This is your dataframe:
df.show()
+-----------+-------------+
|card_number| card_owner|
+-----------+-------------+
| 12345| Andrew Smith|
| 98765| John Brown|
| 12345| Andrew Smith|
| 98765| John Brown|
| 33445|Maria Johnson|
+-----------+-------------+
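(For reference, a dataframe like this can be built locally to try the aggregations below; a minimal sketch, reusing the sparkSQL session from the question and only the sample rows shown above.)
# Minimal sketch: build the sample batch dataframe shown above for testing
df = sparkSQL.createDataFrame(
    [('12345', 'Andrew Smith'),
     ('98765', 'John Brown'),
     ('12345', 'Andrew Smith'),
     ('98765', 'John Brown'),
     ('33445', 'Maria Johnson')],
    ['card_number', 'card_owner'])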
Now to get all the counts per card number (filtering out those with no duplicates):
>>> df \
... .groupBy('card_number') \
... .count() \
... .filter('count>1') \
... .show()
+-----------+-----+
|card_number|count|
+-----------+-----+
| 12345| 2|
| 98765| 2|
+-----------+-----+
Now if you want the card_owner as well, then:
>>> df \
... .groupBy(['card_number', 'card_owner']) \
... .count() \
... .filter('count>1') \
... .show()
+-----------+------------+-----+
|card_number| card_owner|count|
+-----------+------------+-----+
| 12345|Andrew Smith| 2|
| 98765| John Brown| 2|
+-----------+------------+-----+
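Putting this back into your streaming job: a minimal sketch, assuming df is the decoded streaming dataframe from your question; it swaps the approx_count_distinct/filter pair for the plain count() aggregation above, and writes to the console sink just for inspection:
# Minimal sketch: the same duplicate check applied to the streaming df
# (assumes df already holds the decoded JSON columns from the question)
duplicates = df \
    .groupBy('card_number') \
    .count() \
    .filter('count > 1')

# streaming aggregations require the "complete" (or "update") output mode
query = duplicates.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()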