PySpark join with functions and difference between timestamps
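I am trying to join 2 tables of user events. I want to join table_a with table_b on user_id (uid), and only when the difference between their timestamps is less than 5 s (5000 ms).
Here is what I am doing: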
table_a = (
    table_a
    .join(
        table_b,
        table_a.uid == table_b.uid
        & abs(table_b.b_timestamp - table_a.a_timestamp) < 5000
        & table_a.a_timestamp.isNotNull()
        ,
        how = 'left'
    )
)
I am getting 2 errors:
Error 1) ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Error 2) If I remove the second condition from the join and keep only the first and third: org.apache.spark.sql.AnalysisException: cannot resolve '(uid AND (a_timestamp IS NOT NULL))' due to data type mismatch: differing types in '(uid AND (a_timestamp IS NOT NULL))' (string and boolean).;;
Any help is much appreciated!
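You just need to put parentheses around each filter condition. In Python, & binds more tightly than comparison operators such as == and <, so without parentheses the & is applied before the comparisons. For example, the following works: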
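from pyspark.sql import SparkSession
# Note: this shadows Python's built-in abs with the Column version.
from pyspark.sql.functions import abs

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    (1, 20),
    (1, 21),
    (1, 25),
    (1, 30),
    (2, 21),
], ['id', 'val'])
df2 = spark.createDataFrame([
    (1, 21),
    (2, 30),
], ['id', 'val'])
# Each condition is wrapped in parentheses before being combined with &.
df1.join(
    df2,
    (df1.id == df2.id)
    & (abs(df1.val - df2.val) < 5)
).show()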
# +---+---+---+---+
# | id|val| id|val|
# +---+---+---+---+
# |  1| 20|  1| 21|
# |  1| 21|  1| 21|
# |  1| 25|  1| 21|
# +---+---+---+---+
But without the parentheses:
df1.join(
    df2,
    df1.id == df2.id
    & abs(df1.val - df2.val) < 5
).show()
# ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
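To see why both errors in the question come from operator precedence, here is a small sketch that uses only Python's standard ast module (no Spark needed) to print how the parser groups each expression. The names a_uid, b_uid, diff and not_null are placeholders I made up to stand in for table_a.uid, table_b.uid, the abs(...) timestamp difference and the isNotNull() check; grouped and show are helper names for this illustration only.

```python
import ast

# Symbols for the only operators this sketch needs to render.
SYMBOLS = {ast.Eq: "==", ast.Lt: "<", ast.BitAnd: "&"}


def grouped(node: ast.AST) -> str:
    """Render an expression with explicit parentheses around every operation."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp):
        op = SYMBOLS[type(node.op)]
        return f"({grouped(node.left)} {op} {grouped(node.right)})"
    if isinstance(node, ast.Compare):
        # A Compare node may chain several comparisons, e.g. a == b < c.
        parts = [grouped(node.left)]
        for op, right in zip(node.ops, node.comparators):
            parts += [SYMBOLS[type(op)], grouped(right)]
        return "(" + " ".join(parts) + ")"
    raise TypeError(f"unsupported node: {type(node).__name__}")


def show(expr: str) -> str:
    """Parse `expr` and return it with the parser's grouping made explicit."""
    return grouped(ast.parse(expr, mode="eval").body)


# Shape of the original join condition (Error 1).
three_conditions = show("a_uid == b_uid & diff < 5000 & not_null")
# Shape of the condition after dropping the timestamp check (Error 2).
two_conditions = show("a_uid == b_uid & not_null")
# Shape of the condition with parentheses added.
fixed = show("(a_uid == b_uid) & (diff < 5000) & not_null")

print(three_conditions)  # (a_uid == (b_uid & diff) < (5000 & not_null))
print(two_conditions)    # (a_uid == (b_uid & not_null))
print(fixed)             # (((a_uid == b_uid) & (diff < 5000)) & not_null)
```

The first result is a chained comparison, which Python evaluates as "x == y and y < z"; that implicit "and" tries to convert a Column to bool and raises the ValueError. The second result has no chained comparison, so Python accepts it, but Spark is then asked to AND a string column (uid) with a boolean, which matches the "(string and boolean)" mismatch reported in Error 2. With parentheses around each comparison, & combines boolean columns as intended.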