
How to handle timestamp in Pyspark Structured Streaming

I'm trying to parse the datetime so that I can later group by certain hours in Structured Streaming.

Currently I have code like this:

distinct_table = service_table\
        .select(psf.col('crime_id'),
                psf.col('original_crime_type_name'),
                psf.to_timestamp(psf.col('call_date_time')).alias('call_datetime'),
                psf.col('address'),
                psf.col('disposition'))

Which gives this output in the console:

+---------+------------------------+-------------------+--------------------+------------+
| crime_id|original_crime_type_name|      call_datetime|             address| disposition|
+---------+------------------------+-------------------+--------------------+------------+
|183652852|                Burglary|2018-12-31 18:52:00|600 Block Of Mont...|         HAN|
|183652839|            Passing Call|2018-12-31 18:51:00|500 Block Of Clem...|         HAN|
|183652841|                  22500e|2018-12-31 18:51:00|2600 Block Of Ale...|         CIT|

When I try to apply this UDF to convert the timestamp (the call_datetime column):

import pyspark.sql.functions as psf
from pyspark.sql.types import StringType
from dateutil.parser import parse as parse_date

@psf.udf(StringType())
def udf_convert_time(timestamp):
    d = parse_date(timestamp)
    return str(d.strftime('%y%m%d%H'))

I get a NoneType error:

File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 149, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/Users/dev/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 74, in <lambda>
    return lambda *a: f(*a)
  File "/Users/PycharmProjects/data-streaming-project/solution/streaming/data_stream.py", line 29, in udf_convert_time
    d = parse_date(timestamp)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 301, in parse
    res = self._parse(timestr, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 349, in _parse
    l = _timelex.split(timestr)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 143, in split
    return list(cls(s))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 137, in next
    token = self.get_token()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/dateutil/parser.py", line 68, in get_token
    nextchar = self.instream.read(1)
AttributeError: 'NoneType' object has no attribute 'read'

This is the query plan:

pyspark.sql.utils.StreamingQueryException: u'Writing job aborted.\n=== Streaming Query ===\nIdentifier: [id = 958a6a46-f718-49c4-999a-661fea2dc564, runId = fc9a7a78-c311-42b7-bbed-7718b4cc1150]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {KafkaSource[Subscribe[service-calls]]: {"service-calls":{"0":200}}}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [crime_id#25, original_crime_type_name#26, call_datetime#53, address#33, disposition#32, udf_convert_time(call_datetime#53) AS parsed_time#59]\n+- Project [crime_id#25, original_crime_type_name#26, to_timestamp(\'call_date_time, None) AS call_datetime#53, address#33, disposition#32]\n   +- Project [SERVICE_CALLS#23.crime_id AS crime_id#25, SERVICE_CALLS#23.original_crime_type_name AS original_crime_type_name#26, SERVICE_CALLS#23.report_date AS report_date#27, SERVICE_CALLS#23.call_date AS call_date#28, SERVICE_CALLS#23.offense_date AS offense_date#29, SERVICE_CALLS#23.call_time AS call_time#30, SERVICE_CALLS#23.call_date_time AS call_date_time#31, SERVICE_CALLS#23.disposition AS disposition#32, SERVICE_CALLS#23.address AS address#33, SERVICE_CALLS#23.city AS city#34, SERVICE_CALLS#23.state AS state#35, SERVICE_CALLS#23.agency_id AS agency_id#36, SERVICE_CALLS#23.address_type AS address_type#37, SERVICE_CALLS#23.common_location AS common_location#38]\n      +- Project [jsontostructs(StructField(crime_id,StringType,true), StructField(original_crime_type_name,StringType,true), StructField(report_date,StringType,true), StructField(call_date,StringType,true), StructField(offense_date,StringType,true), StructField(call_time,StringType,true), StructField(call_date_time,StringType,true), StructField(disposition,StringType,true), StructField(address,StringType,true), StructField(city,StringType,true), StructField(state,StringType,true), StructField(agency_id,StringType,true), StructField(address_type,StringType,true), StructField(common_location,StringType,true), value#21, Some(America/Los_Angeles)) AS SERVICE_CALLS#23]\n         +- Project [cast(value#8 as string) AS value#21]\n            +- StreamingExecutionRelation KafkaSource[Subscribe[service-calls]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'

I'm using StringType for all columns and to_timestamp for the timestamp column (which seems to work).

I verified that all the data I'm using (only about 100 rows) does have values. Any idea how to debug this?

EDIT

Input is coming from Kafka; the schema is shown above in the error log (all StringType()).
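For reference, the stream is set up roughly like this (a sketch reconstructed from the query plan above; the broker address is a placeholder and the schema is abbreviated to the fields used here, while the real one declares all 14 StringType fields shown in the log):

import pyspark.sql.functions as psf
import pyspark.sql.types as pst

# Abbreviated schema - the full one lists every StringType field from the log above
schema = pst.StructType([
    pst.StructField("crime_id", pst.StringType(), True),
    pst.StructField("original_crime_type_name", pst.StringType(), True),
    pst.StructField("call_date_time", pst.StringType(), True),
    pst.StructField("address", pst.StringType(), True),
    pst.StructField("disposition", pst.StringType(), True)])

# "localhost:9092" is a placeholder for the real broker address
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "service-calls") \
    .load()

# Kafka delivers the payload as bytes: cast to string, parse the JSON, then flatten
service_table = df \
    .selectExpr("CAST(value AS STRING) AS value") \
    .select(psf.from_json(psf.col("value"), schema).alias("SERVICE_CALLS")) \
    .select("SERVICE_CALLS.*")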

It's best not to use a UDF here, because UDFs bypass the Spark Catalyst optimizer, especially when the pyspark.sql.functions module already provides the functions you need. This code will transform your timestamp.

import pyspark.sql.functions as F
import pyspark.sql.types as T

rawData = [(183652852, "Burglary", "2018-12-31 18:52:00", "600 Block Of Mont", "HAN"),
           (183652839, "Passing Call", "2018-12-31 18:51:00", "500 Block Of Clem", "HAN"),
           (183652841, "22500e", "2018-12-31 18:51:00", "2600 Block Of Ale", "CIT")]

df = spark.createDataFrame(rawData).toDF("crime_id",\
                                         "original_crime_type_name",\
                                         "call_datetime",\
                                         "address",\
                                         "disposition")

# Patterns use Java SimpleDateFormat syntax: parse the full timestamp, then truncate to the hour
date_format_source = "yyyy-MM-dd HH:mm:ss"
date_format_target = "yyyy-MM-dd HH"

df.select("*")\
.withColumn("new_time_format",\
            F.from_unixtime(F.unix_timestamp(F.col("call_datetime"),\
                                             date_format_source),\
                            date_format_target)\
            .cast(T.TimestampType()))\
.withColumn("time_string", F.date_format(F.col("new_time_format"), "yyyyMMddHH"))\
.select("call_datetime", "new_time_format", "time_string")\
.show(truncate=True)

+-------------------+-------------------+-----------+
|      call_datetime|    new_time_format|time_string|
+-------------------+-------------------+-----------+
|2018-12-31 18:52:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
|2018-12-31 18:51:00|2018-12-31 18:00:00| 2018123118|
+-------------------+-------------------+-----------+
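The same built-in functions work on the streaming DataFrame, so the hourly grouping from the question can be done without a UDF. A minimal sketch, assuming the distinct_table from the question and a console sink:

# Bucket each call into its hour and count events per bucket and crime type
hourly_counts = distinct_table \
    .withColumn("hour_bucket", F.date_format(F.col("call_datetime"), "yyyyMMddHH")) \
    .groupBy("hour_bucket", "original_crime_type_name") \
    .count()

# Streaming aggregations need "complete" (or "update") output mode
query = hourly_counts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()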
