Writing timestamp to Postgres with Pyspark

I'm working on a Spark script in Python (using PySpark). I have a function that returns a Row with some fields, including

timestamp=datetime.strptime(processed_data[1], DATI_REGEX)

processed_data[1] is a valid datetime string.

Edit to show complete code:

DATI_REGEX = "%Y-%m-%dT%H:%M:%S"

class UserActivity(object):
    def __init__(self, user, rows):
        self.user = int(user)
        self.rows = sorted(rows, key=operator.attrgetter('timestamp'))

    def write(self):
        return Row(
            user=self.user,
            timestamp=self.rows[-1].timestamp,
        )

def parse_log_line(logline):
    try:
        entries = logline.split('\\t')
        processed_data = entries[0].split('\t') + entries[1:]

        return Row(
            ip_address=processed_data[9],
            user=int(processed_data[10]),
            timestamp=datetime.strptime(processed_data[1], DATI_REGEX),
        )
    except (IndexError, ValueError):
        return None


log_file = sc.textFile(...)
rows = (log_file.map(parse_log_line).filter(None)
        .filter(lambda x: current_day <= x.timestamp < next_day))
user_rows = rows.map(lambda x: (x.user, x)).groupByKey()
user_dailies = user_rows.map(lambda x: UserActivity(x[0], x[1]).write())

The problem comes when I try to write that to a PostgreSQL DB, doing the following:

fields = [
    StructField("user_id", IntegerType(), False),
    StructField("timestamp", TimestampType(), False),
]
schema = StructType(fields)
user_dailies_schema = SQLContext(sc).createDataFrame(user_dailies, schema)
user_dailies_schema.write.jdbc(
    "jdbc:postgresql:.......",
    "tablename")

I get the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 576, in toInternal
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 576, in <genexpr>
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 436, in toInternal
    return self.dataType.toInternal(obj)
  File "/Users/pau/Downloads/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 190, in toInternal
    seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
AttributeError: 'int' object has no attribute 'tzinfo'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Any idea on how to solve that?

The problem is relatively simple. A PySpark Row is a tuple whose fields are ordered by name. It means that when you create:

Row(user=self.user, timestamp=self.rows[-1].timestamp)

the output structure is ordered as:

Row(timestamp, user)
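A quick way to confirm this ordering (in Spark 2.x, where keyword-constructed Rows are sorted alphabetically; the values below are made up purely for illustration):

from datetime import datetime
from pyspark.sql import Row

# Keyword arguments are sorted by name in Spark 2.x, so the resulting
# field order is (timestamp, user), regardless of how they were written.
r = Row(user=1, timestamp=datetime(2017, 1, 1))
print(r.__fields__)  # ['timestamp', 'user']
print(r)             # Row(timestamp=datetime.datetime(2017, 1, 1, 0, 0), user=1)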

A StructType, on the other hand, keeps the order in which its fields are defined. As a result, your code tries to use the user id as a timestamp. You should either return a plain tuple:

class UserActivity(object):
    ...
    def write(self):
        return (self.user, self.rows[-1].timestamp)

or use a lexicographically ordered schema:

schema = StructType(sorted(fields, key=operator.attrgetter("name")))

Finally, you can use namedtuple to get both attribute access and a predefined order.
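For example, a minimal sketch of the namedtuple variant (the UserDaily name is made up for illustration, and operator is assumed to be imported as in the original code):

import operator
from collections import namedtuple

# The field order is positional and fixed, so it can be made to match the
# StructType exactly while still allowing attribute-style access.
UserDaily = namedtuple("UserDaily", ["user_id", "timestamp"])

class UserActivity(object):
    def __init__(self, user, rows):
        self.user = int(user)
        self.rows = sorted(rows, key=operator.attrgetter("timestamp"))

    def write(self):
        return UserDaily(self.user, self.rows[-1].timestamp)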

On a side note, don't use groupByKey like this. It is a typical case where one would use reduceByKey instead:

(log_file.map(parse_log_line)
    .map(operator.attrgetter("user", "timestamp"))
    .reduceByKey(max))

with multiple fields:

from functools import partial

(log_file.map(parse_log_line)
    .map(lambda x: (x.user, x))
    .reduceByKey(partial(max, key=operator.itemgetter("timestamp")))
    .values())

or DataFrame aggregations:

from pyspark.sql import functions as f

(sqlContext
    .createDataFrame(
        log_file.map(parse_log_line)
          # Another way to handle ordering is to choose fields
          # before you call createDataFrame
          .map(operator.attrgetter("user", "timestamp")),
        schema)
    .groupBy("user_id")
    .agg(f.max("timestamp").alias("timestamp")))

Also, if you want to retrieve an SQLContext, you should use the factory method:

SQLContext.getOrCreate(sc)

Creating a new context the way you do can have unexpected side effects.
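Putting the pieces together, here is a minimal sketch of the full write path built on the DataFrame aggregation above (it assumes the schema, parse_log_line, and the pyspark.sql.functions import from the earlier snippets; the JDBC URL, table name, and credentials are placeholders, and mode="append" is just one possible choice):

sql_context = SQLContext.getOrCreate(sc)

user_dailies_df = sql_context.createDataFrame(
    log_file.map(parse_log_line)
        .filter(lambda x: x is not None)
        .map(operator.attrgetter("user", "timestamp")),
    schema)

daily_max = (user_dailies_df
    .groupBy("user_id")
    .agg(f.max("timestamp").alias("timestamp")))

# Placeholder connection details -- replace them with your own host,
# database, table, and credentials.
daily_max.write.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="user_dailies",
    mode="append",
    properties={"user": "postgres", "password": "..."})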
