
Filtering DynamicFrame with AWS Glue or PySpark

I have a table in my AWS Glue Data Catalog called 'mytable'. The table lives in an on-premises Oracle database and is reached through the connection 'mydb'.

I'd like to filter the resulting DynamicFrame to only the rows where the X_DATETIME_INSERT column (which is a timestamp) is greater than a certain time (in this case, '2018-05-07 04:00:00'). Afterwards, I count the rows to check that the count is low (the table has about 40,000 rows, but only a few should meet the filter criteria).

Here is my current code:

import boto3
from datetime import datetime
import logging
import os
import pg8000
import pytz
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from base64 import b64decode
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydb", table_name = "mytable", transformation_ctx = "datasource0")

# Try Glue native filtering    
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["X_DATETIME_INSERT"] > '2018-05-07 04:00:00')
filtered_df.count()

This code runs for 20 minutes and then times out. I've tried other variations:

df = datasource0.toDF()
df.where(df.X_DATETIME_INSERT > '2018-05-07 04:00:00').collect()

And

df.filter(df["X_DATETIME_INSERT"].gt(lit("'2018-05-07 04:00:00'")))

Both of these failed as well. What am I doing wrong? I'm experienced in Python but new to Glue and PySpark.

AWS Glue loads the entire dataset from your JDBC source into a temporary S3 folder and applies the filter afterwards. If your data were in S3 instead of Oracle and partitioned by some keys (e.g. /year/month/day), you could use the push-down predicate feature to load only a subset of the data:

val partitionPredicate = s"to_date(concat(year, '-', month, '-', day)) BETWEEN '${fromDate}' AND '${toDate}'"

val df = glueContext.getCatalogSource(
   database = "githubarchive_month",
   tableName = "data",
   pushDownPredicate = partitionPredicate).getDynamicFrame()
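Since the question uses PySpark, here is a rough Python counterpart of the Scala snippet above. It is only a sketch: the helper names build_partition_predicate and load_partition_subset are made up for illustration, the database and table names are copied from the Scala example, and it assumes the catalog table is partitioned by year, month and day. The push_down_predicate argument of create_dynamic_frame.from_catalog is the Python spelling of pushDownPredicate.

```python
from datetime import date


def build_partition_predicate(from_date: date, to_date: date) -> str:
    """Build the same partition predicate string as the Scala snippet."""
    return (
        "to_date(concat(year, '-', month, '-', day)) "
        f"BETWEEN '{from_date.isoformat()}' AND '{to_date.isoformat()}'"
    )


def load_partition_subset(glue_context, from_date: date, to_date: date):
    """Load only the S3 partitions that satisfy the predicate.

    glue_context is an awsglue.context.GlueContext, created as in the question.
    The database and table names are the ones from the Scala example.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database="githubarchive_month",
        table_name="data",
        push_down_predicate=build_partition_predicate(from_date, to_date),
    )
```

Inside a Glue job this would be called as load_partition_subset(glueContext, date(2018, 5, 1), date(2018, 5, 7)), so that only the matching partitions are read from S3.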

Unfortunately, push-down predicates don't work for JDBC data sources yet.
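For a JDBC source such as the Oracle table in the question, one possible workaround is to skip the catalog reader and use Spark's own JDBC reader, passing a filtered subquery as the dbtable option so that Oracle evaluates the WHERE clause. This is a sketch under assumptions, not something the answer above confirms for Glue: the helper names, the JDBC URL and the credentials are hypothetical, and it assumes the job can reach the database and has an Oracle JDBC driver on its classpath.

```python
from datetime import datetime


def build_filtered_dbtable(table: str, column: str, cutoff: datetime) -> str:
    """Wrap a filtered SELECT so it can be passed as the JDBC 'dbtable' option."""
    literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    # Oracle writes table aliases without AS; TIMESTAMP '...' is an Oracle timestamp literal.
    return f"(SELECT * FROM {table} WHERE {column} > TIMESTAMP '{literal}') filtered"


def read_filtered(spark, jdbc_url: str, user: str, password: str, cutoff: datetime):
    """Read only the rows newer than cutoff; the filter runs inside Oracle.

    spark is a SparkSession (glueContext.spark_session in the question).
    jdbc_url, user and password are placeholders supplied by the caller.
    """
    return (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", build_filtered_dbtable("mytable", "X_DATETIME_INSERT", cutoff))
        .option("user", user)
        .option("password", password)
        .option("driver", "oracle.jdbc.OracleDriver")
        .load()
    )
```

Calling read_filtered(spark, jdbc_url, user, password, datetime(2018, 5, 7, 4, 0, 0)).count() should then transfer only the handful of matching rows instead of all 40,000.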
