Column based comparison between two tables in different databases using pyspark
I have two databases in Snowflake, DB1 & DB2. Data is migrated from DB1 to DB2, so the schemas and table names are the same.
Suppose DB1.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AN
3 AZ
4 AR
5 CA
6 AD
7 PN
8 AP
9 JH
10 TX
12 LA
and
suppose DB2.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AK
3 AZ
4 AR
5 AC
6 AD
7 GP
8 AP
9 JH
10 HA
Both tables also have an extra column, record_created_timestamp, but I account for that in the code. I wrote a pyspark script that runs as an AWS Glue job and performs a column-based comparison. I got help from this link: Generate a report of mismatch Columns between 2 Pyspark dataframes
My code is:
import sys
from pyspark.sql.session import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import concat, col, lit, to_timestamp, when
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from py4j.java_gateway import java_import
import os
from pyspark.sql.types import *
from pyspark.sql.functions import substring
from pyspark.sql.functions import array, count, first
import json
import datetime
import time
import boto3
from botocore.exceptions import ClientError
now = datetime.datetime.now()
year = now.strftime("%Y")
month = now.strftime("%m")
day = now.strftime("%d")
glueClient = boto3.client('glue')
ssmClient = boto3.client('ssm')
region = os.environ['AWS_DEFAULT_REGION']
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'CONNECTION_INFO', 'TABLE_NAME', 'BUCKET_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
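# fetch the Snowflake connection details from AWS Secrets Manager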
client = boto3.client("secretsmanager", region_name="us-east-1")
get_secret_value_response = client.get_secret_value(
    SecretId=args['CONNECTION_INFO']
)
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)
db_username = secret.get('db_username')
db_password = secret.get('db_password')
db_warehouse = secret.get('db_warehouse')
db_url = secret.get('db_url')
db_account = secret.get('db_account')
db_name = secret.get('db_name')
db_schema = secret.get('db_schema')
logger = glueContext.get_logger()
logger.info('Fetching configuration.')
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
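# enable Snowflake query pushdown so eligible SQL operations execute inside Snowflake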
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
    "sfURL": db_url,
    "sfAccount": db_account,
    "sfUser": db_username,
    "sfPassword": db_password,
    "sfSchema": db_schema,
    "sfDatabase": db_name,
    "sfWarehouse": db_warehouse
}
print(f'database: {db_name}')
print(f'db_warehouse: {db_warehouse}')
print(f'db_schema: {db_schema}')
print(f'db_account: {db_account}')
table_name = args['TABLE_NAME']
bucket_name = args['BUCKET_NAME']
MySql_1 = f"""
select * from DB1.SCHEMA_1.TABLE_1
"""
df = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_1).load()
df1 = df.drop('record_created_timestamp')
MySql_2 = f"""
select * from DB2.SCHEMA_1.TABLE_1
"""
df2 = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_2).load()
df3 = df.drop('record_created_timestamp')
# list of columns to be compared: everything except the join key (STATE_ID)
cols = df1.columns[1:]
# outer-join on the key, flag each column where the two sides differ,
# then unpivot with stack() into (Column_Name, mismatch) rows
df_new = (df1.join(df3, "state_id", "outer")
          .select([when(~df1[c].eqNullSafe(df3[c]), array(df1[c], df3[c])).alias(c) for c in cols])
          .selectExpr('stack({},{}) as (Column_Name, mismatch)'.format(len(cols), ','.join('"{0}",`{0}`'.format(c) for c in cols)))
          .filter('mismatch is not NULL'))
df_newv1 = df_new.selectExpr('Column_Name', 'mismatch[0] as Mismatch_In_DB1_Table', 'mismatch[1] as Mismatch_In_DB2_Table')
df_newv1.show()
SNOWFLAKE_SOURCE_NAME = "snowflake"
job.commit()
This gives me the correct output:
Column_Name Mismatch_In_DB1_Table Mismatch_In_DB2_Table
STATE AN AK
STATE CA AC
STATE PN GP
STATE TX HA
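For reference, the stack() call in selectExpr is what unpivots the per-column mismatch arrays into one row per column name. A minimal standalone sketch of that step, with made-up values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical one-row frame holding a DB1/DB2 value pair for a single column
demo = spark.createDataFrame([("AN", "AK")], ["db1_state", "db2_state"])

# stack(1, ...) emits one (Column_Name, mismatch) row; the real script builds
# the argument list dynamically, one "name, array" pair per compared column
demo.selectExpr(
    "stack(1, 'STATE', array(db1_state, db2_state)) as (Column_Name, mismatch)"
).show()
# -> Column_Name = STATE, mismatch = [AN, AK]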
However, if I do the outer join on STATE instead of STATE_ID
df_new = (df1.join(df2, "state", "outer")
it shows this error:
AnalysisException: 'Resolved attribute(s) STATE#1,STATE#9 missing from STATE#14,STATE_ID#0,STATE_ID#8 in operator !Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]. Attribute(s) with the same name appear in the operation: STATE,STATE. Please check if the right attribute(s) are used.;;
!Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]
+- Project [coalesce(STATE#1, STATE#9) AS STATE#14, STATE_ID#0, STATE_ID#8]
   +- Join FullOuter, (STATE#1 = STATE#9)
      :- Project [STATE_ID#0, STATE#1]
      :  +- Relation[STATE_ID#0,STATE#1,RECORD_CREATED_TIMESTAMP#2] SnowflakeRelation
      +- Relation[STATE_ID#8,STATE#9] SnowflakeRelation
I would appreciate an explanation of this, and I'd like to know whether there is a way to make it run even with STATE as the key. Alternatively, if there is some other code that produces the same output without this error, that would also help.
It seems Spark is getting confused by the column names of the two dfs. Try aliasing them to make sure they match:
df1 = df.drop('record_created_timestamp')\
        .select(df.STATE_ID.alias('state_id'), df.STATE.alias('state'))
df3 = df2.drop('record_created_timestamp')\
         .select(df2.STATE_ID.alias('state_id'), df2.STATE.alias('state'))
Also, make sure 'STATE_ID' doesn't contain spaces/special characters in the column name.
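As for why the error appears: joining on the bare column name "state" makes Spark coalesce the two STATE columns into one new attribute and drop the originals, so the later df1[c] / df3[c] references to STATE no longer resolve. One workaround (a sketch under the same setup, not tested against this exact Glue job) is to join on an explicit column expression so both STATE attributes survive the join:

from pyspark.sql.functions import when, array

# compare every column except the key; here the key is STATE instead of STATE_ID
cols = [c for c in df1.columns if c != 'STATE']

# joining on an expression (rather than the string "state") keeps df1's and
# df3's STATE attributes separately resolvable after the join
df_new = (df1.join(df3, df1['STATE'] == df3['STATE'], "outer")
          .select([when(~df1[c].eqNullSafe(df3[c]), array(df1[c], df3[c])).alias(c) for c in cols])
          .selectExpr('stack({},{}) as (Column_Name, mismatch)'.format(len(cols), ','.join('"{0}",`{0}`'.format(c) for c in cols)))
          .filter('mismatch is not NULL'))

The rest of the original script (the mismatch[0]/mismatch[1] projection and show()) should apply unchanged.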