Extract data from a dataframe in pyspark
I am trying to create a script in pyspark which takes the min and max dates from a table, stores them in a dataframe, splits those two values into two variables, and then uses those variables as a time range in another query. My problem is that dates is a dataframe like this:
+--------+--------+
| maxDate| minDate|
+--------+--------+
|20210701|20210629|
+--------+--------+
And I want only the values of maxDate and minDate.
I tried
dates.iloc[0]
and
var1 = dates['maxDate'].values[0]
but neither worked.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.functions import trim
from datetime import datetime

current_timestamp = datetime.strftime(datetime.now(), "%Y%m%d%H%M")

spark = SparkSession.builder.appName("testing") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("hive.exec.compress.output", "false") \
    .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false") \
    .config("spark.debug.maxToStringFields", 1000) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("set max_row_size = 6mb")

dates = spark.sql("SELECT MAX(date) as maxDate, MIN(date) as minDate FROM db.table")

# dates must be split here into two separate vars
result = spark.sql("select * from db.table_2 where date between {} and {}".format(var1, var2))
You can do it like below:

max_date = df.collect()[0][0]
min_date = df.collect()[0][1]
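The two scalars can then be formatted into the second query. A minimal sketch of that step: `dates.collect()` returns a list of Row objects, and a Row can be indexed by column name, which is less error-prone than positional `[0]`/`[1]`. A plain dict stands in for the collected Row here so the snippet runs without a Spark session.

```python
# Stand-in for dates.collect()[0]; a real pyspark Row supports the
# same row["maxDate"] access by column name.
row = {"maxDate": "20210701", "minDate": "20210629"}

max_date = row["maxDate"]
min_date = row["minDate"]

# BETWEEN expects the lower bound first, so minDate goes before maxDate.
query = "select * from db.table_2 where date between {} and {}".format(min_date, max_date)
print(query)
```

Note that the question's original `format(var1, var2)` put the max date first; with BETWEEN the min date must come first or the range matches nothing.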