Generate subsample based on age using PySpark
I want to collect a sample based on age, with a condition on the failure status. I am interested in serial numbers that are 3 days old. However, I don't need healthy serial numbers that are less than 3 days old, but I do want to include all failed serial numbers that are less than or exactly 3 days old. For example, C failed on January 3rd, so I need to include January 1st and 2nd for serial C in my new sample. Serial D failed on January 4th, so I need the January 3rd, 2nd, and 1st data for D. For A and B, I need the January 5th, 4th, and 3rd data, i.e. 3 days in total. I don't need E and F, as they are healthy observations younger than 3 days. In summary, I need failed samples with the 3 days before the actual failure, plus the most recent 3 days of healthy observations.
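To make the rule concrete, here is a plain-Python sketch of the selection I am after, on toy data (`select_sample` is a hypothetical helper for illustration, not part of my dataset):

```python
from datetime import date, timedelta

def select_sample(observations):
    """observations: {serial: [(day, failure_flag), ...]} -> {serial: [kept days]}."""
    kept = {}
    for serial, rows in observations.items():
        rows = sorted(rows)
        if any(flag for _, flag in rows):
            # failed serial: keep the failure day plus the 3 days before it
            fail_day = max(d for d, flag in rows if flag)
            window = [d for d, _ in rows if fail_day - timedelta(days=3) <= d <= fail_day]
        else:
            if len(rows) < 3:
                continue  # healthy serial younger than 3 days: drop it entirely
            window = [d for d, _ in rows[-3:]]  # most recent 3 healthy days
        kept[serial] = window
    return kept

obs = {
    "A": [(date(2022, 1, d), 0) for d in range(1, 6)],   # healthy, 5 days
    "C": [(date(2022, 1, 1), 0), (date(2022, 1, 2), 0),
          (date(2022, 1, 3), 1)],                        # fails on Jan 3
    "E": [(date(2022, 1, 4), 0), (date(2022, 1, 5), 0)], # healthy, only 2 days
}
print(select_sample(obs))
# A keeps Jan 3-5, C keeps Jan 1-3, E is dropped
```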
url = "https://gist.githubusercontent.com/JishanAhmed2019/6625009b71ade22493c256e77e1fdaf3/raw/8b51625b76a06f7d5c76b81a116ded8f9f790820/FailureSample.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("FailureSample.csv"), header=True, sep='\t')
Current format:
Expected sample:
The description is a bit tricky for me, and I am not sure if I understood it correctly. I tried to do it with window functions and was able to get similar results, but I am not sure if this code is good enough.
import pyspark.sql.functions as F
from pyspark.sql import Window
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
inputDf = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/****.com/Observations_stackoverflow.txt",
    header=True,
    sep="\t",
).withColumn("date", F.to_date("date", "MM/dd/yyyy"))
window = Window.partitionBy("serial_number")
aggregatedDf = (
    inputDf.withColumn("max_date", F.max("date").over(window))
    .withColumn("min_date", F.min(F.col("date")).over(window))
    .withColumn("withFailure", F.max(F.col("Failure")).over(window))
)
# If the serial has a failure, keep up to the last 4 records (failure day + 3 prior days)
# If it stays healthy, skip serials observed for too short a span and keep up to 3 records
filteredDf = aggregatedDf.filter(
    ((F.col("withFailure") == "1") & (F.datediff(F.col("max_date"), F.col("date")) < 4))
    | (
        (F.col("withFailure") == "0")
        & (F.datediff(F.col("max_date"), F.col("min_date")) > 3)
        & (F.datediff(F.col("max_date"), F.col("date")) < 3)
    )
)
filteredDf.drop("max_date", "min_date", "withFailure").show()
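As a quick sanity check on the thresholds above, here is a plain-Python stand-in for F.datediff (which returns the whole number of days from the second date to the first): for a failed serial, the condition datediff(max_date, date) < 4 keeps the failure day plus the 3 days before it, i.e. up to 4 records.

```python
from datetime import date

def datediff(end, start):
    # mirrors pyspark.sql.functions.datediff: whole days from start to end
    return (end - start).days

max_date = date(2022, 1, 4)            # e.g. serial D fails on Jan 4
kept = [d for d in (date(2022, 1, n) for n in range(1, 5))
        if datediff(max_date, d) < 4]  # Jan 1 gives datediff 3, still < 4
print(kept)
# all four days, Jan 1 through Jan 4, satisfy the condition
```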
Output is:
+----------+-------------+---------+---------+-------+
| date|serial_number|Feature 1|Feature 2|Failure|
+----------+-------------+---------+---------+-------+
|2022-01-03| A| 171| 76| 0|
|2022-01-04| A| 241| 100| 0|
|2022-01-05| A| 311| 124| 0|
|2022-01-03| B| 188| 82| 0|
|2022-01-04| B| 258| 106| 0|
|2022-01-05| B| 328| 130| 0|
|2022-01-01| C| 83| 10| 0|
|2022-01-02| C| 136| 64| 0|
|2022-01-03| C| 223| 94| 1|
|2022-01-01| D| 80| 47| 0|
|2022-01-02| D| 153| 70| 0|
|2022-01-03| D| 206| 88| 0|
|2022-01-04| D| 293| 118| 1|
+----------+-------------+---------+---------+-------+
In the output only the ordering is different; the records are the same as in the sample output (adding an explicit orderBy("serial_number", "date") before show() would reproduce the sample order exactly).