
Generate subsample based on age using PySpark

I wanted to collect a sample based on age, with a condition on the failure status. I am interested in serial numbers that are 3 days old. However, I don't need healthy serial numbers that are less than 3 days old, but I want to include all failed serial numbers that are less than or exactly 3 days old. For example, C failed on January 3rd, so I need to include January 1st and 2nd for serial C in my new sample. Serial D failed on January 4th, so I need the January 3rd, 2nd, and 1st data for D. For A and B, I need the January 5th, 4th, and 3rd data, that is 3 days in total. I don't need E and F, as they are healthy observations younger than 3 days. In summary, I need failed samples with the 3 days before the actual failure, plus the most recent 3 days of healthy observations.

url="https://gist.githubusercontent.com/JishanAhmed2019/6625009b71ade22493c256e77e1fdaf3/raw/8b51625b76a06f7d5c76b81a116ded8f9f790820/FailureSample.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("FailureSample.csv"), header=True,sep='\t')
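
Note that spark.read.csv without a schema reads every column as a string. A minimal sketch of the casts needed before any date arithmetic (assuming the date column is formatted MM/dd/yyyy, as in the answer below):

import pyspark.sql.functions as F

# Without an explicit schema every column arrives as a string; cast the
# columns used for filtering. Assumes dates are formatted MM/dd/yyyy.
df = (
    df.withColumn("date", F.to_date("date", "MM/dd/yyyy"))
    .withColumn("Failure", F.col("Failure").cast("int"))
)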

Current format:

[image: the current dataframe]

Expected sample:

[image: the expected subsample]

For me the description is a bit tricky, and I am not sure if I understood it correctly.

I tried to do it with window functions and I was able to get similar results, but I am not sure if this code is good enough:

import pyspark.sql.functions as F
from pyspark.sql import Window

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

inputDf = spark.read.csv(
    "dbfs:/FileStore/shared_uploads/****.com/Observations_stackoverflow.txt",
    header=True,
    sep="\t",
).withColumn("date", F.to_date("date", "MM/dd/yyyy"))

# Per serial number, compute the overall min/max date and whether it ever failed.
# "Failure" is read as a string, so F.max returns "1" if any row failed.
window = Window.partitionBy("serial_number")
aggregatedDf = (
    inputDf.withColumn("max_date", F.max("date").over(window))
    .withColumn("min_date", F.min("date").over(window))
    .withColumn("withFailure", F.max("Failure").over(window))
)

# If the serial ever failed: keep up to the last 4 records (the failure plus 3 earlier days).
# If healthy: skip serials whose date span is 3 days or less and keep the most recent 3 days.
filteredDf = aggregatedDf.filter(
    ((F.col("withFailure") == "1") & (F.datediff(F.col("max_date"), F.col("date")) < 4))
    | (
        (F.col("withFailure") == "0")
        & (F.datediff(F.col("max_date"), F.col("min_date")) > 3)
        & (F.datediff(F.col("max_date"), F.col("date")) < 3)
    )
)

filteredDf.drop("max_date", "min_date", "withFailure").show()
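
If the dates can contain gaps, a row-count variant of the same filter might be safer. A sketch (my assumption, not part of the solution above: "age" is counted in observed rows rather than calendar days; the thresholds mirror the filter above and may need tuning):

# Rank each serial's rows from newest (rank 1) to oldest.
orderedWindow = Window.partitionBy("serial_number").orderBy(F.col("date").desc())
countWindow = Window.partitionBy("serial_number")

rankedDf = (
    aggregatedDf.withColumn("rank", F.row_number().over(orderedWindow))
    .withColumn("n_obs", F.count("*").over(countWindow))
)

alternativeDf = rankedDf.filter(
    # Failed serials: the failure row plus up to 3 older observations.
    ((F.col("withFailure") == "1") & (F.col("rank") <= 4))
    # Healthy serials: require more than 3 observations, keep the newest 3.
    | ((F.col("withFailure") == "0") & (F.col("n_obs") > 3) & (F.col("rank") <= 3))
).drop("rank", "n_obs")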

Output is:

+----------+-------------+---------+---------+-------+
|      date|serial_number|Feature 1|Feature 2|Failure|
+----------+-------------+---------+---------+-------+
|2022-01-03|            A|      171|       76|      0|
|2022-01-04|            A|      241|      100|      0|
|2022-01-05|            A|      311|      124|      0|
|2022-01-03|            B|      188|       82|      0|
|2022-01-04|            B|      258|      106|      0|
|2022-01-05|            B|      328|      130|      0|
|2022-01-01|            C|       83|       10|      0|
|2022-01-02|            C|      136|       64|      0|
|2022-01-03|            C|      223|       94|      1|
|2022-01-01|            D|       80|       47|      0|
|2022-01-02|            D|      153|       70|      0|
|2022-01-03|            D|      206|       88|      0|
|2022-01-04|            D|      293|      118|      1|
+----------+-------------+---------+---------+-------+

In the output only the ordering is different; the records are the same as in the expected sample.
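
If the row order should match the expected sample exactly, an explicit sort does it (assuming the sample is ordered by serial number and date):

# Sort explicitly; without orderBy, row order after a shuffle is not guaranteed.
filteredDf.drop("max_date", "min_date", "withFailure").orderBy(
    "serial_number", "date"
).show()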
