
Pandas Dataframe generate variables using previous rows

I am attempting to generate more variables for my dataset. My data is stored in multiple files, and with pandas I can only read a single file at a time because of memory limitations. Each csv file holds the data for a single month and looks something like this:


Index     Date          Sender     Recipient     Quantity     Type
------------------------------------------------------------------------
79XT     26-03-19       Adam       Tiffany       72           Box
57ZY     14-03-19       Josh       Ross          13           Snack
29UQ     19-03-19       Adam       Alex          60           Fruit
56PY     06-03-19       Lucy       Alex          29           Book
41BR     28-03-19       Josh       Steve         33           Snack

Now I am trying to generate more features for each row based on the history of each sender, and join these features to the dataframe. For example:

Index   Date       Sender   Recipient   Quantity   Type    Days Since Prev   Days Since First   Cumulative   Quantity Increase    First Shipment
                                                           Shipment          Shipment           Quantity     from Prev Shipment   to This Recipient?
----------------------------------------------------------------------------------------------------------------------------------------------------
79XT    26-03-19   Adam     Tiffany     72         Box     7                 62                 1792         12                   0
57ZY    14-03-19   Josh     Ross        13         Snack   NaN               NaN                13           NaN                  1
29UQ    19-03-19   Adam     Alex        60         Fruit   5                 55                 1730         -7                   1
56PY    06-03-19   Lucy     Alex        29         Book    23                32                 88           -4                   0
41BR    28-03-19   Josh     Steve       33         Snack   14                14                 46           20                   1

As you can see from the desired dataframe above, the new variables are generated from each sender's previous observations. What is the least computationally expensive way of generating such features? I will need information from all of my monthly csv files to build this data. There are over 200,000 unique senders, so it would take weeks to read the csv files, produce a dataframe and a csv file for every unique sender, and merge this data back into the monthly csv files. I am aware of dask and dask.distributed, but I want to find out if there is a simpler way to implement what I am trying to do.

I see multiple sub-problems in your problem.

# Attach each sender's first shipment date as a new column.
first = df.groupby("Sender").agg(first_occurrence_date=("Date", "min"))
df = df.merge(first, on="Sender", how="left")
# Computationally this is likely inefficient, and it doesn't immediately solve the multiple-file issue.
  • Computationally efficient solutions: for fast reading, consider using .feather as an efficient storage format. The standard for this format changes, so always keep a .csv as backup. You can write a file as feather like this: df.to_feather("filename")

Consider factorizing your strings with pd.factorize() as described in the Pandas Docs: pd.Factorize() - I have not seen benchmarks on this, but comparing int is faster than comparing string.
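A quick sketch of what factorizing the sender column looks like (toy data, not the real dataset):

```python
import pandas as pd

senders = pd.Series(["Adam", "Josh", "Adam", "Lucy"])

# Replace repeated strings with small integer codes plus a lookup table;
# subsequent comparisons and groupbys then operate on ints.
codes, uniques = pd.factorize(senders)
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['Adam', 'Josh', 'Lucy'], ...)
```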

Lastly, consider setting up a small sqlite3 database that reads the individual files and stores them. Otherwise, computing the first occurrence is a pain, because you have to keep overwriting the old value and repeat a computationally expensive operation many times.
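A minimal sketch of that idea: append each monthly file to one sqlite table, then let SQL compute the per-sender first occurrence once. The table name and toy data are hypothetical, and an in-memory database stands in for a real file path like "shipments.db":

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # use a file path in practice

# Stand-in for one monthly csv; in the real pipeline you would loop
# over files and append each one with to_sql(..., if_exists="append").
monthly = pd.DataFrame({
    "Sender": ["Adam", "Josh", "Adam"],
    "Date": ["2019-03-26", "2019-03-14", "2019-03-19"],
})
monthly.to_sql("shipments", con, if_exists="append", index=False)

# One SQL pass replaces repeated overwrite-and-rescan in pandas.
first = pd.read_sql(
    "SELECT Sender, MIN(Date) AS first_shipment FROM shipments GROUP BY Sender",
    con,
)
con.close()
```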

Here I have a different approach. I'd try to:

  1. Convert all csv files to parquet (possibly see this answer), changing the dtypes. At least
df['Date'] = df['Date'].astype("M8")

or

df['Date'] = pd.to_datetime(df['Date'])
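Note that the sample dates look day-first (26-03-19), so passing an explicit format avoids silent misparsing; a small sketch with toy values:

```python
import pandas as pd

dates = pd.Series(["26-03-19", "06-03-19"])

# Without format=, pandas may read "06-03-19" as June 3rd; with it,
# both strings are parsed unambiguously as day-month-year.
parsed = pd.to_datetime(dates, format="%d-%m-%y")
```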
  2. Partition again by Sender. I'm assuming all parquet files are in a processed folder.
import dask.dataframe as dd
df = dd.read_parquet('processed')
df.to_parquet('processed2', partition_on='Sender')
  3. Now you have many files in every Sender=username partition; you should merge all of them into a single file.

  4. You can now create your function for every Sender=username

def fun(df):
    df = df.sort_values("Date")
    # Per-sender history features, computed on the date-sorted rows.
    df["Days Since Prev Shipment"] = df["Date"].diff().dt.days
    df["Days Since First Shipment"] = (df["Date"] - df["Date"].min()).dt.days
    df["Cumulative Quantity"] = df["Quantity"].cumsum()
    df["Quantity Difference"] = df["Quantity"].diff()
    # Flag rows that are the first shipment to that recipient.
    grp = df.groupby("Recipient")["Date"].min().reset_index(name="First Shipment")
    df = pd.merge(df, grp, how="left", on="Recipient")
    df["First Shipment"] = (df["Date"] == df["First Shipment"]).astype("int8")
    return df
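To tie it together, a function like this can be applied per sender with a plain pandas groupby. A condensed, self-contained sketch with toy data and a shortened stand-in for the full function above:

```python
import pandas as pd

# Toy frame with two senders, mirroring the question's columns.
df = pd.DataFrame({
    "Sender": ["Adam", "Adam", "Josh"],
    "Date": pd.to_datetime(["2019-03-19", "2019-03-26", "2019-03-14"]),
    "Quantity": [60, 72, 13],
})

def add_history_features(g):
    # Condensed stand-in for fun(): two of the per-sender features.
    g = g.sort_values("Date")
    g["Days Since Prev Shipment"] = g["Date"].diff().dt.days
    g["Cumulative Quantity"] = g["Quantity"].cumsum()
    return g

# group_keys=False keeps the original flat index layout.
out = df.groupby("Sender", group_keys=False).apply(add_history_features)
```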
