Using Prophet or Auto ARIMA with Ray

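I couldn't find a clear answer about Ray. Ray is a distributed framework for data processing and training. To make it work in a distributed fashion, one has to use Modin or some other distributed data analysis tool supported by Ray, so that the data can flow across the whole cluster. But what if I want to use a model like Facebook's Prophet or ARIMA, which takes a pandas DataFrame as input? When I pass a pandas DataFrame as an argument to a model function, will it only run on a single node, or is there a workaround to make it run on a cluster?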

Ray is able to train models with pandas DataFrames as input!

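Currently, ARIMA requires a small workaround, because it typically uses the statsmodels library behind the scenes. To make sure the models are serialized correctly, an extra pickle step is needed. Ray may eliminate the need for the pickle workaround in the future.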

See the explanation of the pickle workaround: https://alkaline-ml.com/pmdarima/1.0.0/serialization.html
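
The shape of that workaround can be sketched with the standard library alone: the training function returns the fitted model as bytes, and the inference function turns the bytes back into a model before predicting. StandInModel, train() and infer() below are hypothetical stand-ins for illustration and are not part of pmdarima or Ray.

```python
import pickle


class StandInModel:
    """Hypothetical stand-in for a fitted model; not part of pmdarima."""

    def __init__(self, last_value: float):
        self.last_value = last_value

    def predict(self, n_periods: int) -> list:
        # Naive forecast: repeat the last observed value.
        return [self.last_value] * n_periods


def train(values: list) -> bytes:
    # "Fit" the stand-in model, then hand it back as pickled bytes.
    model = StandInModel(last_value=values[-1])
    return pickle.dumps(model)


def infer(model_bytes: bytes, n_periods: int) -> list:
    # Rebuild the model from bytes before using it.
    model = pickle.loads(model_bytes)
    return model.predict(n_periods)


model_bytes = train([3.0, 5.0, 8.0])
forecast = infer(model_bytes, n_periods=2)
print(forecast)
```

Because only bytes cross the function boundary, whatever moves the value between processes never has to serialize the model object itself.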

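Here is a code excerpt for python 3.8 and ray 1.8. Note that the inputs to the train_model() and inference_model() functions are pandas DataFrames. The extra pickle step is embedded in those functions. https://gist.github.com/christy/fd0c00409f28f5db8824be447e7a393f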

import pickle  # needed for the pickle.dumps / pickle.loads calls below

import ray
import pandas as pd
import pmdarima as pm
from pmdarima.model_selection import train_test_split

# read 8 months of clean, aggregated monthly taxi data
filename = "https://github.com/christy/MachineLearningTools/blob/master/data/clean_taxi_monthly.parquet?raw=true"
g_month = pd.read_parquet(filename) 

# Define a train_model function, default train on 6 months, inference 2
def train_model(theDF:pd.DataFrame, item_col:str
                , item_value:str, target_col:str
                , train_size:int=6) -> list:

    # split data into train/test
    train, test = train_test_split(theDF.loc[(theDF[item_col]==item_value), :], train_size=train_size)
    
    # train and fit auto.arima model
    model = pm.auto_arima(y=train[target_col]
                          ,X=train.loc[:, (train.columns!=target_col) 
                                          & (train.columns!=item_col)]
                         )
    # here is the extra pickle step to handle arima's statsmodel objects
    return [train, test, pickle.dumps(model)]


# Define inference_model function
def inference_model(model_pickle:bytes, test:pd.DataFrame
                    , timestamp_col:str, item_col:str, target_col:str) -> pd.DataFrame:

    # unpickle the model - shouldn't need this except for statsmodel objects
    model = pickle.loads(model_pickle)
    
    # inference on test data
    predictions = model.predict(n_periods=test.shape[0]
                         , X=test.loc[:, (test.columns!=target_col) & (test.columns!=item_col)])
    # the index belongs to the DataFrame constructor, not to predict()
    forecast = pd.DataFrame(list(predictions), index=test.index)
    
    return forecast


# start-up ray on your laptop for testing purposes
NUM_CPU = 2
ray.init(
    ignore_reinit_error=True
    , num_cpus = NUM_CPU
)

###########
# run your training as distributed jobs by using ray remote function calls
###########
    
# Convert your regular python functions to ray remote functions
train_model_remote = ray.remote(train_model).options(num_returns=3)  
inference_model_remote = ray.remote(inference_model)
    
# Train every model
item_list = list(g_month['pulocationid'].unique())
model = []
train = []
test = []

for p,v in enumerate(item_list):
    # non-blocking call: returns object references right away
    temp_train, temp_test, temp_model = \
        train_model_remote.remote(g_month
                                  , item_col='pulocationid', item_value=v
                                  , target_col='trip_quantity'
                                  , train_size=6)
    train.append(temp_train)
    test.append(temp_test)
    model.append(temp_model)

# Inference every test dataset
result=[]
for p,v in enumerate(item_list):
    # non-blocking call: returns an object reference right away
    result.append(inference_model_remote.remote(model[p], test[p]
                                                , timestamp_col='pickup_monthly'
                                                , item_col='pulocationid'
                                                , target_col='trip_quantity'))

# ray blocking step, to get the forecasts
forecast = ray.get(result)
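
Once ray.get() returns, forecast is a plain Python list with one DataFrame per entry of item_list, in the same order. As a small sketch of one way to work with it, the frames can be stacked into a single DataFrame keyed by item. The toy frames, the location ids and the "forecast" column name below are assumptions standing in for the real results.

```python
import pandas as pd

# Hypothetical item ids and per-item forecasts (stand-ins for ray.get output).
item_list = [79, 132]
forecast = [
    pd.DataFrame({"forecast": [10.0, 20.0]}, index=[0, 1]),
    pd.DataFrame({"forecast": [30.0, 40.0]}, index=[0, 1]),
]

# Stack the per-item frames; the item id becomes the outer index level.
combined = pd.concat(forecast, keys=item_list,
                     names=["pulocationid", "period"])
print(combined)
```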
