
Google Cloud - What products for time series data cleaning?

I have around 20TB of time series data stored in BigQuery.

The current pipeline I have is:

raw data in BigQuery => joins in BigQuery to create more BigQuery datasets => export them to Cloud Storage buckets

Then I download a subset of the files in the bucket:

I work on interpolation/resampling of the data using Python/SFrame, because some of the time series have missing timestamps and are not evenly sampled.
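For reference, the local step I'm doing looks roughly like this (shown here with pandas rather than SFrame; the timestamps and values are made up):

```python
import pandas as pd

# Hypothetical irregular series: missing timestamps, uneven spacing.
ts = pd.DataFrame(
    {"value": [1.0, 2.0, 5.0]},
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 00:03", "2020-01-01 00:07"]
    ),
)

# Resample onto a fixed 1-minute grid (rows with no data become NaN),
# then fill the gaps by time-weighted linear interpolation.
regular = ts.resample("1min").mean().interpolate(method="time")

print(regular)
```

This is fine per series, but running it serially over 20TB of exported files is what takes days.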

However, it takes a long time on a local PC, and I'm guessing it would take days to go through all 20TB of data.


Since the data are already in buckets, what would the best Google tools for interpolation and resampling be?

After resampling and interpolation I might use Facebook's Prophet or Auto ARIMA to create some forecasts. But that would be done locally.


There are a few Google services that seem like good options.

  1. Cloud Dataflow: I have no experience with Apache Beam, but it looks like the Apache Beam Python SDK is missing features compared to the Java version? I know how to write Java, but I'd like to use a single programming language for this task.

  2. Cloud Dataproc: I know how to write PySpark, but I don't really need real-time or stream processing. However, Spark can do time series interpolation, so this might be the only option?

  3. Cloud Dataprep: Looks like a GUI for cleaning data, but it's still in beta. I'm not sure whether it can do time series resampling/interpolation.

Does anyone have any idea which might best fit my use case?

Thanks

I would use PySpark on Dataproc, since Spark is not just for real-time/stream processing but also for batch processing.

You can choose the size of your cluster (and use some preemptible workers to save costs) and run the cluster only for the time you actually need to process this data. Afterwards, delete the cluster.

Spark also works very nicely with Python (not quite as nicely as Scala), but for all intents and purposes the main difference is performance, not reduced API functionality.

Even in batch processing you can use a WindowSpec for effective time series interpolation.

To be fair: I don't have a lot of experience with Dataflow or Dataprep, but that's because our use case is somewhat similar to yours and Dataproc works well for it.

Cloud Dataflow is a data processing service, Cloud Dataproc is a managed Spark and Hadoop service, and Cloud Dataprep is used to transform/clean raw data. All of them can be used to perform interpolation/resampling of data.

I would discard Cloud Dataprep. It might change in backward-incompatible ways because it is in beta. The main difference between Cloud Dataflow and Cloud Dataproc is the cluster management capability of the latter. If you do not expect a clear payoff from managing clusters yourself, Cloud Dataflow is the product that lets you set up the mentioned operations most easily.

The Apache Beam Java SDK is older than the Python SDK, since Apache Beam 1.x supported only Java. The 2.x releases support both languages with no apparent Python/Java difference.

You may find this Dataflow time series example in Java useful if you decide that Dataflow is the best suited option.
