
Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.

Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the sampling documentation, but sampling only applies to the input of the Transformer tool, not to the output of the process.

You can do this, but it takes a bit of extra work in your recipe. Set up a formula in a new column using something like RANDBETWEEN to give each row a random integer between 1 and 1,000 (for this million-out-of-a-billion case). Then filter the rows to keep only those that match one value you pick in that range, and your output will contain only that randomized subset, roughly one row in a thousand. As the last step of the recipe, remove the temporary column.
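The idea can be sketched outside Dataprep in plain Python. This is only an illustration of the logic, not Wrangle syntax: the function name `sample_by_random_bucket` and its parameters are hypothetical, and `random.randint` stands in for RANDBETWEEN.

```python
import random

def sample_by_random_bucket(rows, buckets=1000, keep=1, seed=None):
    """Keep rows whose random bucket equals `keep` (about 1/buckets of the input)."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        # Temporary value per row, playing the role of the RANDBETWEEN column.
        bucket = rng.randint(1, buckets)  # inclusive on both ends
        if bucket == keep:
            # The bucket is never stored with the row, so there is nothing to drop later.
            kept.append(row)
    return kept

# Hypothetical small input: 100,000 rows instead of a billion.
subset = sample_by_random_bucket(range(100_000), buckets=1000, keep=1, seed=42)
print(len(subset))  # close to 100, the exact count depends on the seed
```

Any single value between 1 and 1,000 works as `keep`; what matters is that exactly one of the 1,000 possible values is retained.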

So indeed there are 2 approaches to this.

As Courtney Grimes said, you can use one of the two functions that generate a random number within a range:

  1. randbetween
  2. rand

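These methods can be used to slice an "even" portion of your data. As suggested, compute randbetween(1,1000), then keep only the rows where the result equals a single value of your choice (for example x = 1), because that is 1/1000 of the data (a million out of a billion). Note that a range such as 1<x<1000 would keep almost everything instead. The result is approximately, not exactly, one million rows.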
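To see how close to one million rows this kind of filter lands, the arithmetic can be checked with a short Python sketch. It assumes each row is kept independently with probability 1/1000, so the output size follows a binomial distribution; the helper name `expected_sample_size` is made up for illustration.

```python
import math

def expected_sample_size(total_rows, buckets):
    """Mean and standard deviation of the number of rows kept."""
    p = 1 / buckets                             # chance that a row lands in the kept bucket
    mean = total_rows * p                       # binomial mean: n * p
    std = math.sqrt(total_rows * p * (1 - p))   # binomial standard deviation
    return mean, std

mean, std = expected_sample_size(1_000_000_000, 1000)
print(f"expected rows: {mean:,.0f}, standard deviation: {std:,.0f}")
```

Under that assumption the output is about a million rows, give or take roughly a thousand, which is usually close enough for prototyping.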

Alternatively, if you just want a million records in your output, and you either

  1. don't want to rely on knowing the size of the entire table, or
  2. just want the first million rows, regardless of how many rows there are,

you can use either of two row-filtering methods: top rows or range.

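For comparison, the "top rows" idea looks like this in plain Python. It is a sketch of the concept rather than of Dataprep itself, and `first_n_rows` is a hypothetical helper; the point is that the total row count never needs to be known.

```python
from itertools import islice

def first_n_rows(rows, n):
    """Return the first n rows of any iterable, reading no more than n of them."""
    return list(islice(rows, n))

# Works on a stream of unknown length, here a generator standing in for a large file.
stream = (f"row-{i}" for i in range(10_000_000))
head = first_n_rows(stream, 5)
print(head)
```

Keep in mind that the first rows of a file are not a random sample, so this fits prototyping best when row order carries no meaning.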

PS: By using the $sourcerownumber metadata reference (see the in-product documentation), you can filter or keep a portion of the data (as in the first scenario) in one step, that is, without creating an additional column.

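The post does not spell out the exact filter expression, so the following Python sketch is only one plausible reading: keep every 1,000th row based on its row number. Here `enumerate` plays the role of the source row number, and `keep_every_kth_row` is a hypothetical name.

```python
def keep_every_kth_row(rows, k=1000):
    """Keep rows whose 1-based row number is a multiple of k."""
    return [row for number, row in enumerate(rows, start=1) if number % k == 0]

sample = keep_every_kth_row(range(1, 10_001), k=1000)
print(sample)
```

This gives a systematic rather than random sample: it needs no extra column, but it can be biased if the data repeats with a period related to `k`.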

BTW, an easy way to discover how-tos in Trifacta is to type what you're looking for in the transformation search pane (opened with Ctrl-K). Searching for "filter" will show most of the options relevant to your problem. Cheers!
