
How can I see the data preview of Dask DataFrame?

I created a Dask DataFrame from a pandas DataFrame and applied a few functions to it. When I try to view the data using

 df.head()

it takes too much time. How can I view the dataframe?

It really depends on what computations are behind your dataframe.

The df.head() command executes only the operations needed to get a few rows of data from the dataframe. Often this is very fast. For example, if we are reading a large dataframe from a Parquet or CSV file, we only need to load the first chunk of data to get the first few rows.

import dask.dataframe as dd

df = dd.read_csv('...')
df.head()  # this is relatively fast

However, if our dataframe is more complex, for example the result of a lazy shuffle or set_index operation, then we might genuinely need to read and process all of our data before we can get the first few rows.

df = df.set_index('some-column')
df = df.merge(some_other_df)
df.head()  # this is slow, because it has to do the set_index and merge

You can always see metadata cheaply (column names, types, number of tasks and partitions).

>>> df
Dask DataFrame Structure:
                       close     high      low     open
npartitions=505                                        
2008-01-02 09:00:00  float64  float64  float64  float64
2008-01-03 09:00:00      ...      ...      ...      ...
...                      ...      ...      ...      ...
2009-12-31 09:00:00      ...      ...      ...      ...
2009-12-31 16:00:00      ...      ...      ...      ...
Dask Name: from-delayed, 1010 tasks

Persist

If your data fits in RAM (or distributed RAM if you're on a cluster) then you should also persist it in memory. This makes later operations very fast.

df = df.persist()

However, if you don't have enough RAM, this may slow down your machine.
