
How to subset one row in dask.dataframe?

I am trying to select a single row from a dask.dataframe with x.loc[0].compute() . It returns 4 rows, all with index=0 . I tried reset_index , but there are still 4 rows with index=0 after resetting. (I think I reset correctly, because I used reset_index(drop=False) and could see the original index in the new column.)

I read the dask.dataframe documentation, and it says something along the lines that there may be more than one row with index=0 because of how dask splits the data into chunks.

So, if I really want only one row when subsetting by index=0 , how can I do this?

Edit: Your problem probably comes from reset_index . This is explained at the end of the answer; the earlier part shows how to solve it.

For example, there is the following dask DataFrame:

import pandas as pd
import dask
import dask.dataframe as dd


df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')}, 
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]: 
   col_1 col_2
0      1     a
0      2     b
1      3     c
2      4     d
3      5     e
4      6     f
5      7     g

It has a numeric index in which the label 0 is repeated. Since loc is a

Purely label-location based indexer for selection by label

it selects both rows labelled 0 when you run

df.loc[0].compute()
Out[]: 
   col_1 col_2
0      1     a
0      2     b

That is, you get every row labelled 0 (or whichever label you specify).
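The same label-based behaviour can be reproduced in plain pandas, which is what each dask partition holds. The sketch below uses pandas only, with the frame copied from the example above; the names pdf, matches and first_match are chosen for illustration. It shows that loc returns every matching row, and that a positional step is needed to keep just one.

```python
import pandas as pd

# Same data as the example above, as a plain pandas frame.
pdf = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                   index=pd.Index([0, 0, 1, 2, 3, 4, 5]))

# A list of labels always gives a DataFrame; both rows labelled 0 match.
matches = pdf.loc[[0]]

# Keep only the first match by position.
first_match = matches.iloc[[0]]

print(len(matches))   # 2
print(first_match)
```

With dask, the same positional step could be applied to the pandas result returned by df.loc[0].compute() .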

In pandas there is pd.DataFrame.iloc , which selects a row by its integer position. Unfortunately, you can't do this in dask, because there iloc is

Purely integer-location based indexing for selection by position.

Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.

To work around this, you can use an indexing trick:

df.compute()
Out[2]: 
   index  col_1 col_2
x                    
0      0      1     a
1      0      2     b
2      1      3     c
3      2      4     d
4      3      5     e
5      4      6     f
6      5      7     g

Now there is a new index that runs from 0 to the length of the data frame minus 1.

You can select from it with loc as follows (I assume that selecting the 0 label via loc means "select the first row"):

df.loc[0].compute()
Out[3]: 
   index  col_1 col_2
x                    
0      0      1     a

About the duplicated 0 index label
If you need the original index, it is still there and can be accessed through the 'index' column:

df.loc[:, 'index'].compute()
Out[4]: 
x
0    0
1    0
2    1
3    2
4    3
5    4
6    5

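I guess you get this duplication from reset_index() or something similar, because it generates a new index starting from 0 for each partition. A frame with four partitions would therefore end up with four rows labelled 0, as in the question. For example, for this table with 2 partitions: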

df.reset_index().compute()
Out[5]: 
   index  col_1 col_2
0      0      1     a
1      0      2     b
2      1      3     c
3      2      4     d
0      3      5     e
1      4      6     f
2      5      7     g
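The effect can be imitated in plain pandas. This is only a simulation of what happens per partition, not dask's own code: the frame is split by hand into the same two chunks (4 rows and 3 rows, matching the output above), each chunk is reset separately, and the pieces are glued back together. The names parts and reset are illustrative.

```python
import pandas as pd

pdf = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                   index=pd.Index([0, 0, 1, 2, 3, 4, 5]))

# Hand-made stand-ins for the two partitions shown above.
parts = [pdf.iloc[:4], pdf.iloc[4:]]

# Resetting each chunk on its own restarts the numbering at 0 every time.
reset = pd.concat([part.reset_index() for part in parts])

print(reset.index.tolist())   # [0, 1, 2, 3, 0, 1, 2]
print(reset.loc[[0]])         # one row from each chunk
```

Each chunk contributes its own row labelled 0, which is why the number of rows returned by loc[0] after reset_index matches the number of partitions.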
