I am trying to select only one row from a dask.dataframe using the command x.loc[0].compute(). It returns 4 rows, all with index=0. I tried reset_index, but there are still 4 rows with index=0 after resetting. (I think I did the reset correctly, because I used reset_index(drop=False) and could see the original index in the new column.)
I read the dask.dataframe documentation, and it says something along the lines that there may be more than one row with index=0 because of how dask structures the chunked data.
So, if I really want only one row when subsetting with index=0, how can I do this?
Edit: Your problem probably comes from reset_index. This issue is explained at the end of the answer; the earlier part of the text just shows how to solve it.
For example, there is the following dask DataFrame:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]:
col_1 col_2
0 1 a
0 2 b
1 3 c
2 4 d
3 5 e
4 6 f
5 7 g
It has a numerical index with repeated 0 values. Because loc is a
Purely label-location based indexer for selection by label
it selects both 0-labeled rows if you run
df.loc[0].compute()
Out[]:
col_1 col_2
0 1 a
0 2 b
You get all the rows labeled 0 (or whichever label you specify).
In pandas there is pd.DataFrame.iloc, which selects a row by its integer position. Unfortunately, you can't do this in dask, because there iloc is
Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
To work around this, you can use some indexing tricks to replace the index with a unique positional one (named x in the output that follows):
df.compute()
Out[2]:
index col_1 col_2
x
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
4 3 5 e
5 4 6 f
6 5 7 g
Now there is a new index ranging from 0 to the length of the data frame minus 1.
It's possible to slice it with loc and do the following (I assume that selecting the 0 label via loc means "select the first row"):
df.loc[0].compute()
Out[3]:
index col_1 col_2
x
0 0 1 a
About the duplicated 0 index label
If you need the original index, it's still there and can be accessed through
df.loc[:, 'index'].compute()
Out[4]:
x
0 0
1 0
2 1
3 2
4 3
5 4
6 5
I guess you get this duplication from reset_index() or something similar, because it generates a new 0-based index for each partition. For example, for this table with 2 partitions:
df.reset_index().compute()
Out[5]:
index col_1 col_2
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
0 3 5 e
1 4 6 f
2 5 7 g