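How do I iterate over a column's values for the unique index entries of a pandas DataFrame whose sorted numeric index contains duplicates?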

Question

I have a pandas DataFrame with a sorted numeric index that contains duplicates. Within a given column, rows that share an index value also share the same column value. I would like to iterate over that column's values once per unique index value.

Example

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

   a  b
1  3  4
1  3  6
2  5  8

I want to iterate over the values in column a for the unique entries in the index, that is, [3, 5].

When I iterate over df.index and print the type of each lookup in column a, the duplicated index label returns a Series instead of a scalar.

for i in df.index:
    cell_value = df['a'].loc[i]
    print(type(cell_value))

Output:

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'numpy.int64'>

Answer 1

First build a mask of the non-duplicated index entries and use it to pick their positions from np.arange, then select by position with iloc:

arr = np.arange(len(df.index))
a = arr[~df.index.duplicated()]
print(a)
[0 2]

for i in a:
    cell_value = df['a'].iloc[i]
    print(type(cell_value))

<class 'numpy.int64'>
<class 'numpy.int64'>
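The difference between label-based and positional lookup that this answer relies on can be sketched as follows. This is a minimal, self-contained illustration on the question's example frame; the names by_label, by_position, first_positions and values are made up for the sketch, and np.flatnonzero is swapped in for the arange-plus-mask step as an equivalent way to get the positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# Label-based lookup: label 1 occurs twice, so a Series of length 2 comes back.
by_label = df['a'].loc[1]

# Positional lookup: position 0 is exactly one row, so a scalar comes back.
by_position = df['a'].iloc[0]

# Positions of the first occurrence of each index label
# (equivalent to np.arange(len(df.index))[~df.index.duplicated()]).
first_positions = np.flatnonzero(~df.index.duplicated())

# One plain Python int per unique index label.
values = [int(df['a'].iloc[pos]) for pos in first_positions]
print(values)
```

Because iloc addresses rows by position, each lookup hits exactly one row regardless of how often its label repeats.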

Solution without a loop: use boolean indexing with the duplicated mask, inverted by ~:

a = df.loc[~df.index.duplicated(), 'a']
print(a)
1    3
2    5
Name: a, dtype: int64

b = df.loc[~df.index.duplicated(), 'a'].tolist()
print(b)
[3, 5]

print(~df.index.duplicated())
[ True False  True]
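If a loop is still wanted, the masked Series can be iterated directly. The sketch below is an illustration rather than part of the original answer: it shows the keep parameter of Index.duplicated and walks the result with Series.items(). The names first_mask, last_mask, unique_a and pairs are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# keep='first' (the default) flags later repeats as duplicates;
# keep='last' flags the earlier ones instead.
first_mask = ~df.index.duplicated(keep='first')
last_mask = ~df.index.duplicated(keep='last')

# One row of column a per unique index label.
unique_a = df.loc[first_mask, 'a']

# Series.items() yields (index label, value) pairs.
pairs = [(int(label), int(value)) for label, value in unique_a.items()]
print(pairs)
```

Since the question states that rows sharing an index label share the same value in column a, either mask should select the same values here.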

Answer 2

Try np.unique:

_, i = np.unique(df.index, return_index=True)
df.iloc[i, df.columns.get_loc('a')].tolist() 

[3, 5]
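A self-contained version of this approach, with the imports it needs, might look like the sketch below. One point worth keeping in mind is that np.unique returns the unique labels in sorted order, so the result follows label order rather than row order; the two coincide here because the index is sorted. The frame df_unsorted is a hypothetical example added only to show that behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# return_index=True gives the position of the first occurrence of each unique label.
labels, first_pos = np.unique(df.index, return_index=True)

col = df.columns.get_loc('a')
result = df.iloc[first_pos, col].tolist()
print(result)

# Hypothetical unsorted index: positions come back ordered by label value (1, then 2).
df_unsorted = pd.DataFrame({'a': [5, 5, 3]}, index=[2, 2, 1])
_, pos_unsorted = np.unique(df_unsorted.index, return_index=True)
unsorted_result = df_unsorted.iloc[pos_unsorted, 0].tolist()
print(unsorted_result)
```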

Answer 3

This looks like an XY problem if, as per your comment, the same index means the same data.

You also don't need a loop for this.

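Assuming you want to drop the repeated rows and extract only the first column (i.e. 3, 5), the following should suffice. Passing subset='a' makes drop_duplicates compare column a only; on the example frame this is required, because the two rows under index 1 differ in column b.

res = df.drop_duplicates(subset='a').loc[:, 'a']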

# 1    3
# 2    5
# Name: a, dtype: int64

To return types:

types = list(map(type, res))

print(types)
# [<class 'numpy.int64'>, <class 'numpy.int64'>]
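Note that drop_duplicates compares column values, not index labels, so which rows survive depends on the columns it is told to look at. A minimal sketch of the difference on the question's example frame follows; the names all_cols, by_a and by_index are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# All columns compared: (3, 4) and (3, 6) differ in b, so nothing is dropped.
all_cols = df.drop_duplicates()

# Only column a compared: the second row with a == 3 is dropped.
by_a = df.drop_duplicates(subset='a')

# Index labels compared: independent of the column values.
by_index = df[~df.index.duplicated()]

print(len(all_cols), by_a['a'].tolist(), by_index['a'].tolist())
```

Deduplicating on column a would also merge rows from different index labels that happen to share a value, so the index-based mask is the safer choice when that can occur.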

Answer 4

Another solution, using groupby and apply:

df.groupby(level=0).apply(lambda x: type(x.a.iloc[0]))
Out[330]: 
1    <class 'numpy.int64'>
2    <class 'numpy.int64'>
dtype: object

To make your loop solution work, create a temporary DataFrame:

df_new = df.groupby(level=0).first()
for i in df_new.index:
    cell_value = df_new['a'].loc[i]
    print(type(cell_value))

<class 'numpy.int64'>
<class 'numpy.int64'>
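As a variation on the same idea, the groupby object can be iterated directly, or reduced to a single column first, without building a temporary frame. This is a sketch on the question's example frame, with first_a and seen as hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# One value of column a per index label, taken from the first row of each group.
first_a = df.groupby(level=0)['a'].first()

# Iterating the groupby yields (label, sub-frame) pairs, one per unique label.
seen = {}
for label, group in df.groupby(level=0):
    seen[int(label)] = int(group['a'].iloc[0])

print(first_a.tolist(), seen)
```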

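Or use drop_duplicates(), restricted to column a so that the repeated index entry is actually dropped on the example frame:

deduped = df.drop_duplicates(subset='a')  # compare column a only
for i in deduped.index:
    cell_value = deduped['a'].loc[i]
    print(type(cell_value))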

<class 'numpy.int64'>
<class 'numpy.int64'>

4 answers

Answer 1
2, accepted, 2018-03-06 12:32:15

Answer 2
2, 2018-03-06 12:40:06

Answer 3
0, 2018-03-06 12:33:00

Answer 4
0, 2018-03-06 12:43:55
