Efficient way to get the unique values from 2 or more columns in a Dataframe

Given a matrix from an SFrame:

>>> from sframe import SFrame
>>> sf =SFrame({'x':[1,1,2,5,7], 'y':[2,4,6,8,2], 'z':[2,5,8,6,2]})
>>> sf
Columns:
    x   int
    y   int
    z   int

Rows: 5

Data:
+---+---+---+
| x | y | z |
+---+---+---+
| 1 | 2 | 2 |
| 1 | 4 | 5 |
| 2 | 6 | 8 |
| 5 | 8 | 6 |
| 7 | 2 | 2 |
+---+---+---+
[5 rows x 3 columns]

I want to get the unique values for the x and y columns, and I can do it like this:

>>> sf['x'].unique().append(sf['y'].unique()).unique()
dtype: int
Rows: 7
[2, 8, 5, 4, 1, 7, 6]

This way I get the unique values of x and the unique values of y, append them, and then get the unique values of the appended list.

I could also do it like this:

>>> sf['x'].append(sf['y']).unique()
dtype: int
Rows: 7
[2, 8, 5, 4, 1, 7, 6]

But that way, if my x and y columns are huge with lots of duplicates, I would be appending them into a very large container before getting the unique values.
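The size difference between the two intermediate containers can be sketched in plain Python. This is only an illustration with ordinary lists and sets, not SFrame code; the names `small` and `big` are made up for the example.

```python
# Toy columns with many duplicates (plain lists standing in for SFrame columns).
x = [1, 1, 2, 5, 7] * 1000
y = [2, 4, 6, 8, 2] * 1000

# Variant 1: deduplicate each column first, then combine the small results.
small = list(set(x)) + list(set(y))

# Variant 2: concatenate everything first, then deduplicate once.
big = x + y

# Both variants end with one final deduplication pass.
unique_1 = set(small)
unique_2 = set(big)

print(len(small), len(big))   # size of each intermediate container
print(unique_1 == unique_2)   # both give the same unique values
```

The first variant keeps the intermediate container small but deduplicates three times; the second deduplicates once but on the full concatenated data.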

Is there a more efficient way to get the unique values of a combined column created from 2 or more columns in an SFrame?

What is the equivalent efficient way to get unique values from 2 or more columns in pandas?

I don't have SFrame, but I tested this on a pd.DataFrame:

  sf[["x", "y"]].stack().value_counts().index.tolist()
  [2, 1, 8, 7, 6, 5, 4]

The easiest way I can think of is to convert to a numpy array and then find the unique values:

np.unique(sf[['x', 'y']].to_numpy())

array([1, 2, 4, 5, 6, 7, 8])

If you need it in an SFrame:

SFrame({'xy_unique': np.unique(sf[['x', 'y']].to_numpy())})
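For readers without SFrame installed, the same idea can be tried on a pandas DataFrame. This is a minimal sketch that assumes the data fits in memory; the `return_counts=True` option is an addition not used in the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 5, 7], 'y': [2, 4, 6, 8, 2], 'z': [2, 5, 8, 6, 2]})

# to_numpy() yields a 2-D array; np.unique flattens it and returns
# the unique values in sorted order, optionally with their counts.
values, counts = np.unique(df[['x', 'y']].to_numpy(), return_counts=True)

print(values.tolist())  # [1, 2, 4, 5, 6, 7, 8]
print(counts.tolist())  # [2, 3, 1, 1, 1, 1, 1]
```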

SFrame

I haven't used SFrame and don't know under which conditions it copies data. (Does the selection sf['x'] or append copy data to memory?) There are pack_columns and stack methods in SFrame, and if they don't copy data, then this should work:

sf[['x', 'y']].pack_columns(new_column_name='N').stack('N').unique()

pandas

If your data fits into memory, you can probably do it efficiently in pandas without an extra copy.

# copies the data to memory
df = sf[['x', 'y']].to_dataframe()

# a reference to the underlying numpy array (no copy)
vals = df.values

# 1d array: 
# (numpy.ravel doesn't copy if it doesn't have to - it depends on the data layout)
if np.isfortran(vals):
    vals_1d = vals.ravel(order='F')
else:
    vals_1d = vals.ravel(order='C')

uniques = pd.unique(vals_1d)

pandas's pd.unique is more efficient than numpy's np.unique because it doesn't sort the result.
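Whether ravel copies can be checked directly with np.shares_memory. The sketch below uses a small hand-built array rather than a DataFrame, and the variable names are illustrative only.

```python
import numpy as np

c_vals = np.array([[1, 2], [1, 4], [2, 6], [5, 8], [7, 2]])  # C-contiguous (row-major)
f_vals = np.asfortranarray(c_vals)                           # F-contiguous (column-major)

# Matching the ravel order to the memory layout yields a view (no copy).
f_view = f_vals.ravel(order='F')
# A mismatched order forces numpy to copy the data.
f_copy = f_vals.ravel(order='C')
# order='K' follows the memory layout, whichever it is.
c_keep = c_vals.ravel(order='K')
f_keep = f_vals.ravel(order='K')

print(np.isfortran(f_vals), np.isfortran(c_vals))
print(np.shares_memory(f_vals, f_view), np.shares_memory(f_vals, f_copy))
print(np.shares_memory(c_vals, c_keep), np.shares_memory(f_vals, f_keep))
```

For contiguous arrays like these, ravel(order='K') avoids the copy in both layouts, so it could stand in for the if/else branch when the order of the flattened elements does not matter, as is the case when only the unique values are wanted.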

Take a look at this answer to a similar question. Note that pandas' pd.unique function is considerably faster than numpy's np.unique.

>>> pd.unique(sf[['x','y']].values.ravel())
array([2, 8, 5, 4, 1, 7, 6], dtype=object)
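One practical difference between the two functions is the order of the result. The sketch below runs on a pandas DataFrame (not an SFrame, so the order differs from the output above): pd.unique keeps values in order of first appearance, while np.unique returns them sorted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 5, 7], 'y': [2, 4, 6, 8, 2]})

# Flatten row by row: 1, 2, 1, 4, 2, 6, 5, 8, 7, 2
flat = df[['x', 'y']].to_numpy().ravel()

by_appearance = pd.unique(flat)  # no sorting, order of first appearance
sorted_unique = np.unique(flat)  # sorted ascending

print(by_appearance.tolist())  # [1, 2, 4, 6, 5, 8, 7]
print(sorted_unique.tolist())  # [1, 2, 4, 5, 6, 7, 8]
```

If a sorted result is needed, sorting the (usually much smaller) output of pd.unique with np.sort gives the same values as np.unique.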

Although I don't know how to do it in SFrame, here's a longer explanation of @Merlin's answer:

>>> import pandas as pd
>>> df = pd.DataFrame({'x':[1,1,2,5,7], 'y':[2,4,6,8,2], 'z':[2,5,8,6,2]})
>>> df[['x', 'y']]
   x  y
0  1  2
1  1  4
2  2  6
3  5  8
4  7  2

To extract only columns x and y:

>>> df[['x', 'y']] # Extract only columns x and y
   x  y
0  1  2
1  1  4
2  2  6
3  5  8
4  7  2

To stack the 2 columns of each row into a single column, while still being able to access the values like a dictionary:

>>> df[['x', 'y']].stack()                       
0  x    1
   y    2
1  x    1
   y    4
2  x    2
   y    6
3  x    5
   y    8
4  x    7
   y    2
dtype: int64
>>> df[['x', 'y']].stack()[0]      
x    1
y    2
dtype: int64
>>> df[['x', 'y']].stack()[0]['x']
1
>>> df[['x', 'y']].stack()[0]['y']
2

Count the occurrences of each value across the combined columns:

>>> df[['x', 'y']].stack().value_counts() # index(i.e. keys)=elements, Value=counts
2    3
1    2
8    1
7    1
6    1
5    1
4    1

To access the index and the counts:

>>> df[['x', 'y']].stack().value_counts().index      
Int64Index([2, 1, 8, 7, 6, 5, 4], dtype='int64')
>>> df[['x', 'y']].stack().value_counts().values  
array([3, 2, 1, 1, 1, 1, 1])

Convert to a list:

>>> df[["x", "y"]].stack().value_counts().index.tolist()
[2, 1, 8, 7, 6, 5, 4]
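If the counts themselves are not needed, calling unique() on the stacked Series is a shorter route to the same values. The sketch below is an alternative to the steps above, not part of the original answer; it also sorts the value_counts keys, since the relative order of values with equal counts should not be relied on.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 5, 7], 'y': [2, 4, 6, 8, 2], 'z': [2, 5, 8, 6, 2]})
stacked = df[['x', 'y']].stack()

# unique() skips the counting step and keeps order of first appearance.
uniques = stacked.unique().tolist()

# value_counts() also gives the unique values (as its index), plus counts.
counts = stacked.value_counts()
sorted_keys = sorted(counts.index.tolist())

print(uniques)      # [1, 2, 4, 6, 5, 8, 7]
print(sorted_keys)  # [1, 2, 4, 5, 6, 7, 8]
```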

Still, an SFrame answer would be great too. The same syntax doesn't work for SFrame.

Here's a little benchmark of three possible methods:

from sframe import SFrame
import numpy as np
import pandas as pd
import timeit

sf = SFrame({'x': [1, 1, 2, 5, 7], 'y': [2, 4, 6, 8, 2], 'z': [2, 5, 8, 6, 2]})


def f1(sf):
    return sf['x'].unique().append(sf['y'].unique()).unique()


def f2(sf):
    return sf['x'].append(sf['y']).unique()


def f3(sf):
    return np.unique(sf[['x', 'y']].to_numpy())

N = 1000

# print(...) with a single argument behaves the same in Python 2 and Python 3
print(timeit.timeit('f1(sf)', setup='from __main__ import f1, sf', number=N))
print(timeit.timeit('f2(sf)', setup='from __main__ import f2, sf', number=N))
print(timeit.timeit('f3(sf)', setup='from __main__ import f3, sf', number=N))

# 13.3195129933
# 4.66225642657
# 3.65669089489
# [Finished in 23.6s]

Benchmark run with Python 2.7.11 x64 on Windows 7 with a 2.6 GHz i7.

Conclusion: I'd suggest you use np.unique, which is basically f3.
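The benchmark above runs on a 5-row frame, so the timings may be dominated by per-call overhead rather than by the data. A rough way to repeat the comparison on larger, duplicate-heavy data is sketched below with pandas and numpy instead of SFrame. The function names, the data size, and the value range are assumptions chosen for illustration, and no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: many rows, few distinct values (lots of duplicates).
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({'x': rng.integers(0, 100, size=n),
                   'y': rng.integers(0, 100, size=n)})


def unique_then_concat(df):
    # Analogue of f1: deduplicate each column, combine, deduplicate again.
    parts = [np.unique(df['x'].to_numpy()), np.unique(df['y'].to_numpy())]
    return np.unique(np.concatenate(parts))


def concat_then_unique(df):
    # Analogue of f2: combine the full columns, deduplicate once.
    return np.unique(np.concatenate([df['x'].to_numpy(), df['y'].to_numpy()]))


def pandas_unique(df):
    # Hash-based deduplication without sorting.
    return pd.unique(df[['x', 'y']].to_numpy().ravel())


for func in (unique_then_concat, concat_then_unique, pandas_unique):
    seconds = timeit.timeit(lambda: func(df), number=10)
    print(func.__name__, round(seconds, 4))
```

All three functions return the same set of values; only the order of the result (sorted or by first appearance) and the running time differ.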
