用于搜索的Pandas列索引？

Question

In relational database, we can create index on columns to speed up querying and joining on those columns. 在关系数据库中，我们可以在列上创建索引，以加快查询和加入这些列。 I want to do them same thing on pandas dataframe. 我想在pandas数据帧上做同样的事情。 The row index seems not what relational database offers. 行索引似乎不是关系数据库提供的。

The question is: Are columns in pandas indexed for searching by default? 问题是：pandas中的列是否为默认搜索索引？

If not, is it possible to index columns manually and how to do it? 如果没有，是否可以手动索引列以及如何进行索引？

Edit: I have read pandas docs and searched everywhere, but no one mentions indexing and searching/merging performance on pandas. 编辑：我已经阅读过pandas文档并在各处搜索，但没有人提到索引和搜索/合并大熊猫的表现。 Seem no one care about this issue, although it is critical in relational database. 虽然它在关系数据库中很重要，但似乎没有人关心这个问题。 Can any one make a statement about indexing and performance on pandas? 任何人都可以在熊猫上做出关于索引和性能的陈述吗？

Thanks. 谢谢。

Answer 1

As mentioned by @pvg - The pandas model is not that of an in memory relational databases. 如@pvg所述 - 熊猫模型不是内存关系数据库的模型。 So, it won't help us much if we try to analogize pandas in terms of sql and it's idiosyncracies. 所以，如果我们尝试用sql和它的特性来模拟大熊猫，它对我们没什么帮助。 Instead, let's look at the problem fundamentally - you're effectively trying to speed up column lookups/ joins. 相反，让我们从根本上看问题 - 你正在有效地尝试加速列查找/连接。

You can speed up joins considerably by setting the column you wish to join by as the index in both dataframes (left and right dataframes that you wish to join) and then sorting both the indexes . 通过将要加入的列设置为两个数据框 （您希望加入的左右数据框）中的索引 ，然后对两个索引进行排序，可以大大加快连接速度。

Here's an example to show you the kind of speed up you can get when joining on sorted indexes: 这是一个示例，向您展示加入已排序索引时可以获得的加速类型：

import pandas as pd
from numpy.random import randint

# Creating DATAFRAME #1
columns1 = ['column_1', 'column_2']
rows_df_1 = []

# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_1.append(row)

df1 = pd.DataFrame(rows_df_1)
df1.columns = columns1

print(df1.head())

The first dataframe looks like this: 第一个数据框如下所示：

Out[]:    

column_1  column_2
0        83        66
1        91        12
2        49         0
3        26        75
4        84        60

Let's create the second dataframe: 让我们创建第二个数据帧：

columns2 = ['column_3', 'column_4']
rows_df_2 = []
# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_2.append(row)

df2 = pd.DataFrame(rows_df_1)
df2.columns = columns2

The second dataframe looks like this: 第二个数据框如下所示：

Out[]:    

   column_3  column_4
0        19        26
1        78        44
2        44        43
3        95        47
4        48        59

Now let's say you wish to join these two dataframes on column_1 == column_3 现在假设您希望在column_1 == column_3上加入这两个数据帧

# setting the join columns as indexes for each dataframe
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')


# joining
%time
df1.join(df2)

Out[]:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 46 ms

As you can see, just setting the join columns as dataframe indexes and joining after - takes around 46 milliseconds. 如您所见，只需将连接列设置为数据帧索引并在之后加入 - 大约需要46毫秒。 Now, let's try joining *after sorting the indexes* 现在，让我们在排序索引后尝试加入*

# sorting indexes
df1 = df1.sort_index()
df2 = df2.sort_index()

Out[]:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs

This takes around 9.78 µs, much much faster. 这大约需要9.78μs，速度要快得多。

I believe you can apply the same sorting technique to pandas columns - sort the columns lexicographically and modify the dataframe. 我相信你可以对pandas列应用相同的排序技术 - 按字典顺序对列进行排序并修改数据帧。 I haven't tested the code below, but something like this should give you a speedup on column lookups: 我没有测试下面的代码，但是这样的事情可以让你加速列查找：

import numpy as np
# Lets assume df is a dataframe with thousands of columns
df = read_csv('csv_file.csv')
columns = np.sort(df.columns)

df = df[columns]

Now column lookups should be much faster - would be great if someone could test this out on a dataframe with a thousand of columns 现在列查找应该更快 - 如果有人可以在具有数千列的数据帧上测试它，那将是很好的

用于搜索的Pandas列索引？

问题描述

1 个解决方案

解决方案1
2 2017-03-07 06:45:06

用于搜索的Pandas列索引？

问题描述

1 个解决方案

解决方案1 2 2017-03-07 06:45:06

解决方案1
2 2017-03-07 06:45:06