简体   繁体   中英

Pandas column indexing for searching?

In relational database, we can create index on columns to speed up querying and joining on those columns. I want to do them same thing on pandas dataframe. The row index seems not what relational database offers.

The question is: Are columns in pandas indexed for searching by default?

If not, is it possible to index columns manually and how to do it?

Edit: I have read pandas docs and searched everywhere, but no one mentions indexing and searching/merging performance on pandas. Seem no one care about this issue, although it is critical in relational database. Can any one make a statement about indexing and performance on pandas?

Thanks.

As mentioned by @pvg - The pandas model is not that of an in memory relational databases. So, it won't help us much if we try to analogize pandas in terms of sql and it's idiosyncracies. Instead, let's look at the problem fundamentally - you're effectively trying to speed up column lookups/ joins.

You can speed up joins considerably by setting the column you wish to join by as the index in both dataframes (left and right dataframes that you wish to join) and then sorting both the indexes .

Here's an example to show you the kind of speed up you can get when joining on sorted indexes:

import pandas as pd
from numpy.random import randint

# Creating DATAFRAME #1
columns1 = ['column_1', 'column_2']
rows_df_1 = []

# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_1.append(row)

df1 = pd.DataFrame(rows_df_1)
df1.columns = columns1

print(df1.head())

The first dataframe looks like this:

Out[]:    

column_1  column_2
0        83        66
1        91        12
2        49         0
3        26        75
4        84        60

Let's create the second dataframe:

columns2 = ['column_3', 'column_4']
rows_df_2 = []
# generate 500 rows
# each element is a number between 0 and 100
for i in range(0,500):
    row = [randint(0,100) for x in range(0, 2)]
    rows_df_2.append(row)

df2 = pd.DataFrame(rows_df_1)
df2.columns = columns2

The second dataframe looks like this:

Out[]:    

   column_3  column_4
0        19        26
1        78        44
2        44        43
3        95        47
4        48        59

Now let's say you wish to join these two dataframes on column_1 == column_3

# setting the join columns as indexes for each dataframe
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')


# joining
%time
df1.join(df2)

Out[]:
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 46 ms

As you can see, just setting the join columns as dataframe indexes and joining after - takes around 46 milliseconds. Now, let's try joining *after sorting the indexes*

# sorting indexes
df1 = df1.sort_index()
df2 = df2.sort_index()

Out[]:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs

This takes around 9.78 µs, much much faster.

I believe you can apply the same sorting technique to pandas columns - sort the columns lexicographically and modify the dataframe. I haven't tested the code below, but something like this should give you a speedup on column lookups:

import numpy as np
# Lets assume df is a dataframe with thousands of columns
df = read_csv('csv_file.csv')
columns = np.sort(df.columns)

df = df[columns]

Now column lookups should be much faster - would be great if someone could test this out on a dataframe with a thousand of columns

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM