
Search in two dimensional array in Python

I'd like to retrieve specific rows from a large dataset (9M lines, 1.4 GB) in Python, given two or more parameters.

For example, from this dataset:

ID1 2   10  2   2   1   2   2   2   2   2   1

ID2 10  12  2   2   2   2   2   2   2   1   2

ID3 2   22  0   1   0   0   0   0   0   1   2

ID4 14  45  0   0   0   0   1   0   0   1   1

ID5 2   8   1   1   1   1   1   1   1   1   2

Given the example parameters :

  • second column must be equal to 2, and
  • third column must be within a range from 4 to 15

I should obtain :

ID1 2   10  2   2   1   2   2   2   2   2   1

ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to do these operations efficiently on a two-dimensional array in Python.

This is what I tried:

line_list = []

# Load the whole file into memory (file is an already-open file object)
for line in file:
    line_list.append(line)

# Set conditions
i = 2
start_range = 4
end_range = 15

# Iterate over the loaded lines and split each one into columns
for line in line_list:
    data = line.strip().split()
    # Columns are read as strings, so convert them to int before comparing
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

I'd like to run this process many times, and the way I'm doing it is really slow, even with the data file loaded in memory.

I was thinking about using numpy arrays, but I don't know how to retrieve a row given conditions.

Thanks for your help!

UPDATE:

As suggested, I used a relational database system. I chose SQLite3 because it is easy to use and quick to deploy.

My file was loaded through an import function in sqlite3 in roughly 4 minutes.

I created an index on the second and third columns to speed up retrieval.

The queries are run from Python with the "sqlite3" module.

That is way, way faster!
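
For anyone wanting to try the same approach, here is a minimal sketch of what the indexed query could look like. The table name, column names, index name and the find_rows helper are all hypothetical (the post does not give the actual schema), only the first three columns of the sample rows are loaded, and a single composite index is one possible reading of "an index on the second and third columns". An in-memory database stands in for the real file.

```python
import sqlite3

# Hypothetical schema: "data", "col2", "col3" are assumed names, not from the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id TEXT, col2 INTEGER, col3 INTEGER)")

# First three columns of the sample rows from the question.
sample = [
    ("ID1", 2, 10),
    ("ID2", 10, 12),
    ("ID3", 2, 22),
    ("ID4", 14, 45),
    ("ID5", 2, 8),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", sample)

# Composite index covering both filtered columns (an assumption about the setup).
conn.execute("CREATE INDEX idx_col2_col3 ON data (col2, col3)")


def find_rows(conn, value, low, high):
    """Return rows whose col2 equals value and whose col3 lies in [low, high]."""
    cur = conn.execute(
        "SELECT id, col2, col3 FROM data "
        "WHERE col2 = ? AND col3 BETWEEN ? AND ? ORDER BY id",
        (value, low, high),
    )
    return cur.fetchall()


matches = find_rows(conn, 2, 4, 15)
print(matches)  # [('ID1', 2, 10), ('ID5', 2, 8)]
```

Passing the values as ? placeholders lets the same statement be reused for each new set of conditions without rebuilding the SQL string.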

I'd go for almost what you've got (un-tested):

with open('somefile') as fin:
    rows = (line.split() for line in fin)
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass # do something
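
If the numeric columns fit in memory and the same data has to be filtered many times, the numpy route mentioned in the question might look roughly like this: load once, then select rows with a boolean mask. This is only a sketch on the five sample rows. The variable names are made up for illustration, and for a 9M-line file you would probably want a faster loader than the plain Python parsing shown here.

```python
import numpy as np

# Sample rows from the question; in practice these would come from the data file.
text = """\
ID1 2 10 2 2 1 2 2 2 2 2 1
ID2 10 12 2 2 2 2 2 2 2 1 2
ID3 2 22 0 1 0 0 0 0 0 1 2
ID4 14 45 0 0 0 0 1 0 0 1 1
ID5 2 8 1 1 1 1 1 1 1 1 2
"""

rows = [line.split() for line in text.splitlines() if line.strip()]

# Keep the string IDs and the numeric columns in two parallel arrays.
ids = np.array([row[0] for row in rows])
values = np.array([[int(x) for x in row[1:]] for row in rows], dtype=np.int64)

# values[:, 0] is the file's second column, values[:, 1] its third column.
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

selected_ids = ids[mask].tolist()
selected = values[mask]

print(selected_ids)  # ['ID1', 'ID5']
```

Each new set of conditions only needs a new mask; the arrays themselves are built once.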
