Search in a two-dimensional array in Python

I'd like to be able to retrieve specific rows from a large dataset (9M lines, 1.4 GB) given two or more parameters, using Python.

For example, from this dataset:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID2 10  12  2   2   2   2   2   2   2   1   2
ID3 2   22  0   1   0   0   0   0   0   1   2
ID4 14  45  0   0   0   0   1   0   0   1   1
ID5 2   8   1   1   1   1   1   1   1   1   2

Given the example parameters:

  • the second column must be equal to 2, and
  • the third column must be within the range 4 to 15

I should obtain:

ID1 2   10  2   2   1   2   2   2   2   2   1
ID5 2   8   1   1   1   1   1   1   1   1   2

The problem is that I don't know how to do these operations efficiently on a two-dimensional array in Python.

This is what I tried:

line_list = []

# Load the whole file into memory
with open('somefile') as file:  # placeholder path
    for line in file:
        line_list.append(line)

# Set conditions
i = 2
start_range = 4
end_range = 15

# Iterate through the loaded list and split each line into columns
for line in line_list:
    data = line.strip().split()
    # split() yields strings, so convert to int before comparing
    if int(data[1]) == i and start_range <= int(data[2]) <= end_range:
        print(data)

I'd like to run this process many times, and the way I'm doing it is really slow, even with the data file loaded in memory.

I was thinking about using numpy arrays, but I don't know how to retrieve a row given conditions.

Thanks for your help!

UPDATE:

As suggested, I used a relational database system. I chose Sqlite3 because it is pretty easy to use and quick to deploy.

My file was loaded through an import function in sqlite3 in roughly 4 minutes.

I created an index on the second and third columns to speed up retrieval.

The queries were run from Python with the "sqlite3" module.

That is way, way faster!
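
For readers who want to try the same route, here is a minimal sketch of what the setup described above might look like. The table name `data`, the column names `c2`, `c3` and `rest`, the in-memory database, and the choice of a single composite index are all assumptions made for illustration. The original import used sqlite3's own import function on the full file, which is not reproduced here.

```python
import sqlite3

# The five sample rows from the question.
SAMPLE = """\
ID1 2 10 2 2 1 2 2 2 2 2 1
ID2 10 12 2 2 2 2 2 2 2 1 2
ID3 2 22 0 1 0 0 0 0 0 1 2
ID4 14 45 0 0 0 0 1 0 0 1 1
ID5 2 8 1 1 1 1 1 1 1 1 2
"""


def build_db(lines):
    """Load whitespace-separated lines into an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (id TEXT, c2 INTEGER, c3 INTEGER, rest TEXT)"
    )
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Keep the remaining columns as one text field for simplicity.
        rows.append((fields[0], int(fields[1]), int(fields[2]), " ".join(fields[3:])))
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    # Assumption: one composite index covering both filtered columns.
    conn.execute("CREATE INDEX idx_c2_c3 ON data (c2, c3)")
    conn.commit()
    return conn


def find_rows(conn, value, low, high):
    """Return rows whose second column equals value and third is in [low, high]."""
    cur = conn.execute(
        "SELECT id, c2, c3, rest FROM data "
        "WHERE c2 = ? AND c3 BETWEEN ? AND ? ORDER BY id",
        (value, low, high),
    )
    return cur.fetchall()


conn = build_db(SAMPLE.splitlines())
matches = find_rows(conn, 2, 4, 15)
for row in matches:
    print(row)
```

Using `?` placeholders lets the same query be reused with different parameters, which matches the "run it many times" use case. On the real 9M-line file you would point `sqlite3.connect` at a file on disk rather than `":memory:"`.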

I'd go for almost what you've got (untested):

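with open('somefile') as fin:
    # Lazily split each line into columns
    rows = (line.split() for line in fin)
    # Convert to int before comparing: second column == 2, third column in [4, 15]
    take = (row for row in rows if int(row[1]) == 2 and 4 <= int(row[2]) <= 15)
    # data = list(take)
    for row in take:
        pass # do something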
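
Since the question asks how to retrieve rows from a numpy array given conditions, here is a rough sketch using a boolean mask. It assumes numpy is installed and that every column after the ID is an integer. The names `ids`, `values` and `mask` are just illustrative, and the sample is parsed from a string rather than from the real file.

```python
import numpy as np

# The five sample rows from the question.
SAMPLE = """\
ID1 2 10 2 2 1 2 2 2 2 2 1
ID2 10 12 2 2 2 2 2 2 2 1 2
ID3 2 22 0 1 0 0 0 0 0 1 2
ID4 14 45 0 0 0 0 1 0 0 1 1
ID5 2 8 1 1 1 1 1 1 1 1 2
"""

# Parse once: IDs go into one array, numeric columns into a 2-D integer array.
id_list = []
value_list = []
for line in SAMPLE.splitlines():
    fields = line.split()
    id_list.append(fields[0])
    value_list.append([int(x) for x in fields[1:]])

ids = np.array(id_list)
values = np.array(value_list)

# values[:, 0] is the file's second column, values[:, 1] its third.
# Each comparison yields a boolean array; & combines them element-wise.
mask = (values[:, 0] == 2) & (values[:, 1] >= 4) & (values[:, 1] <= 15)

# Indexing with the mask keeps only the rows where it is True.
selected_ids = ids[mask]
selected = values[mask]

for row_id, row in zip(selected_ids, selected):
    print(row_id, row.tolist())
```

The parsing cost is paid once. After that, each new set of conditions only needs a new mask, which is evaluated in vectorized form rather than in a Python-level loop.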
