Are there tools that can treat files as tables in a database?

I have CSV files and would like to treat them as tables of a database. Of course, I could load these files into real tables first, but it would be nice to be able to work on them directly from the command line, the way grep, head, tail, sort, and awk are used.

For example, I would like to select a particular column of a file (given by its name), select rows where certain columns have certain values, or order by one of the columns.

Since you tagged this with python and ipython, I assume you'd like to see what it would be like to do this from an ipython prompt. So, here's a trivial CSV file, people.csv:

first,last,age
John,Smith,20
Jane,Smith,19
Frank,Jones,30

Now, here's an ipython session using it:

In [1]: import csv
In [2]: from operator import itemgetter
In [3]: with open('people.csv') as f: people = list(csv.DictReader(f))
In [4]: [p['age'] for p in sorted(people, key=itemgetter('first')) if p['last'] == 'Smith']
Out[4]: ['19', '20']

It takes one line to read a CSV file into memory as a list of dicts.

Given that, you can run list comprehensions on it.

So, p['age'] selects a column by name; sorted(people, key=itemgetter('first')) orders by another column; and if p['last'] == 'Smith' is a where clause. Note that csv.DictReader returns every value as a string, which is why the ages come back as '19' and '20' rather than as numbers.

The sorted call is a bit clunky, but we can wrap it in a helper:

In [5]: def orderby(table, column): return sorted(table, key=itemgetter(column))
In [6]: [p['age'] for p in orderby(people, 'first') if p['last'] == 'Smith']
Out[6]: ['19', '20']
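
The same query can also be kept as a small standalone script. What follows is only a minimal sketch, not part of the session above: the helper names load, where, and select are made up for illustration, and the contents of people.csv are inlined through io.StringIO so the script runs without a file on disk.

```python
import csv
import io
from operator import itemgetter

# Inlined copy of people.csv so the sketch is self-contained.
PEOPLE_CSV = """first,last,age
John,Smith,20
Jane,Smith,19
Frank,Jones,30
"""

def load(text):
    """Read CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))

def orderby(table, column):
    """ORDER BY column (string comparison, since all values are strings)."""
    return sorted(table, key=itemgetter(column))

def where(table, column, value):
    """WHERE column = value (hypothetical helper)."""
    return [row for row in table if row[column] == value]

def select(table, column):
    """SELECT column (hypothetical helper)."""
    return [row[column] for row in table]

people = load(PEOPLE_CSV)
ages = select(where(orderby(people, 'first'), 'last', 'Smith'), 'age')
print(ages)  # ['19', '20']
```

The helpers are applied inside out: order first, then filter rows, then pick the column, which mirrors the order of operations in the list comprehension.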

You can even do group by clauses with a little help from itertools, although here you'll definitely want to define helper functions both for groupby and for the aggregates to apply to groups, and I think it still might be pushing the limits a bit:

In [7]: from itertools import groupby
In [8]: def ilen(iterable): return sum(1 for _ in iterable)
In [9]: def group(table, column): return groupby(table, itemgetter(column))
In [10]: [(k, ilen(g)) for k, g in group(people, 'last')]
Out[10]: [('Smith', 2), ('Jones', 1)]
In [11]: def glen(kg): return kg[0], sum(1 for _ in kg[1])
In [12]: [glen(g) for g in group(people, 'last')]
Out[12]: [('Smith', 2), ('Jones', 1)]
In [13]: def gsum(kg, column): return kg[0], sum(int(x[column]) for x in kg[1])
In [14]: [gsum(g, 'age') for g in group(people, 'last')]
Out[14]: [('Smith', 39), ('Jones', 30)]
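
One caveat about the session above: itertools.groupby only merges adjacent rows with equal keys, so the group helper yields one group per last name only because the two Smith rows happen to sit next to each other in people.csv. A hedged sketch of a variant that sorts before grouping follows; the name group_sorted and the reordered rows are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Same people as before, but with the Smith rows no longer adjacent.
people = [
    {'first': 'John', 'last': 'Smith', 'age': '20'},
    {'first': 'Frank', 'last': 'Jones', 'age': '30'},
    {'first': 'Jane', 'last': 'Smith', 'age': '19'},
]

def count(rows):
    """Count the items in an iterable without building a list."""
    return sum(1 for _ in rows)

def group_sorted(table, column):
    """Sort by the grouping column first, then group (hypothetical helper)."""
    key = itemgetter(column)
    return groupby(sorted(table, key=key), key)

# Grouping without sorting splits Smith into two separate groups.
unsorted_counts = [(k, count(g)) for k, g in groupby(people, itemgetter('last'))]
# Sorting first gives one group per last name, in sorted key order.
sorted_counts = [(k, count(g)) for k, g in group_sorted(people, 'last')]

print(unsorted_counts)  # [('Smith', 1), ('Jones', 1), ('Smith', 1)]
print(sorted_counts)    # [('Jones', 1), ('Smith', 2)]
```

The price is that groups come back in sorted key order rather than in file order.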

However, there are a few things to keep in mind:

  • It requires reading the whole file into memory.
  • There are no "indexes". With an indexed database column, selecting the 20 Smiths out of 100000 people takes on the order of log(100000)+20 steps; with a list, it takes 100000 steps.
  • You have to order the operations appropriately. When you want to order, then filter rows, then filter columns (as in the example above), everything is easy; if you want a different order (especially if you want to order or filter by columns you aren't selecting), you may need to write more complex comprehensions, while with a database there's no problem at all.
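
To make the point about indexes concrete, here is a rough sketch of a hand-built index over the list of dicts. It is an illustration under assumptions, not a substitute for a database index: build_index is a made-up helper, it uses a dict (hash lookup) rather than the sorted structure a database would typically use, and building it still costs one full pass over the data.

```python
from collections import defaultdict

people = [
    {'first': 'John', 'last': 'Smith', 'age': '20'},
    {'first': 'Jane', 'last': 'Smith', 'age': '19'},
    {'first': 'Frank', 'last': 'Jones', 'age': '30'},
]

def build_index(table, column):
    """Map each value of `column` to the list of rows that have it."""
    index = defaultdict(list)
    for row in table:
        index[row[column]].append(row)
    return dict(index)

# One full pass to build; afterwards each lookup avoids scanning the list.
by_last = build_index(people, 'last')
smith_ages = sorted(int(p['age']) for p in by_last.get('Smith', []))
print(smith_ages)  # [19, 20]
```

The index must be rebuilt or updated by hand whenever the list changes, which is part of what a database handles for you.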

Keep in mind that it takes only about 5 lines of code to convert a CSV file to a sqlite table. So, I think you'd be better off with a script that runs that 5-line Python program and then drops you into a sqlite command line.
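
A minimal sketch of such a conversion, using only the standard csv and sqlite3 modules, might look like this. The function name csv_to_table is hypothetical, the CSV is inlined for self-containment, and the database lives in memory; a real script would presumably connect to a file path instead so the result can be opened with the sqlite3 command-line shell. The header row is trusted as-is for column names, so this is only suitable for files you control.

```python
import csv
import io
import sqlite3

PEOPLE_CSV = """first,last,age
John,Smith,20
Jane,Smith,19
Frank,Jones,30
"""

def csv_to_table(conn, table_name, text):
    """Create an untyped table from the CSV header and insert every row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ', '.join('"%s"' % name for name in header)
    placeholders = ', '.join('?' for _ in header)
    conn.execute('CREATE TABLE "%s" (%s)' % (table_name, columns))
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table_name, placeholders), reader)
    conn.commit()

conn = sqlite3.connect(':memory:')  # assumption: in-memory, for illustration
csv_to_table(conn, 'people', PEOPLE_CSV)

# Column names are quoted because first and last are also SQL keywords.
ages = conn.execute(
    'SELECT age FROM people WHERE "last" = ? ORDER BY "first"',
    ('Smith',)).fetchall()
counts = conn.execute(
    'SELECT "last", COUNT(*) FROM people GROUP BY "last" ORDER BY "last"'
).fetchall()
print(ages)    # [('19',), ('20',)]
print(counts)  # [('Jones', 1), ('Smith', 2)]
```

Because the columns are created without types, the values stay text, just as they do with csv.DictReader.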

Since you tagged this with 'python': Python's pandas module provides a DataFrame object with the functionality that you seem to want here. Use pandas.read_csv() to read in the CSV file. A quick primer on DataFrames is provided here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
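
As a rough illustration of the pandas route, the sketch below runs the same select, where, order by, and group by operations on the people.csv data from the first answer. It assumes pandas is installed and inlines the CSV through io.StringIO; with a real file you would pass its path to pandas.read_csv() instead.

```python
import io

import pandas as pd

PEOPLE_CSV = """first,last,age
John,Smith,20
Jane,Smith,19
Frank,Jones,30
"""

# read_csv infers column types, so age becomes an integer column.
df = pd.read_csv(io.StringIO(PEOPLE_CSV))

# WHERE last = 'Smith' ORDER BY first, then SELECT age.
smith_ages = df[df['last'] == 'Smith'].sort_values('first')['age'].tolist()

# SELECT last, SUM(age) ... GROUP BY last.
age_by_last = df.groupby('last')['age'].sum().to_dict()

print(smith_ages)   # [19, 20]
print(age_by_last)  # {'Jones': 30, 'Smith': 39}
```

Unlike the csv module, pandas converts the age column to numbers on load, so no int() calls are needed for the aggregate.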
