
Is there a module in Python that does something like “sqldf” for R?

List comprehensions are very good, but some kind of "... join ..." would be very useful. For example, I have a set A = {1, 0} and a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A. Or, more generally: I have 2 CSV files, and I want to find all the rows where the values of some column from the two files match, like a 'join' of the two files. One of the files is gigabytes in size. sqldf is "SQL select on R data frames." Thanks.

You can use pandasql, which allows for SQL style querying of pandas DataFrames. It's very similar to sqldf.

https://github.com/yhat/pandasql/

(full disclosure: I wrote it)

EDIT: a blog post documenting some of the features can be found here: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
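For readers who want to see the underlying idea without installing anything, here is a minimal sketch using only the standard-library sqlite3 module (pandasql queries use SQLite syntax). This is not pandasql's API; the table names `a` and `b` and the column names `val`, `c1`, `c2` are made up for illustration, and the data is the small example from the question.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (val INTEGER)")
conn.execute("CREATE TABLE b (c1 INTEGER, c2 INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (0,)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(1, 1), (2, 3)])

# Rows of b whose second column matches a value in a.
matches = conn.execute(
    "SELECT b.c1, b.c2 FROM b JOIN a ON b.c2 = a.val ORDER BY b.c1"
).fetchall()
conn.close()
print(matches)  # [(1, 1)]
```

Loading data into SQLite first costs some setup, but after that any join or filter is a plain SQL statement.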

I'm unaware of a library doing what you ask (though I only glanced at the sqldf documentation). However, nothing you asked for really requires a library: both tasks are one-liners in Python, and you could of course wrap the logic in a function rather than using a bare list comprehension.

Set A= {1,0}, a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A.

>>> a = set([1, 0])
>>> b = [[1,1],[2,3]]
>>> [l for l in b if l[1] in a]
[[1, 1]]
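If you want that as a reusable function, one possible shape is sketched below. The name `rows_where_in` and its parameters are hypothetical, chosen only for this example.

```python
def rows_where_in(rows, col, allowed):
    """Return the rows whose value at index `col` is in `allowed`."""
    allowed = set(allowed)  # set membership tests are O(1) on average
    return [row for row in rows if row[col] in allowed]

a = {1, 0}
b = [[1, 1], [2, 3]]
result = rows_where_in(b, 1, a)
print(result)  # [[1, 1]]
```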

I have 2 CSV files. I want to find out all the rows where the values of some column from the two files match.

>>> f1 = [[1, 2, 3], [4, 5, 6]]
>>> f2 = [[0, 2, 8], [7, 7, 7]]
>>> [tuple_ for tuple_ in zip(f1, f2) if tuple_[0][1] == tuple_[1][1]]
[([1, 2, 3], [0, 2, 8])]
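Note that zip pairs rows by position, so the comprehension above only finds matches that sit on the same line number in both inputs. A join on values regardless of position can be sketched by indexing one side in a dict and scanning the other. The helper name `join_on` and its parameters are assumptions made for this example, and only the indexed side has to fit in memory.

```python
from collections import defaultdict

def join_on(left, right, left_col, right_col):
    """Yield (left_row, right_row) pairs whose key columns are equal."""
    index = defaultdict(list)
    for row in left:
        index[row[left_col]].append(row)
    # Scan the other side once, looking each key up in the index.
    for row in right:
        for match in index.get(row[right_col], ()):
            yield match, row

f1 = [[1, 2, 3], [4, 5, 6]]
f2 = [[7, 7, 7], [0, 2, 8]]  # the matching row is at a different position
pairs = list(join_on(f1, f2, 1, 1))
print(pairs)  # [([1, 2, 3], [0, 2, 8])]
```

With one file in the gigabyte range, index the smaller file and pass the larger one as `right`, so the large side is only iterated.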

EDIT: If memory usage is a problem, you should use generators instead of lists. For example, in Python 2 zip builds the whole list in memory:

>>> zip(f1, f2)
[([1, 2, 3], [0, 2, 8]), ([4, 5, 6], [7, 7, 7])]

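but itertools.izip yields one pair at a time (izip exists only in Python 2; in Python 3 the built-in zip is already lazy):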

>>> import itertools as it
>>> gen = it.izip(f1, f2)
>>> gen
<itertools.izip object at 0x1f24ab8>
>>> next(gen)
([1, 2, 3], [0, 2, 8])
>>> next(gen)
([4, 5, 6], [7, 7, 7])

The same applies to the data source. This list comprehension:

>>> [line for line in f1]
[[1, 2, 3], [4, 5, 6]]

translates to a generator expression as:

>>> gen = (line for line in f1)
>>> gen
<generator object <genexpr> at 0x1f159b0>
>>> next(gen)
[1, 2, 3]
>>> next(gen)
[4, 5, 6]
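Applied to the CSV case, the same lazy approach might look like the sketch below. It assumes the smaller file's key column fits in memory as a set, and it uses io.StringIO objects as stand-ins for real files opened with open(..., newline=""). The function name `matching_rows` is hypothetical.

```python
import csv
import io

def matching_rows(small_file, big_file, small_col, big_col):
    """Yield rows of big_file whose key appears in small_file."""
    keys = {row[small_col] for row in csv.reader(small_file)}
    for row in csv.reader(big_file):  # read lazily, one row at a time
        if row[big_col] in keys:
            yield row

# io.StringIO stands in for real file objects here.
small = io.StringIO("1,2,3\n4,5,6\n")
big = io.StringIO("0,2,8\n7,7,7\n9,5,1\n")
result = list(matching_rows(small, big, 1, 1))
print(result)  # [['0', '2', '8'], ['9', '5', '1']]
```

csv.reader returns every field as a string, so convert the key columns yourself if you need numeric comparison.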

Before you can have the functionality of sqldf, you need the functionality of 'df', i.e. data frames. Python has a cuddly version, pandas:

http://pandas.sourceforge.net/

Perhaps the section on joining and merging will help:

http://pandas.sourceforge.net/merging.html

I recommend you start with something smaller than your gigabyte files though!

There is a package available now which does exactly this! Check the link below:

pysqldf => https://pypi.org/project/pysqldf/

This package allows you to query pandas DataFrames using SQL, just as sqldf does in R.
