
Parsing a CSV using python's toolz package

I recently came across the toolz repository and decided to give it a spin.

Unfortunately, I'm having some trouble using it properly, or at least understanding it.

My first simple task was to parse a tab-separated (TSV) file and get the second-column entry of each row.

For example, given the file foo.tsv:

a    b    c
d    e    f

I'd like to return a list of ['b', 'e']. I achieved that with the following piece of logic:

from toolz.curried import *

with open("foo.tsv", 'r') as f:
    data = pipe(f, map(str.rstrip),
                   map(str.split),
                   map(get(1)),
                   tuple)
    print(data)

However, if I change the foo.tsv file to use commas instead of tabs as the column delimiter, I cannot figure out the best way to adjust the above code to handle that. It's not clear to me how best to pass a "," argument to str.split while using map with either the pipe or thread_first functions.

Is there existing documentation that describes this?

Lambdas

Don't be afraid of using lambdas.

map(lambda s: s.split(','))

It may be a bit less pretty than map(str.split), but it gets the point across.
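As a minimal sketch of the lambda approach in a full pipeline, the snippet below uses io.StringIO as a stand-in for an open comma-separated file (the data is illustrative, not from a real file). It also shows operator.methodcaller from the standard library as one possible lambda-free alternative; that is my own suggestion rather than something toolz requires.

```python
import io
from operator import methodcaller

from toolz.curried import map, pipe, pluck

# io.StringIO stands in for an open file handle (illustrative data).
f = io.StringIO("a,b,c\nd,e,f\n")

# Option 1: a lambda that passes the delimiter to str.split.
with_lambda = pipe(f, map(lambda s: s.split(",")), pluck(1), tuple)

# Option 2: methodcaller("split", ",") builds an equivalent callable.
f.seek(0)  # rewind, because the first pipeline consumed the stream
with_methodcaller = pipe(f, map(methodcaller("split", ",")), pluck(1), tuple)

print(with_lambda)        # ('b', 'e')
print(with_methodcaller)  # ('b', 'e')
```

Both variants call s.split(",") on each line; which one reads better is largely a matter of taste.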

Use pluck

Consider using pluck(...) rather than map(get(...))

map(get(1)) -> pluck(1)
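A small sketch of that equivalence, using rows that have already been split (the rows list is illustrative data). It assumes the curried versions from toolz.curried, as in the question.

```python
from toolz.curried import get, map, pluck

# Rows as they would look after splitting each line (illustrative data).
rows = [["a", "b", "c"], ["d", "e", "f"]]

via_map_get = list(map(get(1))(rows))
via_pluck = list(pluck(1)(rows))

print(via_map_get)  # ['b', 'e']
print(via_pluck)    # ['b', 'e']

# pluck also accepts a list of indices and yields one tuple per row.
first_and_last = list(pluck([0, 2], rows))
print(first_and_last)  # [('a', 'c'), ('d', 'f')]
```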

Use Pandas

If you have a CSV file you might consider just using Pandas, which is very fast and highly optimized for this kind of work.
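For comparison, a rough sketch of the same task with Pandas, assuming the file has no header row. io.StringIO again stands in for the file and the data is illustrative.

```python
import io

import pandas as pd

# io.StringIO stands in for a comma-separated file (illustrative data).
buf = io.StringIO("a,b,c\nd,e,f\n")

# header=None tells read_csv that the first line is data, not column names,
# so the columns are labelled 0, 1, 2.
df = pd.read_csv(buf, header=None)

second_column = df[1].tolist()
print(second_column)  # ['b', 'e']
```

For the tab-separated version, passing sep="\t" to read_csv should be the only change needed.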

Based on MRocklin's answer above, my CSV parsing code using toolz should look more like:

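from toolz.curried import map, pipe, pluck

with open("foo.tsv", 'r') as f:
    # Note: "lambda (s): ..." is a syntax error in Python 3; write "lambda s: ...".
    data = pipe(f, map(lambda s: s.rstrip("\n")),
                   map(lambda s: s.split("\t")),
                   pluck(1),
                   tuple)
    print(data)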

Your version for the tsv file can be shortened to:

pipe(f, map(str.split), pluck(1), tuple)
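The rstrip step can be dropped because str.split with no arguments splits on any run of whitespace and discards leading and trailing whitespace, including the newline. With an explicit delimiter that is no longer true, as this plain-Python sketch shows (the sample lines are illustrative):

```python
tsv_line = "a\tb\tc\n"
csv_line = "a,b,c\n"

no_arg = tsv_line.split()        # splits on any whitespace, newline dropped
tab_arg = tsv_line.split("\t")   # explicit delimiter keeps the newline
comma_arg = csv_line.split(",")  # same for a comma delimiter

print(no_arg)     # ['a', 'b', 'c']
print(tab_arg)    # ['a', 'b', 'c\n']
print(comma_arg)  # ['a', 'b', 'c\n']
```

So once a delimiter is passed explicitly, a strip or rstrip step is needed whenever the selected field could carry the newline or surrounding spaces.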

To read a comma-separated file, use something like this:

pipe(f, map(lambda s: s.split(',')), pluck(1), map(str.strip), tuple)
