
Python 2.7, comparing 3 columns of a CSV

What is the simplest way to iterate through a large CSV file in Python 2.7 while comparing 3 columns?

I am a total beginner and have only completed a few online courses. I have managed to use the CSV reader to do some basic stats on the CSV file, but nothing that compares groups with each other.

The data is roughly set up as follows:

Group   sub-group   processed
1           a       y
1           a       y
1           a       y
1           b           
1           b
1           b
1           c       y
1           c       y
1           c
2           d       y
2           d       y
2           d       y
2           e       y
2           e
2           e
2           f       y
2           f       y
2           f       y
3           g
3           g
3           g
3           h       y
3           h
3           h

Everything belongs to a group, and within each group there are sub-groups of 3 rows (replicates). As we work through the samples we will be adding to the processed column, but we don't always do the full complement, so sometimes only 1 or 2 of the potential 3 will be processed.

I'm trying to work towards a statistic showing the % completeness of each group, with a sub-group counting as "complete" if it has at least 1 row processed (it doesn't have to have all 3).

I've managed to get halfway there by using the following:

for row in reader:
    group, subgroup, processed = row
    # Count every row per group, and every processed row per group
    all_groups[group] = all_groups.get(group, 0) + 1
    if processed != "":
        processed_groups[group] = processed_groups.get(group, 0) + 1

result = {}
for group in (processed_groups.viewkeys() | all_groups.viewkeys()):
    if group in processed_groups:
        result.setdefault(group, []).append(processed_groups[group])
        result.setdefault(group, []).append(all_groups[group])

for group, counts in result.items():
    done = float(counts[0])   # processed rows
    total = float(counts[1])  # all rows
    progress = round(100 * done / total, 2)
    print group, "--", progress, "%"

The problem with the above code is that it counts rows rather than sub-groups, so a sub-group with only 1 or 2 of its 3 rows processed drags the percentage down. As a result, the statistic will never read 100% unless every row in the processed column is filled in.

What I get:
Group 1 -- 55.56%
Group 2 -- 77.78%
Group 3 -- 16.67%

What I want:
Group 1 -- 66.67%
Group 2 -- 100%
Group 3 -- 50%

How would you make it check only whether each sub-group has a processed row, and use just that, before continuing on to the next sub-group?

One way to do this is with a couple of defaultdicts of sets. The first keeps track of all of the sub-groups seen in each group, and the second keeps track of the sub-groups that have at least one processed row. Using a set simplifies the code because adding the same sub-group more than once has no effect, and using a defaultdict avoids having to create the set for each new group by hand (a standard dictionary would also work, with a little more code).

import csv
from collections import defaultdict

# group -> set of every sub-group seen in the file
subgroups = defaultdict(set)
# group -> set of sub-groups with at least one processed row
processed_subgroups = defaultdict(set)

with open('data.csv') as csvfile:
    for group, subgroup, processed in csv.reader(csvfile):
        subgroups[group].add(subgroup)
        if processed == 'y':
            processed_subgroups[group].add(subgroup)

# Only groups with at least one processed sub-group are reported
for group in sorted(processed_subgroups):
    percent = len(processed_subgroups[group]) / float(len(subgroups[group])) * 100
    print("Group {} -- {:.2f}%".format(group, percent))

Output

Group 1 -- 66.67%
Group 2 -- 100.00%
Group 3 -- 50.00%
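
For comparison, here is a rough sketch of the same idea using standard dictionaries and `setdefault` instead of `defaultdict`. The function name `completeness` and the in-memory `rows` list are made up for illustration; in practice the rows would come from `csv.reader`, with any header row skipped first. Unlike the loop above, this version also reports groups that have no processed sub-groups at all (as 0%).

```python
def completeness(rows):
    """Return {group: percent of sub-groups with at least one processed row}."""
    subgroups = {}            # group -> set of all sub-groups seen
    processed_subgroups = {}  # group -> set of sub-groups with a 'y' row
    for group, subgroup, processed in rows:
        # setdefault creates the empty set the first time a group appears
        subgroups.setdefault(group, set()).add(subgroup)
        if processed == 'y':
            processed_subgroups.setdefault(group, set()).add(subgroup)

    result = {}
    for group, seen in subgroups.items():
        done = processed_subgroups.get(group, set())
        result[group] = 100.0 * len(done) / len(seen)
    return result


# Hypothetical sample rows mirroring group 3 from the question
rows = [
    ('3', 'g', ''), ('3', 'g', ''), ('3', 'g', ''),
    ('3', 'h', 'y'), ('3', 'h', ''), ('3', 'h', ''),
]

for group, percent in sorted(completeness(rows).items()):
    print("Group {} -- {:.2f}%".format(group, percent))
# prints: Group 3 -- 50.00%
```

The only real difference is that each dictionary lookup has to say what to do when the group is missing, which is the bookkeeping `defaultdict(set)` does automatically.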
