Remove duplicates from a list built from a tab-delimited file and classify further

Question

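I have a tab-delimited file from which I need to extract all of the column 12 content (which holds the document categories). However, the column 12 content is highly repetitive, so first I need a list containing each category only once (by removing the repeats), which also gives me the number of categories. Then I need to find a way to get the number of lines per category. My attempt is as follows: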

def remove_duplicates(l): # define function to remove duplicates
    return list(set(l))

input = sys.argv[1] # command line arguments to open tab file
infile = open(input)
for lines in infile: # split content into lines
    words = lines.split("\t") # split lines into words i.e. columns
    dataB2.append(words[11]) # column 12 contains the desired repetitive categories
    dataB2 = dataA.sort() # sort the categories
    dataB2 = remove_duplicates(dataA) # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
    print(len(dataB2))
infile.close()

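I don't know how to get the number of lines per category. So my questions are: how do I remove the duplicates efficiently, and how do I get the number of lines for each category?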

Answer 1

I suggest using Python's Counter for this. A Counter does almost exactly what you are asking for, so your code would look like this:

from collections import Counter
import sys

count = Counter()

# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for line in infile:  # iterate over the file line by line
        # strip the newline in case column 12 is the last column, then split into columns
        words = line.rstrip("\n").split("\t")
        count.update([words[11]])  # column 12 holds the category

print(count)  # maps each category to its number of lines
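
To cover both parts of the question with a single Counter, here is a minimal sketch. The helper name count_categories and the sample rows are hypothetical and only for illustration; it assumes every row has at least 12 tab-separated fields.

```python
from collections import Counter

def count_categories(lines, column=11):
    """Count how often each value appears in the given 0-based column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[fields[column]] += 1
    return counts

# Hypothetical sample: three rows whose 12th column holds the category.
rows = ["\t".join(["x"] * 11 + [cat]) + "\n" for cat in ("news", "blog", "news")]
counts = count_categories(rows)

print(len(counts))  # number of distinct categories
for category, n in counts.most_common():
    print(category, n)  # lines per category, most frequent first
```

len(counts) answers "how many categories are there", and the loop over most_common() answers "how many lines per category", so no separate de-duplication step is needed.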

Answer 2

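All you need to do is read each line from the file, split it on tabs, grab column 12 from each line, and put it in a list. (If you don't care about the duplicate lines, just make column_12 = set() and use add(item) instead of append(item).) Then you simply use len() to get the size of the collection. Or, if you want both, you can use a list and convert it to a set later.

Edit: to count each category (thanks to Tom Morris for alerting me that I had not actually answered the question), iterate over the set of column_12 so that nothing is counted more than once, and use the list's built-in count() method.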

import sys

# Read column 12 (index 11) from every line of the file named on the command line.
with open(sys.argv[1], 'r') as fob:
    column_12 = []
    for line in fob:
        column_12.append(line.rstrip('\n').split('\t')[11])

print('Unique lines in column 12 %d' % len(set(column_12)))
print('All lines in column 12 %d' % len(column_12))
print('Count per category:')
for cat in set(column_12):
    print('%s - %d' % (cat, column_12.count(cat)))
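
If only the number of distinct categories matters, the set-based variant described above could look like the sketch below. The function name unique_categories and the in-memory sample are hypothetical; with a real file you would pass the open file object instead.

```python
import io

def unique_categories(lines, column=11):
    """Collect the distinct values of the given 0-based column."""
    seen = set()
    for line in lines:
        seen.add(line.rstrip("\n").split("\t")[column])
    return seen

# Hypothetical in-memory stand-in for the tab-delimited file.
sample = io.StringIO(
    "\t".join(["x"] * 11 + ["news"]) + "\n"
    + "\t".join(["x"] * 11 + ["blog"]) + "\n"
    + "\t".join(["x"] * 11 + ["news"]) + "\n"
)
categories = unique_categories(sample)
print(len(categories))  # number of distinct categories
```

Keep in mind that calling column_12.count(cat) once per category rescans the whole list each time, so for large files with many categories the Counter approach from Answer 1, which makes a single pass, is likely to be faster.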
