Removing duplicates from a csv

Question

I have a csv/txt file with the following content:

Mumbai 2
Pune 6
Bangalore 8
Pune 10
Mumbai 8

I want this in the output file:

Mumbai 2,8
Pune 6,10
Bangalore 8

Note: do not use any Python modules or packages.

Answer 1

Here is one possible solution:

import re

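# Verbose pattern: group 1 is the city name, group 2 the number;
# group 3 catches any line that does not fit that format.
# A raw string keeps `\s` and `\S` from being read as string escapes.
linepat = re.compile(r'''

  ^ \s*
  (?:
    (
      [A-Za-z] \S*                  # first word of the city name
      (?: \s+ [A-Za-z] \S* )*       # further words, each starting with a letter
    ) \s+ ( [0-9]+ )                # whitespace, then the number
    \s* $
  )
  |
  (.*)                              # fallback: blank or malformed line

''', re.VERBOSE)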

filtered = {}

# fill `filtered` from `duplicates.csv`
with open('duplicates.csv', 'r') as f:
  for lnum, line in enumerate(f, start=1):
    city, number, invalid = linepat.match(line).groups()
    if not city:
      invalid = invalid.strip()
      if invalid:
        raise Exception(f'line {lnum} has a wrong format:\n{line}')
    else:
      city = ' '.join(city.split())
      if city not in filtered:
        filtered[city] = set()
      filtered[city].add(int(number))

# write `filtered` to `without_duplicates.csv`
with open('without_duplicates.csv', 'w') as f:
  for city, numbers in filtered.items():
    numbers = ','.join(str(num) for num in sorted(numbers))
    f.write(f'{city} {numbers}\n')

# Mumbai 2
# Pune 6
# New York 15
#
# Bangalore 8
# Pune 10
# Mumbai 8
# New York 1
#
# -->
#
# Mumbai 2,8
# Pune 6,10
# New York 1,15
# Bangalore 8

Your example does not make clear how the numbers on each output line should be ordered. If you want them in order of first appearance in the input file, use a list instead of a set and do citynumbers = filtered[city]; if number not in citynumbers: citynumbers.append(number), then leave out the sorted() call when writing them.
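
The list-based variant could look roughly like the sketch below. The function name group_preserving_order and the input as ready-made (city, number) pairs are assumptions made for illustration; they are not part of the code above.

```python
def group_preserving_order(pairs):
    """Collect numbers per city in first-appearance order, dropping repeats."""
    filtered = {}
    for city, number in pairs:
        # Create the list on first sight of a city, then reuse it.
        citynumbers = filtered.setdefault(city, [])
        if number not in citynumbers:
            citynumbers.append(number)
    return filtered

# Hypothetical sample data; the repeated ('Pune', 6) is dropped.
pairs = [('Mumbai', 2), ('Pune', 6), ('Bangalore', 8),
         ('Pune', 10), ('Mumbai', 8), ('Pune', 6)]
result = group_preserving_order(pairs)

for city, numbers in result.items():
    # No sorted() here: the numbers stay in input order.
    print(city, ','.join(str(num) for num in numbers))
```

The membership test on a list is linear in the number of entries per city, which should be fine for small files; a set is faster but does not remember insertion order.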

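The whitespace that separates a city name from its number can also be part of the city name. The regular expression therefore requires every part of the city name to start with [a-zA-Z]. It would be cleaner to require that spaces in city names be replaced or escaped.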
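
To see how that rule plays out on multi-word names, here is a small sketch with a compact, non-verbose form of the same pattern. The helper split_line is a hypothetical name introduced only for this example, and it returns None for non-matching lines instead of raising.

```python
import re

# Compact form of the city/number pattern: every word of the city name
# must start with a letter, and the line must end with a number.
linepat = re.compile(r'^\s*([A-Za-z]\S*(?:\s+[A-Za-z]\S*)*)\s+([0-9]+)\s*$')

def split_line(line):
    """Return (city, number) for a well-formed line, else None."""
    m = linepat.match(line)
    if m is None:
        return None
    city, number = m.groups()
    # Collapse runs of whitespace inside the city name.
    return ' '.join(city.split()), int(number)

print(split_line('New York 15\n'))  # ('New York', 15)
print(split_line('Pune   6'))       # ('Pune', 6)
print(split_line('7 Hills 3'))      # None: the name starts with a digit
```

The last call shows the limitation: a city name with a word that begins with a digit is rejected, which is why escaping or replacing the spaces would be the cleaner input format.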

The filtered dictionary in the code example could also be a defaultdict(set).
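
A minimal sketch of that change, using hypothetical sample pairs instead of reading a file: with collections.defaultdict the explicit "if city not in filtered" check goes away, because a missing key gets an empty set automatically.

```python
from collections import defaultdict

# Missing cities are created with an empty set on first access.
filtered = defaultdict(set)

# Hypothetical sample data standing in for the parsed file lines.
for city, number in [('Mumbai', 2), ('Pune', 6), ('Pune', 10),
                     ('Mumbai', 8), ('Pune', 6)]:
    filtered[city].add(number)

for city, numbers in filtered.items():
    print(city, ','.join(str(num) for num in sorted(numbers)))
```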

For many use cases, the csv module is the simpler approach.
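
As a rough sketch of what that might look like for the sample data in the question, assuming single-word city names separated from the number by spaces (a name such as "New York" would be split into two fields and break the unpacking). An in-memory io.StringIO stands in for the real input file here.

```python
import csv
import io

# In-memory stand-in for the input file from the question.
sample = io.StringIO('Mumbai 2\nPune 6\nBangalore 8\nPune 10\nMumbai 8\n')

filtered = {}
# delimiter=' ' splits on spaces; skipinitialspace tolerates repeated spaces.
for row in csv.reader(sample, delimiter=' ', skipinitialspace=True):
    if not row:          # blank lines come back as empty lists
        continue
    city, number = row   # assumes exactly two fields per line
    filtered.setdefault(city, set()).add(int(number))

lines = []
for city, numbers in filtered.items():
    joined = ','.join(str(num) for num in sorted(numbers))
    lines.append(f'{city} {joined}')

print('\n'.join(lines))
```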
