
Write key to separate CSV based on value in dictionary

[Using Python 3] I have a CSV file with two columns, an email address and a country code (the script also tries to reduce the input to two columns if the original file is not already in that shape, kind of). I want to split the rows by the value in the second column and write them to separate CSV files.

eppetj@desrfpkwpwmhdc.com       us      ==> output-us.csv
uheuyvhy@zyetccm.com            de      ==> output-de.csv
avpxhbdt@reywimmujbwm.com       es      ==> output-es.csv
gqcottyqmy@romeajpui.com        it      ==> output-it.csv
qscar@tpcptkfuaiod.com          fr      ==> output-fr.csv
qshxvlngi@oxnzjbdpvlwaem.com    gb      ==> output-gb.csv
vztybzbxqq@gahvg.com            us      ==> output-us.csv
...                             ...     ...

Currently my code kind of does this, but instead of writing each email address to the CSV, it overwrites the email written before it. Can someone help me out with this?

I am very new to programming and Python and I might not have written the code in the most pythonic way, so I would really appreciate any feedback on the code in general!

Thanks in advance!

Code:

import csv

def tsv_to_dict(filename):
    """Creates a reader of a specified .tsv file."""
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter='\t') # '\t' implies tab
        email_list = []
        # Checks each list in the reader list and removes empty elements
        for lst in reader:
            email_list.append([elem for elem in lst if elem != '']) # List comprehension
        # Stores the list of lists as a dict
        email_dict = dict(email_list)
    return email_dict

def count_keys(dictionary):
    """Counts the number of entries in a dictionary."""
    return len(dictionary.keys())

def clean_dict(dictionary):
    """Removes all whitespace in keys from specified dictionary."""
    return { k.strip():v for k,v in dictionary.items() } # Dictionary comprehension

def split_emails(dictionary):
    """Splits out all email addresses from dictionary into output csv files by country code."""
    # Creating a list of unique country codes
    cc_list = []
    for v in dictionary.values():
        if not v in cc_list:
            cc_list.append(v)

    # Writing the email addresses to a csv based on the cc (value) in dictionary
    for key, value in dictionary.items():
        for c in cc_list:
            if c == value:
                with open('output-' +str(c) +'.csv', 'w') as f_out:
                    writer = csv.writer(f_out, lineterminator='\r\n')
                    writer.writerow([key])

You can simplify this a lot by using a defaultdict:

import csv
from collections import defaultdict

emails = defaultdict(list)

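with open('email.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        # Skip blank lines and rows whose first field is not an email address
        if row and '@' in row[0]:
            # Group each address under its country code, one address per line
            emails[row[1].strip()].append(row[0].strip() + '\n')

for key, values in emails.items():
    with open('output-{}.csv'.format(key), 'w') as f:
        f.writelines(values)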

As your separated files are not comma separated, but single columns, you don't need the csv module and can simply write the rows.

The emails dictionary contains a key for each country code, and a list of all the matching email addresses. To make sure the email addresses are written correctly, we strip any surrounding whitespace and add a line break (this is so we can use writelines later).

Once the dictionary is populated, it's simply a matter of stepping through the keys to create the files and then writing out the resulting list.
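
To show just the grouping step on its own, here is a minimal sketch that uses a few in-memory rows instead of a file. The `rows` list and the variable names are assumptions made for illustration; the addresses are taken from the sample in the question.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the parsed TSV file.
rows = [
    ("eppetj@desrfpkwpwmhdc.com", "us"),
    ("uheuyvhy@zyetccm.com", "de"),
    ("vztybzbxqq@gahvg.com", "us"),
]

emails = defaultdict(list)
for address, country in rows:
    # Accessing a missing key creates an empty list for it first,
    # so no "is this country already present?" check is needed.
    emails[country].append(address)

for country, addresses in emails.items():
    print(country, addresses)
```

Each country code ends up with its own list, which is why the question's separate pass to collect unique country codes is no longer needed.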

The problem with your code is that it reopens the same country output file in 'w' mode each time it writes an entry. Opening a file for writing truncates it, so every write replaces whatever was already there.
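
To see the overwriting in isolation, here is a minimal sketch that reopens one file for writing on every line, the way the inner loop in the question does. The helper name, file name and sample addresses are made up for illustration, and it writes to a temporary directory so nothing is left behind.

```python
import os
import tempfile


def write_each_time(path, lines):
    """Reopen `path` in 'w' mode for every line, then return the file's lines."""
    for line in lines:
        # Mode 'w' truncates the file on every open.
        with open(path, "w") as f:
            f.write(line + "\n")
    with open(path) as f:
        return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output-us.csv")
    survivors = write_each_time(path, ["a@example.com", "b@example.com"])

print(survivors)  # only the last address written is still in the file
```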

A simple way to avoid that is to open all the output files at once for writing and store them in a dictionary keyed by the country code. Likewise, you can have another dictionary that associates each country code with a csv.writer object for that country's output file.

Update: While I agree that Burhan's approach is probably superior, I feel that you have the idea that my earlier answer was excessively long due to all the comments it had, so here's another version of essentially the same logic but with minimal comments to let you better discern its reasonably short true length (even with the contextmanager).

import csv
from contextlib import contextmanager

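@contextmanager  # to manage simultaneous opening and closing of output files
def open_country_csv_files(countries):
    # newline='' lets csv.writer control the line endings itself
    csv_files = {country: open('output-'+country+'.csv', 'w', newline='')
                   for country in countries}
    try:
        yield csv_files
    finally:  # close every file even if the with-block raises
        for f in csv_files.values():
            f.close()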

with open('email.tsv', 'r') as f:
    email_dict = {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if row}

countries = set(email_dict.values())
with open_country_csv_files(countries) as csv_files:
    csv_writers = {country: csv.writer(csv_files[country], lineterminator='\r\n')
                    for country in countries}
    for email_addr,country in email_dict.items():
        csv_writers[country].writerow([email_addr])
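
As a possible alternative to a hand-written context manager, the standard library's contextlib.ExitStack can keep track of a variable number of open files and close them all on exit. This is only a sketch under assumptions: the function name `split_by_country`, the `out_dir` parameter and the sample data are hypothetical and not part of the answer above.

```python
import csv
import os
import tempfile
from contextlib import ExitStack


def split_by_country(pairs, out_dir):
    """Write each address to output-<country>.csv inside out_dir."""
    countries = {country for _, country in pairs}
    with ExitStack() as stack:
        writers = {}
        for country in countries:
            path = os.path.join(out_dir, 'output-' + country + '.csv')
            # enter_context registers the file so ExitStack closes it on exit
            f = stack.enter_context(open(path, 'w', newline=''))
            writers[country] = csv.writer(f, lineterminator='\r\n')
        for address, country in pairs:
            writers[country].writerow([address])


# Hypothetical sample data, written to a temporary directory.
sample = [
    ('eppetj@desrfpkwpwmhdc.com', 'us'),
    ('uheuyvhy@zyetccm.com', 'de'),
    ('vztybzbxqq@gahvg.com', 'us'),
]

with tempfile.TemporaryDirectory() as tmp:
    split_by_country(sample, tmp)
    with open(os.path.join(tmp, 'output-us.csv'), newline='') as f:
        us_rows = [row[0] for row in csv.reader(f)]

print(us_rows)
```

Each file is opened exactly once, so earlier rows are never truncated, and the files are closed even if writing raises an exception.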

Not a Python answer, but maybe you can use this Bash solution.

while read -r email country
do
  echo "$email" >> "output-$country.csv"
done < in.csv

This reads the lines from in.csv, splits each one into two parts, email and country, and appends (>>) the email to the file called output-$country.csv.
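
The same append-based idea can be sketched in Python by opening each output file in 'a' (append) mode. The function name `append_by_country`, the `out_dir` parameter and the sample lines are assumptions for illustration. Unlike the Bash loop, this sketch skips lines that do not split into exactly two whitespace-separated fields.

```python
import os
import tempfile


def append_by_country(lines, out_dir):
    """Append each email to output-<country>.csv inside out_dir."""
    for line in lines:
        parts = line.split()  # split on any run of whitespace
        if len(parts) != 2:
            continue  # assumption: ignore blank or malformed lines
        email, country = parts
        path = os.path.join(out_dir, f"output-{country}.csv")
        # Mode 'a' adds to the end of the file instead of truncating it.
        with open(path, "a") as f:
            f.write(email + "\n")


# Hypothetical sample lines, written to a temporary directory.
sample_lines = [
    "eppetj@desrfpkwpwmhdc.com       us",
    "uheuyvhy@zyetccm.com            de",
    "vztybzbxqq@gahvg.com            us",
]

with tempfile.TemporaryDirectory() as tmp:
    append_by_country(sample_lines, tmp)
    with open(os.path.join(tmp, "output-us.csv")) as f:
        us_emails = f.read().splitlines()

print(us_emails)
```

As with the Bash version, running it a second time against the same output directory appends the addresses again, so existing output files may need to be removed first.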

