
Python: manipulating multiple CSV columns

I am quite new to Python. I am trying to process a CSV file with multiple columns; the first column is the server name and the rest of the columns contain information about the server.

Sample data:

Client Name,Job Duration,Job File Count,Throughput (KB/sec),Job Primary ID,Schedule/Level Type,Master Server,Media Server,Policy Name,Job Type,Job Attempt Count,Schedule Name,Protected Data Size(MB),Accelerator Enabled,Job Start Time,Accelerator Data Sent (MB),Accelerator Savings(MB),Accelerator Optimization %,Job End Time,Deduplication Enabled,Post Deduplication Size(MB),Deduplication Savings (MB),Total Optimization % (Accelerator + Deduplication),Job Status,Status Code,Policy Keyword,Storage Unit Name
 ambgsun39,00:12:00,0,0,37525,Full,MYPVLXBAKCLU,ambglx24,C1_F4_AD_SHS_COMPUTRON_DGLP_COLD,Backup,1,Monthly_Full,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:12:00 AM",No,0,0,0,Successful,0,-,stu_PDC99002_IP_ambglx24
ambglx21,00:03:02,0,0,37527,Full,MYPVLXBAKCLU,ambglx21,C2_F6_AM_REB_CFS,Backup,1,UNKNOWN,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:03:02 AM",No,0,0,0,Successful,0,-,UNKNOWN
ambglx21,00:03:42,0,0,37528,Full,MYPVLXBAKCLU,ambglx21,C2_F6_AM_REB_CFS_DB,Backup,1,UNKNOWN,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:03:42 AM",No,0,0,0,Successful,0,-,UNKNOWN
ambgsun39,00:11:02,1,"95,543",37531,User backup,MYPVLXBAKCLU,ambglx24,C1_F4_AD_SHS_COMPUTRON_DGLP_COLD,Backup,1,Default-Application-Backup,"60,834.78",No,"Aug 1, 2015 3:00:24 AM",-,0,0,"Aug 1, 2015 3:11:26 AM",No,"60,834.78",0,0,Successful,0,-,stu_PDC99002_IP_ambglx24
dvmpwin040,00:01:41,"170,305","336,398",37532,Full,MYPVLXBAKCLU,ambglx21,C2_F2_AM_SHS_FTP,Backup,1,Daily_Full,"29,894.78",Yes,"Aug 1, 2015 3:00:25 AM","1,494.74","28,400.04",95,"Aug 1, 2015 3:02:06 AM",No,"29,894.78",0,0,Successful,0,-,stu_PDC99001_IP_ambglx21
dvmpwin048,00:04:57,"44,133","515,413",37535,Full,MYPVLXBAKCLU,ambglx21,C2_F2_AM_SHS_Crystal_Reports,Backup,1,Daily_Full,"145,440.72",Yes,"Aug 1, 2015 3:00:35 AM","5,817.63","139,623.09",96,"Aug 1, 2015 3:05:32 AM",No,"1

There are multiple entries for the same server. I need to extract the columns Job Duration, Job File Count, Throughput and Protected Data Size, and compute the average of each of these columns for each unique server name.

Desired end state:

Client Name, Average Job Duration, Average Job File Count, Average Throughput, Average Protected Data Size
ambglx21, 00:10:00, 25000, 50000, 25000

I am able to figure out only part of it.

import csv
from collections import defaultdict


csv_data = defaultdict(list)

for i, row in enumerate(csv.reader(open('data.csv', 'rt'))):
    if not i or not row:  # skip the header row and any blank lines
        continue
    (client_name, job_duration, job_file_count, throughput, job_primary_id,
     schedule, master_server, media_server, policy_name, job_type,
     job_attempt_count, schedule_name, protected_data_size, accelerator_enabled,
     job_start_time, accelerator_data_sent, accelerator_savings,
     accelerator_optimisation, job_end_time, deduplication_enabled,
     post_deduplication_size, deduplication_savings, total_optimisation,
     job_status, status_code, policy_keyword, storage_unit_name) = row
    throughput = int(throughput.replace(',', ''))                      # e.g. "95,543" -> 95543
    protected_data_size = float(protected_data_size.replace(',', ''))  # e.g. "60,834.78" -> 60834.78
    csv_data[client_name].append(throughput)
    #csv_data[client_name].append(job_duration)
    #csv_data[client_name].append(protected_data_size)

for client_name, throughputs in csv_data.items():
    # average throughput per client, converted from KB/sec to MB/sec
    average_throughput = int(sum(throughputs) / len(throughputs) / 1024)
    #protected_data = int(sum(protected_data) / len(protected_data) / 1024)
    print(client_name, average_throughput)

I am only able to get the throughputs using a dictionary. I am not sure how to append the rest of the data and process it.

Current script output:

bvmpwin017 1145
ambgjmp01 3620
ambglx22 8

Thanks a lot for your help, any insight is really appreciated.

I don't use csv often but this looked like a nice challenge. I think you described your problem pretty well but just weren't able to get all the data you needed out of the csv file. The times and some other issues were not trivial. I hope this shows you a way to approach it.

By the way, the custom interpretation of data here should probably be done through some custom use of the csv module but I don't have experience with that and didn't see how to use it. Perhaps someone else can show how to do it.
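One thing along those lines that might cover part of it is csv.DictReader, which hands back each row as a dict keyed by the header names, so the header/zip bookkeeping goes away; the per-column conversions would still be manual. A rough, untested sketch, assuming the same data.csv and the header names from the sample above:

import csv

# Rough sketch only: DictReader keys each row by the names in the header row.
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        client = row['Client Name'].strip()
        throughput = float(row['Throughput (KB/sec)'].replace(',', ''))
        print(client, throughput)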

The comments in the code will hopefully explain how it works.

import csv
import datetime


# tell the application how to interpret and do averages with special column types
def get_elapsed(s):
    h, m, s = s.split(':')
    delta = datetime.timedelta(days=0, hours=int(h), minutes=int(m), seconds=int(s))
    return delta.total_seconds()


def get_int(s):
    return int(s.replace(',', ''))


def get_float(s):
    return float(s.replace(',', ''))


def numerical_average(values):
    return float(sum(values))/max(len(values), 1)


def elapsed_average(elapsed_times_s):
    average_elapsed_s = numerical_average(elapsed_times_s)
    delta = datetime.timedelta(seconds=average_elapsed_s)
    return delta


CONVERTER = 'converter'
AVERAGE = 'average'
HEADER_TO_TOOLS = {'Job Duration': {CONVERTER: get_elapsed,
                                    AVERAGE: elapsed_average},
                   'Job File Count': {CONVERTER: get_int,
                                      AVERAGE: numerical_average},
                   'Throughput (KB/sec)': {CONVERTER: get_float,
                                           AVERAGE: numerical_average},
                   'Protected Data Size(MB)': {CONVERTER: get_float,
                                               AVERAGE: numerical_average}}


def interpret_string(header, s):
    tools = HEADER_TO_TOOLS.get(header)
    if tools:
        return tools[CONVERTER](s)
    return s  # don't interpret if no interpreter exists


# collect all data as: {client_name: {header1: list_of_values, header2: list_of_values}}
data_dict = {}
with open('data.csv') as f:
    reader = csv.reader(f)
    headers = tuple(x.strip() for x in next(reader))  # first row
    for row in reader:
        client_name = row[0].strip()  # strip whitespace; some rows have a leading space before the client name
        this_client_data = data_dict.setdefault(client_name, {header: [] for header in headers})
        for header, s in zip(headers, row):
            s = s.strip()
            this_client_data[header].append(interpret_string(header, s))

# print the results
output_headers = ['Client Name', 'Job Duration', 'Job File Count', 'Throughput (KB/sec)', 'Protected Data Size(MB)']
# print the headers first
print(', '.join(output_headers))
# print the client name and averages for each client
for server_name, server_data in data_dict.items():
    print_items = []
    for output_header in output_headers:
        header_values = server_data[output_header]
        tools = HEADER_TO_TOOLS.get(output_header)
        if tools:
            print_items.append(str(tools[AVERAGE](header_values)))
        else:
            print_items.append(header_values[0])  # they should all be the same if not numerical
    print(', '.join(print_items))

Result:

Client Name, Job Duration, Job File Count, Throughput (KB/sec), Protected Data Size(MB)
ambglx21, 0:03:22, 0.0, 0.0, 0.0
dvmpwin048, 0:04:57, 44133.0, 515413.0, 145440.72
ambgsun39, 0:11:31, 0.5, 47771.5, 30417.39
dvmpwin040, 0:01:41, 170305.0, 336398.0, 29894.78
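If you want the output in exactly the layout from your question (a CSV with "Average ..." column headers) rather than printed lines, the same data_dict can be written out with csv.writer. A rough sketch that reuses data_dict, HEADER_TO_TOOLS and AVERAGE from the script above; 'averages.csv' is just an example output file name:

value_headers = ['Job Duration', 'Job File Count', 'Throughput (KB/sec)', 'Protected Data Size(MB)']
with open('averages.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    # header row in the requested "Average ..." form
    writer.writerow(['Client Name'] + ['Average ' + h for h in value_headers])
    for client_name, client_data in data_dict.items():
        # one averaged value per column, using the same AVERAGE functions as above
        averages = [str(HEADER_TO_TOOLS[h][AVERAGE](client_data[h])) for h in value_headers]
        writer.writerow([client_name] + averages)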
