简体   繁体   中英

First row from a CSV file is a list of one string while the remaining rows are list of elements in Python

I am a newbie and trying to perform some calculations and analyse some data from a huge number of CSV files using Python 3. Each CSV file has data stored in about 30000 rows seperated by a semicolon. As these are system generated files, the first row has the names of Parameters in a different order for each file. So I want to extract these Parameter names, compare them and find out the positions of required parameter values for every file.

But before I even do that, when I read the first row of the CSV file, the list has a single string seperated by the ';' while the rest of the lines are properly extracted. I have tried splitting the text and other means but none of them have any effect on the output. Can someone kindly help solve this issue.

This is the piece of code I have written:

import os 
import sys
import csv
import math      
# Open a file
full_path = "C:\\Documents and Settings\\My Documents\\CSV files"
dirs = os.listdir( full_path )
filename = list()

"""This will find the file in the directory and print the first row of the 
CSV file - the names of the parameters."""
for file in dirs:
    path = full_path+'\\'+file
    with open(path, 'rt') as csvfile:
        #For printing the location name   
        Turbine_name = (os.path.splitext(file)[0])
        #Reading the files into an object
        csvreader = csv.reader(csvfile, delimiter=';')
        #Obtaining the first line of the CSV file - the names of the parameters
        Parameters = next(csvreader)
        print (Parameters)
        #From this line, the values of different Parameters are extracted
        for row in csvreader:
            Parameters = next(csvreader)
            print (Parameters)
            #To eliminate rows with insufficient data
            if len(Parameters)>11:            
                Created_time = Parameters[0]
                Grid_frequency = float(Parameters[3])
                Grid_KW = float(Parameters[4])
                Rotor_RPM = float(Parameters[10])

The positions of the parameters are taken for example here as I am yet to write the code for comparing strings. A sample of the input file looks like this:

createTime;"Grid CosPhi";"Grid Current";"Grid Frequency";"Grid kW";"Grid   VAr";"Grid Voltage";"Pitch angle 1";"Pitch angle 2";"Pitch angle 3";"Rotor RPM";"Temp.  5    214";"Temp.  6    217";"Temp.  9    227";"Winddirection";"Windspeed"
9/21/14 11:30:01 PM;N/A;N/A;49.963;211688.734;N/A;N/A;-1.06;-1.039;-1.119;19.379;47.167;36;64;N/A;6.319
9/21/14 11:40:01 PM;N/A;N/A;50.002;170096.297;N/A;N/A;-1.003;-0.96;-1.058;19.446;47.182;36.084;63.772;N/A;5.628
9/21/14 11:50:01 PM;N/A;N/A;50.021;175038.734;N/A;N/A;-0.976;-0.961;-1.082;18.805;47;36.223;63.153;N/A;5.577
9/22/14 12:00:01 AM;N/A;N/A;49.964;229942.016;N/A;N/A;-1.047;-1.018;-1.066;18.957;47.125;36.293;63.766;N/A;6.431
9/22/14 12:10:01 AM;N/A;N/A;49.908;200873.844;N/A;N/A;-0.997;-0.985;-1.06;19.229;47.028;36.334;63.962;N/A;6.076
9/22/14 12:20:01 AM;N/A;N/A;49.934;234467.609;N/A;N/A;-1.028;-0.986;-1.001;18.995;47.056;36.401;63.732;N/A;6.067
9/22/14 12:30:01 AM;N/A;N/A;49.96;270969.25;N/A;N/A;-1.138;-1.103;-1.122;18.983;47.274;36.499;64.014;N/A;6.487
9/23/14 12:30:01 PM;N/A;N/A;50.073;54458.719;N/A;N/A;-0.492;-0.52;-0.472;13;42.02;33.624;46.426;N/A;3.757

And when I get the output it looks like this:

>>>['createTime;"Grid CosPhi";"Grid Current";"Grid Frequency";"Grid kW";"Grid VAr";"Grid Voltage";"Pitch angle 1";"Pitch angle 2";"Pitch angle 3";"Rotor RPM";"Temp.  5    214";"Temp.  6    217";"Temp.  9    227";"Winddirection";"Windspeed"']
   ['9/21/14 11:40:01 PM', 'N/A', 'N/A', '50.002', '170096.297', 'N/A', 'N/A', '-1.003', '-0.96', '-1.058', '19.446', '47.182', '36.084', '63.772', 'N/A', '5.628']
   ['9/22/14 12:00:01 AM', 'N/A', 'N/A', '49.964', '229942.016', 'N/A', 'N/A', '-1.047', '-1.018', '-1.066', '18.957', '47.125', '36.293', '63.766', 'N/A', '6.431']
   ['9/22/14 12:20:01 AM', 'N/A', 'N/A', '49.934', '234467.609', 'N/A', 'N/A', '-1.028', '-0.986', '-1.001', '18.995', '47.056', '36.401', '63.732', 'N/A', '6.067']

As observed in some of the files, some of the Parameters are completely missing. This is why I need to find out positions of individual parameters in each CSV file. Any ideas how it can be done best are also appreciated. Thanks in advance.

Edit: Unfortunately I cannot use Pandas as it has to be seperately installed in Python34 and my office system (Windows XP, P4) does not support it. I want to do this with CSV module if possible.

I would suggest that you use pandas instead of csv as it is more capable of reading and creating spreadsheets. The first line of each file is usually used to name the columns. As each column has an unique name you do not need to know the position, since in pandas columns can be accessed by name.

import pandas as pd
data = pd.read_csv(yourFilePath, sep = ';', index_col = 0, parse_dates = True)

That would create a spreadsheet similar to MS Excel. The first column 'createTime' will be the index of your spreadsheet and pandas will try to parse the strings in this column to date/time information.

If you like to do some math now, you can do it column by column as pandas is able to interprete that correctly. For selection rows you can use selection by labels :

import pandas as pd
data1 = pd.read_csv(yourFilePath1, sep = ';', index_col = 0, parse_dates = True)
data2 = pd.read_csv(yourFilePath2, sep = ';', index_col = 0, parse_dates = True)
#select a specific time range only, e.g 22th Sept. 2014
range_of_interest = data.loc['20140922']
#select time range >= Sept. 22th, 2014
range_of_interest = data.loc['20140922':]

You can also align your dataframes according to the index. Here is a similar example

aligned_data = data1.join(data2, how = 'outer', rsuffic = '_2')

There are also multiple functions that allow you to calculate descriptions of the columns like mean value, standard deviation, etc. For example, the stastical values for the column 'Grid kW':

data['Grid kW'].describe()

Finally, you can do math with complete dataframes. pandas will automatically use the correct columns. If a speficic column is missing in one of the dataframes, the results of the calculation will be 'nan'. Same is valid for missing values. So the result of the calculation between the dataframes will always result in a dataframe with valid data only.

data1.loc['20140922'] + data2.loc['20140922']

That should cover your intentions.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM