简体   繁体   中英

reading a complicated CSV file with python

I didn't know how to title this question so if it should be renamed please inform me and I will.

I'm reading a csv file that I saved off from a piece of equipment I used in taking some measurements. The data and various other key information is all being saved off. I've worked all day long on this and I cannot figure out how to retrieve every piece of information from this file properly. I need to access each piece of data/information and after doing so plot the data and read in the other various information like the data/timestamp/timezone/modelnumber/serialnumber and so on... I apologize if this question is too generic but I'm at lost on how to resolve this.

I've typed up several versions of code so I'll list only what I've been able to get to work. I have no idea why I have to use sep=delimiter . I would think delimiter=',' would work but it doesn't. I've found from research that header=none because my file doesn't have headings.

Canopy is telling me that the engine 'C' will not work so as a result I specify 'python'. From the output of this code it seems to capture everything. But its telling me I only have one column and have no idea how to separate out all this information.

Here is part of my csv file.

    ! FILETYPE CSV              
    ! VERSION 1.0   1           
    ! TIMESTAMP Friday   15 April 2016 04:50:05         
    ! TIMEZONE (GMT+08:00) Kuala Lumpur  Singapore          
    ! NAME Keysight Technologies                
    ! MODEL N9917A              
    ! SERIAL US52240515             
    ! FIRMWARE_VERSION A.08.05              
    ! CORRECTION                
    ! Application SA                
    ! Trace TIMESTAMP: 2016-04-15 04:50:05Z             
    ! Trace GPS Info...             
    ! GPS Latitude:                 
    ! GPS Longitude:                
    ! GPS Seconds Since Last Read: 0                
    ! CHECKSUM 1559060681               
    ! DATA Freq SA Max Hold SA Blank    SA Blank    SA Blank
    ! FREQ UNIT Hz              
    ! DATA UNIT dBm             
    BEGIN               
    2000000000,-62.6893499803169,0,0,0
    2040000000,-64.1528386206532,0,0,0
    2080000000,-63.7751897198055,0,0,0
    2120000000,-63.663056855945,0,0,0
    2160000000,-64.227155790167,0,0,0
    2200000000,-63.874804848758,0,0,0
    END

Here is my code:

import pandas as pd
df = pd.read_csv('/Users/_XXXXXXXXX_/Desktop/TP1_041416_C.csv',
    sep='delimter',
    header=None,
    engine='python')

UPDATE: - this version will read and parse the header part first, it will generate info dictionary from the parsed header and will prepare skiprows list for the pd.read_csv() function

from collections import defaultdict
import io
import re
import pandas as pd
import pprint as pp

fn = r'D:\temp\.data\36671176.data'

def parse_header(filename):
    """
    parses useful (check `header_flt` variable) information from the header
           and saves it into `defaultdict`,
    generates `skiprow` list for `pd.read_csv()`,
    breaks after parsing the header, so the data block will NOT be read

    returns: parsed info as defaultdict obj, skiprows list    
    """
    # useful header information that will be saved into defaultdict
    header_flt = r'TIMESTAMP|TIMEZONE|NAME|MODEL|SERIAL|FIRMWARE_VERSION|Trace TIMESTAMP:'

    with open(fn) as f:
        d = defaultdict(list)
        i = 0
        skiprows = []
        for line in f:
            line = line.strip()
            if line.startswith('!'):
                skiprows.append(i)
                # parse: `key` (first RegEx group)
                #    and `value` (second RegEx group)
                m = re.search(r'!\s+(' + header_flt + ')\s+(.*)\s*', line)
                if m:
                    # print(m.group(1), '->', m.group(2))
                    # save parsed `key` and `value` into defaultdict
                    d[m.group(1)] = m.group(2)
            elif line.startswith('BEGIN'):
                skiprows.append(i)
            else:
                # stop loop if line doesn't start with: '!' or 'BEGIN'
                break
            i += 1
        return d, skiprows

info, skiprows = parse_header(fn)

# parses data block, the header part will be skipped
# NOTE: `comment='E'` - will skip the footer row: "END"
df = pd.read_csv(fn, header=None, usecols=[0,1], names=['freq', 'dBm'],
                 skiprows=skiprows, skipinitialspace=True, comment='E',
                 error_bad_lines=False)

print(df)
print('-' * 60)
pp.pprint(info)

OLD version: - reads the whole file in memory and parses it. It might cause problems with big files because it will allocate memory for the whole file content plus the resulting DF:

from collections import defaultdict
import io
import re
import pandas as pd
import pprint as pp

fn = r'D:\temp\.data\36671176.data'

header_pat = r'(TIMESTAMP|TIMEZONE|NAME|MODEL|SERIAL|FIRMWARE_VERSION)\s+([^\r\n]*?)\s*[\r\n]+'

def parse_file(filename):
    with open(fn) as f:
        txt = f.read()

    m = re.search(r'BEGIN\s*[\r\n]+(.*)[\n\r]+END', txt, flags=re.DOTALL|re.MULTILINE)
    if m:
        data = m.group(1)
        df = pd.read_csv(io.StringIO(data), header=None, usecols=[0,1], names=['freq', 'dBm'])
    else:
        df = pd.DataFrame()

    d = defaultdict(list)
    for m in re.finditer(header_pat, txt, flags=re.S|re.M):
        d[m.group(1)] = m.group(2)

    return df, d

df, info = parse_file(fn)
print(df)
pp.pprint(info)

Output:

         freq        dBm
0  2000000000 -62.689350
1  2040000000 -64.152839
2  2080000000 -63.775190
3  2120000000 -63.663057
4  2160000000 -64.227156
5  2200000000 -63.874805
defaultdict(<class 'list'>,
            {'FIRMWARE_VERSION': 'A.08.05',
             'MODEL': 'N9917A',
             'NAME': 'Keysight Technologies',
             'SERIAL': 'US52240515',
             'TIMESTAMP': 'Friday   15 April 2016 04:50:05',
             'TIMEZONE': '(GMT+08:00) Kuala Lumpur  Singapore'})

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM