How to read a csv with rows of NUL, ('\x00'), into pandas?

Question

I have a set of CSV files with Date and Time as the first two columns (the files have no headers). They open fine in Excel, but when I read them into Python with pandas `read_csv`, only the first Date is returned, whether or not I try a type conversion.

When I open a file in Notepad, it is not simply comma-separated: every line after line 1 is preceded by a large amount of space. I have tried `skipinitialspace = True` to no avail.

I have also tried various type conversions, but none work. I am currently using `parse_dates = [['Date','Time']], infer_datetime_format = True, dayfirst = True`.

Example output (no conversion):

             0         1    2    3      4   ...    12    13   14   15   16
0      02/03/20  15:13:39  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
1           NaN  15:13:49  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
2           NaN  15:13:59  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
3           NaN  15:14:09  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
4           NaN  15:14:19  5.5  5.4  17.10  ...  30.0  79.0  0.0  0.0  0.0
...         ...       ...  ...  ...    ...  ...   ...   ...  ...  ...  ...
39451       NaN  01:14:27  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39452       NaN  01:14:37  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39453       NaN  01:14:47  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39454       NaN  01:14:57  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39455       NaN       NaN  NaN  NaN    NaN  ...   NaN   NaN  NaN  NaN  NaN

And with parse_dates etc:

               Date_Time  pH1 SP pH  Ph1 PV pH  ...    1    2    3
0      02/03/20 15:13:39        5.5        5.8  ...  0.0  0.0  0.0
1           nan 15:13:49        5.5        5.8  ...  0.0  0.0  0.0
2           nan 15:13:59        5.5        5.7  ...  0.0  0.0  0.0
3           nan 15:14:09        5.5        5.7  ...  0.0  0.0  0.0
4           nan 15:14:19        5.5        5.4  ...  0.0  0.0  0.0
...                  ...        ...        ...  ...  ...  ...  ...
39451       nan 01:14:27        5.5        8.4  ...  0.0  0.0  0.0
39452       nan 01:14:37        5.5        8.4  ...  0.0  0.0  0.0
39453       nan 01:14:47        5.5        8.4  ...  0.0  0.0  0.0
39454       nan 01:14:57        5.5        8.4  ...  0.0  0.0  0.0
39455            nan nan        NaN        NaN  ...  NaN  NaN  NaN

Data copied from Notepad (there is actually more whitespace in front of each line but it wouldn't work here):

Data from `67.csv`

02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                      02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                     02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0,         0.0

In Excel the data displays correctly, so I know the information is there and readable (screenshot not reproduced here).

Code

import sys
from datetime import datetime
from tkinter import Tk, filedialog

import numpy as np
import pandas as pd

def import_file(filename):
    print('\nOpening ' + filename + ":")
    # Read the data in the file
    df = pd.read_csv(filename, header = None, low_memory = False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
    # return the frame so the caller can collect it in dfs
    return df

filenames=[]
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw() # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()

if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))

##Read the data from the specified file/s
print('\nReading data file/s')
dfs=[]
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')

Answer 1

The file is padded with NUL characters (`'\x00'`), not whitespace, and these need to be removed before parsing.
Read the lines, clean them into a list `d`, then load `d` with `pandas.DataFrame`.
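Before cleaning, it may help to confirm that the padding really is NUL bytes rather than spaces. The following is a minimal sketch, not part of the original answer: `count_nul_bytes` is a hypothetical helper, and the sample bytes only imitate the layout described in the question.

```python
import tempfile
from pathlib import Path


def count_nul_bytes(path):
    """Return (number of NUL bytes, total size in bytes) for a file."""
    raw = Path(path).read_bytes()
    return raw.count(b"\x00"), len(raw)


# Hypothetical sample imitating the described layout:
# NUL padding in front of the second row.
sample = b"02/03/20,15:13:39,5.5\r\n" + b"\x00" * 10 + b"02/03/20,15:13:49,5.5\r\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_bytes(sample)
    nul_count, size = count_nul_bytes(sample_path)

print(f"{nul_count} NUL bytes out of {size} bytes")
```

On a real file, a non-zero count from `count_nul_bytes('67.csv')` would support the diagnosis; a count of zero would suggest the padding is something else.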

import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and read all lines
    with open(filename) as f:
        d = f.readlines()

    # remove NUL, strip whitespace from both ends of each string, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]

    # remove empty rows (anything with fewer than 3 fields)
    d = [v for v in d if len(v) > 2]

    # load the file with pandas
    df = pd.DataFrame(d)

    # convert column 0 and 1 to a datetime
    # (ambiguous dates are parsed month-first by default; pass dayfirst=True for DD/MM/YY)
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])

    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)

    # set datetime as the index
    df.set_index('datetime', inplace=True)

    # convert data in columns to floats
    df = df.astype('float')

    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    
    # reset the index
    df.reset_index(inplace=True)
    
    return df.copy()


# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))

Output of `display(dfs[0])`, shown here with `datetime` still set as the index:

                       A    B      C    D    E      F     G     H      I     J     K     L    M    N    O
datetime                                                                                                 
2020-02-03 15:13:39  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:49  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:59  5.5  5.7  34.26  7.2  6.8  10.63  60.0  22.3  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:09  5.5  5.7  34.26  7.2  6.8  10.63  60.0  15.3  300.0  45.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:19  5.5  5.4  17.10  7.2  6.8  10.63  60.0  50.2  300.0  86.0  30.0  79.0  0.0  0.0  0.0
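As an alternative sketch (not the answer's method), the NUL bytes can be stripped from the raw file contents and the cleaned text handed to `pandas.read_csv` through an in-memory buffer, which lets pandas infer the numeric columns. `read_nul_padded_csv` is a hypothetical name. The code assumes the file is plain ASCII apart from the padding and that the dates are DD/MM/YY, as the `dayfirst = True` in the question suggests; under that assumption `02/03/20` parses as 2 March 2020 rather than 3 February.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd


def read_nul_padded_csv(path):
    """Strip NUL bytes from the raw file, then parse it with pandas.read_csv."""
    raw = Path(path).read_bytes()
    # Assumption: apart from the NUL padding, the file is plain ASCII text.
    text = raw.replace(b"\x00", b"").decode("ascii")
    df = pd.read_csv(io.StringIO(text), header=None, skipinitialspace=True)
    # Assumption: dates are DD/MM/YY, as the question's dayfirst=True suggests.
    df["datetime"] = pd.to_datetime(df[0] + " " + df[1], format="%d/%m/%y %H:%M:%S")
    return df.drop(columns=[0, 1])


# Hypothetical two-row sample with NUL padding before the second row.
sample = (
    b"02/03/20,15:13:39,5.5,5.8,         0.0\r\n"
    + b"\x00" * 20
    + b"02/03/20,15:13:49,5.5,5.7,         0.0\r\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_bytes(sample)
    demo = read_nul_padded_csv(sample_path)

print(demo)
```

Reading the whole file into memory keeps the sketch short; for very large files, cleaning line by line as in the answer above may be preferable.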

1 answer

solution1
1 ACCEPTED 2020-09-24 18:29:46
