
How to read a csv with rows of NUL ('\x00') into pandas?

I have a set of csv files with Date and Time as the first two columns (no headers in the files). The files open fine in Excel, but when I read them into Python with pandas read_csv, only the first Date is returned, whether or not I try a type conversion.

When I open a file in Notepad, it is not simply comma-separated: every line after line 1 is preceded by a large amount of space. I have tried skipinitialspace = True to no avail.

I have also tried various type conversions, but none work. I am currently using parse_dates = [['Date','Time']], infer_datetime_format = True, dayfirst = True

Example output (no conversion):

             0         1    2    3      4   ...    12    13   14   15   16
0      02/03/20  15:13:39  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
1           NaN  15:13:49  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
2           NaN  15:13:59  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
3           NaN  15:14:09  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
4           NaN  15:14:19  5.5  5.4  17.10  ...  30.0  79.0  0.0  0.0  0.0
...         ...       ...  ...  ...    ...  ...   ...   ...  ...  ...  ...
39451       NaN  01:14:27  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39452       NaN  01:14:37  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39453       NaN  01:14:47  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39454       NaN  01:14:57  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39455       NaN       NaN  NaN  NaN    NaN  ...   NaN   NaN  NaN  NaN  NaN

And with parse_dates etc.:

               Date_Time  pH1 SP pH  Ph1 PV pH  ...    1    2    3
0      02/03/20 15:13:39        5.5        5.8  ...  0.0  0.0  0.0
1           nan 15:13:49        5.5        5.8  ...  0.0  0.0  0.0
2           nan 15:13:59        5.5        5.7  ...  0.0  0.0  0.0
3           nan 15:14:09        5.5        5.7  ...  0.0  0.0  0.0
4           nan 15:14:19        5.5        5.4  ...  0.0  0.0  0.0
...                  ...        ...        ...  ...  ...  ...  ...
39451       nan 01:14:27        5.5        8.4  ...  0.0  0.0  0.0
39452       nan 01:14:37        5.5        8.4  ...  0.0  0.0  0.0
39453       nan 01:14:47        5.5        8.4  ...  0.0  0.0  0.0
39454       nan 01:14:57        5.5        8.4  ...  0.0  0.0  0.0
39455            nan nan        NaN        NaN  ...  NaN  NaN  NaN

Data copied from Notepad (there is actually more whitespace in front of each line, but it would not display here):

Data from 67.csv

02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                      02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                     02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0,         0.0

And in Excel (so I know the information is there and readable):

[Screenshot of the same data opened in Excel]

Code

import sys
import numpy as np
import pandas as pd
from datetime import datetime
from tkinter import filedialog
from tkinter import *

def import_file(filename):
    print('\nOpening ' + filename + ":")
    ## Read the data in the file
    df = pd.read_csv(filename, header = None, low_memory = False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
    return df  # returned so the caller can append it to dfs

filenames=[]
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw() # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()

if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))

##Read the data from the specified file/s
print('\nReading data file/s')
dfs=[]
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')
  • The file is filled with NUL ('\x00') characters, which need to be removed.
  • Use pandas.DataFrame to load the data from d after the rows have been cleaned.
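
Before cleaning, it can help to confirm that the padding really is NUL rather than ordinary spaces, since Notepad renders both as blank. The following is a minimal sketch, assuming the file fits in memory. The helper count_nul_bytes and the file name nul_sample.csv are hypothetical, and the sample bytes only stand in for the real file.

```python
import tempfile
from pathlib import Path


def count_nul_bytes(path):
    """Return (number of NUL bytes, total number of bytes) for the file at path."""
    raw = Path(path).read_bytes()  # binary read, so no decoding hides the NULs
    return raw.count(b'\x00'), len(raw)


# hypothetical stand-in for the real file: two records separated by a run of NULs
sample = b'02/03/20,15:13:39,5.5\r\n' + b'\x00' * 10 + b'02/03/20,15:13:49,5.5\r\n'

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'nul_sample.csv'
    sample_path.write_bytes(sample)
    nul_count, total = count_nul_bytes(sample_path)

print(f'{nul_count} of {total} bytes are NUL')
```

A non-zero count on the real file would support the diagnosis above; a count of zero would suggest the padding is something else, such as spaces or tabs.
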
import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and read all lines
    with open(filename) as f:
        d = f.readlines()

    # replace NUL, strip whitespace from the ends of the strings, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]

    # remove empty rows (they split into fewer than 3 fields)
    d = [v for v in d if len(v) > 2]

    # load the cleaned rows with pandas
    df = pd.DataFrame(d)

    # convert column 0 and 1 to a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])

    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)

    # set datetime as the index, so the next two steps only touch the data columns
    df.set_index('datetime', inplace=True)

    # convert data in columns to floats
    df = df.astype('float')

    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]

    # reset the index, so datetime becomes a regular column again
    df.reset_index(inplace=True)

    return df


# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))

# show the first DataFrame with datetime as the index
# display() is provided by Jupyter/IPython; use print() in a plain script
display(dfs[0].set_index('datetime'))

                       A    B      C    D    E      F     G     H      I     J     K     L    M    N    O
datetime                                                                                                 
2020-02-03 15:13:39  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:49  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:59  5.5  5.7  34.26  7.2  6.8  10.63  60.0  22.3  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:09  5.5  5.7  34.26  7.2  6.8  10.63  60.0  15.3  300.0  45.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:19  5.5  5.4  17.10  7.2  6.8  10.63  60.0  50.2  300.0  86.0  30.0  79.0  0.0  0.0  0.0
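
An alternative that keeps read_csv in the loop is to strip the NUL characters from the text first and then parse the cleaned text from memory. The sketch below is not part of the original answer. The helper read_csv_without_nul and the sample file are hypothetical, and the explicit day-first format is an assumption based on the dayfirst = True setting in the question, so 02/03/20 is read as 2 March 2020 rather than 3 February 2020.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_without_nul(path):
    """Read a headerless csv after removing every NUL character from it."""
    # read the whole file as text and drop the NUL padding
    with open(path) as f:
        text = f.read().replace('\x00', '')

    # parse the cleaned text; lines left empty by the cleanup are skipped by default
    df = pd.read_csv(io.StringIO(text), header=None, skipinitialspace=True)

    # assumption: dates are day-first with a two-digit year
    stamps = pd.to_datetime(df[0] + ' ' + df[1], format='%d/%m/%y %H:%M:%S')
    df.insert(0, 'datetime', stamps)
    return df.drop(columns=[0, 1])


# hypothetical sample imitating the layout shown in the question
sample = (
    b'02/03/20,15:13:39,5.5,5.8,         0.0\r\n'
    + b'\x00' * 20
    + b'02/03/20,15:13:49,5.5,5.7,         0.0\r\n'
    + b'\x00' * 20
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.csv'
    sample_path.write_bytes(sample)
    cleaned = read_csv_without_nul(sample_path)

print(cleaned)
```

Letting read_csv do the parsing means numeric columns are inferred automatically, and skipinitialspace=True takes care of the spaces before the last field. If the dates are actually month-first, the format string would need to be '%m/%d/%y %H:%M:%S' instead.
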
