How to read a csv with rows of NUL ('\x00') into pandas?

Question

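I have a set of csv files with the date and time as the first two columns (the files have no header). The files open fine in Excel, but when I try to read them into Python with pandas read_csv, only the first date is returned, whether or not I attempt a type conversion.

When I open a file in Notepad, it is not simply comma-separated: every line after line 1 is preceded by a large run of whitespace. I tried skipinitialspace = True, to no avail.

I have also tried various type conversions, but none of them worked. I am currently using parse_dates = [['Date','Time']], infer_datetime_format = True, dayfirst = True

Example output (no conversion):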

             0         1    2    3      4   ...    12    13   14   15   16
0      02/03/20  15:13:39  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
1           NaN  15:13:49  5.5  5.8  42.84  ...  30.0  79.0  0.0  0.0  0.0
2           NaN  15:13:59  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
3           NaN  15:14:09  5.5  5.7  34.26  ...  30.0  79.0  0.0  0.0  0.0
4           NaN  15:14:19  5.5  5.4  17.10  ...  30.0  79.0  0.0  0.0  0.0
...         ...       ...  ...  ...    ...  ...   ...   ...  ...  ...  ...
39451       NaN  01:14:27  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39452       NaN  01:14:37  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39453       NaN  01:14:47  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39454       NaN  01:14:57  5.5  8.4  60.00  ...  30.0  68.0  0.0  0.0  0.0
39455       NaN       NaN  NaN  NaN    NaN  ...   NaN   NaN  NaN  NaN  NaN

And with parse_dates etc.:

               Date_Time  pH1 SP pH  Ph1 PV pH  ...    1    2    3
0      02/03/20 15:13:39        5.5        5.8  ...  0.0  0.0  0.0
1           nan 15:13:49        5.5        5.8  ...  0.0  0.0  0.0
2           nan 15:13:59        5.5        5.7  ...  0.0  0.0  0.0
3           nan 15:14:09        5.5        5.7  ...  0.0  0.0  0.0
4           nan 15:14:19        5.5        5.4  ...  0.0  0.0  0.0
...                  ...        ...        ...  ...  ...  ...  ...
39451       nan 01:14:27        5.5        8.4  ...  0.0  0.0  0.0
39452       nan 01:14:37        5.5        8.4  ...  0.0  0.0  0.0
39453       nan 01:14:47        5.5        8.4  ...  0.0  0.0  0.0
39454       nan 01:14:57        5.5        8.4  ...  0.0  0.0  0.0
39455            nan nan        NaN        NaN  ...  NaN  NaN  NaN

Data copied from Notepad (each line is actually preceded by even more spaces, but that does not carry over here):

Data from `67.csv`

02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                       02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                      02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0,         0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                     02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0,         0.0

In Excel (so I know the information is there and readable):

Code

import sys

import numpy as np

import pandas as pd

from datetime import datetime

from tkinter import filedialog
from tkinter import *

def import_file(filename):
    print('\nOpening ' + filename + ":")
    ##Read the data in the file
    df = pd.read_csv(filename, header = None, low_memory = False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
    ##Return the frame so the caller can collect it in dfs
    return df

filenames=[]
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw() # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()

if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))

##Read the data from the specified file/s
print('\nReading data file/s')
dfs=[]
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')

Answer 1

The file is padded with NUL characters, '\x00', which need to be removed.
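One way to confirm this diagnosis before cleaning anything is to count the NUL bytes in the raw file, since Notepad renders them as blank space. The snippet below is a minimal sketch: `count_nul_bytes` and the demo file `nul_demo.csv` are hypothetical names used only for illustration, and the demo bytes merely imitate the layout described in the question.

```python
import tempfile
from pathlib import Path


def count_nul_bytes(path):
    """Return (number of NUL bytes, total number of bytes) for a file."""
    raw = Path(path).read_bytes()  # read as bytes so nothing is decoded or hidden
    return raw.count(b"\x00"), len(raw)


# Hypothetical demo file: the second row is preceded by NUL bytes,
# imitating the padding described in the question.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "nul_demo.csv"
    demo.write_bytes(
        b"02/03/20,15:13:39,5.5\r\n"
        + b"\x00" * 10
        + b"02/03/20,15:13:49,5.5\r\n"
    )
    nul_count, total = count_nul_bytes(demo)

print(f"{nul_count} of {total} bytes are NUL")
```

A non-zero count on a real file such as `67.csv` would suggest that the apparent whitespace is NUL padding, which `skipinitialspace` does not remove.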
After the lines have been cleaned, load the data from d with pandas.DataFrame.

import pandas as pd
import string  # to make column names

# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and clean it
    with open(filename) as f:
        d = list(f.readlines())

        # replace NUL, strip whitespace from the end of the strings, split each string into a list
        d = [v.replace('\x00', '').strip().split(',') for v in d]

        # remove some empty rows
        d = [v for v in d if len(v) > 2]

    # load the file with pandas
    df = pd.DataFrame(d)

    # convert column 0 and 1 to a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])

    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)

    # set datetime as the index
    df.set_index('datetime', inplace=True)

    # convert data in columns to floats
    df = df.astype('float')

    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    
    # reset the index
    df.reset_index(inplace=True)
    
    return df.copy()


# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    
    dfs.append(import_file(filename))

`display(dfs[0])`

                       A    B      C    D    E      F     G     H      I     J     K     L    M    N    O
datetime                                                                                                 
2020-02-03 15:13:39  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:49  5.5  5.8  42.84  7.2  6.8  10.63  60.0   0.0  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:13:59  5.5  5.7  34.26  7.2  6.8  10.63  60.0  22.3  300.0   1.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:09  5.5  5.7  34.26  7.2  6.8  10.63  60.0  15.3  300.0  45.0  30.0  79.0  0.0  0.0  0.0
2020-02-03 15:14:19  5.5  5.4  17.10  7.2  6.8  10.63  60.0  50.2  300.0  86.0  30.0  79.0  0.0  0.0  0.0
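As a possible alternative to splitting each line by hand, the NUL bytes can be stripped from the raw file contents and the remaining text handed to `pd.read_csv`, which then takes care of the numeric conversion. This is a sketch under assumptions, not the approach used in the answer above: `read_csv_without_nul` is a hypothetical helper, the UTF-8 encoding is a guess about the files, and the demo rows are shortened stand-ins for the real data.

```python
import io
import os
import tempfile

import pandas as pd


def read_csv_without_nul(path, encoding="utf-8"):
    """Drop NUL bytes from the raw file, then parse what is left with pandas."""
    with open(path, "rb") as f:
        raw = f.read()
    # Remove the padding at the byte level; the encoding is an assumption.
    text = raw.replace(b"\x00", b"").decode(encoding)
    # skipinitialspace handles the spaces before the last value in each row.
    return pd.read_csv(io.StringIO(text), header=None, skipinitialspace=True)


# Hypothetical demo data: two shortened rows, the second one NUL-padded.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "nul_demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"02/03/20,15:13:39,5.5,5.8,         0.0\r\n")
        f.write(b"\x00" * 50 + b"02/03/20,15:13:49,5.5,5.8,         0.0\r\n")
    df = read_csv_without_nul(demo_path)

# Combine the date and time columns, as in the answer's code.
df["datetime"] = pd.to_datetime(df[0] + " " + df[1])
df = df.drop(columns=[0, 1])
print(df)
```

Here `pd.to_datetime` reads `02/03/20` month-first, matching the output shown above. If the dates are actually day-first, as the `dayfirst = True` setting in the question suggests, passing `dayfirst=True` to `pd.to_datetime` would be the adjustment to try.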
