pd.read_csv fails to load the first column of the csv file, and the file size changes when opened and saved in Excel

I am trying to open an otherwise normal-looking csv file, generated as the output of a datalogger, using the pandas read_csv function. The first column of the file is not loaded into the dataframe. However, when I open the same csv file in Excel and hit save, its file size changes from 1797 KB (original csv) to 1658 KB, and when I then run the same read_csv call in pandas, the first column is successfully loaded into the dataframe.

I would like to know why this is happening, and whether I can perform this 'function' on a batch of files without having to manually open and save a large number of such csv files in Excel.

I have tried changing the encoding of the file as it gets imported into Excel, and I have also tried the pd.read_excel function, but the problem persists. I have to give you the original file; if I copy some data from the original file and save it in a new csv file, the problem disappears!

import pandas as pd

df = pd.read_csv("new216.csv")  # load the csv file into a dataframe
df.info()

Actual results (note that it reports the DATE column as entirely null, when in reality every value is non-null, as seen in Excel). All other columns are fine.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39312 entries, 0 to 39311
Data columns (total 9 columns):
DATE            0 non-null float64
TIME            39311 non-null object
TEMPERATURE     39311 non-null float64
PV-VOLTAGE      39311 non-null float64
PV-CURRENT      39311 non-null float64
BAT-VOLTAGE     39311 non-null float64
BAT-CURRENT     39311 non-null float64
LOAD-CURRENT    39311 non-null float64
Unnamed: 8      0 non-null float64
dtypes: float64(8), object(1)
memory usage: 2.7+ MB

Edit_v1: Here are a few lines of the csv file, copied from Excel after opening the file there. Note that if you create a new csv with these values, it works fine, as it should. The problem lies in the original csv. Stack Overflow does not give me an option to share the original file!

DATE      TIME      TEMPERATURE  PV-VOLTAGE  PV-CURRENT  BAT-VOLTAGE  BAT-CURRENT  LOAD-CURRENT
15/07/19  14:56:25  1050         49.9        8.2         49.9         -4.1         12.3
15/07/19  14:56:25  1050         49.9        8.2         49.9         -4.1         12.3
15/07/19  14:57:25  1054         49.2        3.8         49.2         -8.3         12.1
15/07/19  14:58:25  1075         49.7        7.9         49.7         -4.4         12.3
15/07/19  14:59:25  1088         49.2        3.6         49.2         -8.5         12.1
15/07/19  15:00:25  1103         49.1        3.1         49.1         -9           12.1
15/07/19  15:01:25  1114         49.1        2.9         49.1         -9.2         12.1
15/07/19  15:02:26  1131         49.1        3           49.1         -9.1         12.1
15/07/19  15:03:26  1158         49.5        6.9         49.5         -5.3         12.2
15/07/19  15:04:26  1183         49.7        8           49.7         -4.3         12.3
15/07/19  15:05:26  14           52.5        8.3         52.5         8            0.3

Setup:

import pandas as pd
import io
s = '''DATE    TIME    TEMPERATURE PV-VOLTAGE  PV-CURRENT  BAT-VOLTAGE BAT-CURRENT LOAD-CURRENT
15/07/19    14:56:25    1050    49.9    8.2 49.9    -4.1    12.3
15/07/19    14:56:25    1050    49.9    8.2 49.9    -4.1    12.3
15/07/19    14:57:25    1054    49.2    3.8 49.2    -8.3    12.1
15/07/19    14:58:25    1075    49.7    7.9 49.7    -4.4    12.3
15/07/19    14:59:25    1088    49.2    3.6 49.2    -8.5    12.1
15/07/19    15:00:25    1103    49.1    3.1 49.1    -9  12.1
15/07/19    15:01:25    1114    49.1    2.9 49.1    -9.2    12.1
15/07/19    15:02:26    1131    49.1    3   49.1    -9.1    12.1
15/07/19    15:03:26    1158    49.5    6.9 49.5    -5.3    12.2
15/07/19    15:04:26    1183    49.7    8   49.7    -4.3    12.3
15/07/19    15:05:26    14  52.5    8.3 52.5    8   0.3'''

f = io.StringIO(s)

Specifying whitespace as the delimiter works.

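df = pd.read_csv(f, sep=r'\s+')  # raw string keeps the regex backslash intact
f.seek(0)
dg = pd.read_csv(f, delim_whitespace=True)  # same effect; deprecated in recent pandas releases in favour of sep=r'\s+'
f.seek(0)
dh = pd.read_csv(f, delimiter=r'\s+')  # delimiter is an alias for sep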
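
Because the original logger file was not shared, its actual delimiter and encoding are unknown. One hedged way to check them is to look at the raw bytes at the start of the file rather than at what Excel displays. The `describe_head` helper and the sample bytes here are hypothetical, made up purely for illustration; with a real file, the bytes would come from opening it in binary mode.

```python
# Hypothetical stand-in for the first bytes of a logger file. With a real file:
#     with open("new216.csv", "rb") as fh:
#         raw = fh.read(300)
raw = b"\xef\xbb\xbfDATE\tTIME\tTEMPERATURE\r\n15/07/19\t14:56:25\t1050\r\n"


def describe_head(head: bytes) -> dict:
    """Report a few properties of the start of a delimited text file."""
    lines = head.splitlines()
    first_line = lines[0] if lines else b""
    return {
        "has_utf8_bom": head.startswith(b"\xef\xbb\xbf"),
        "has_nul_bytes": b"\x00" in head,  # often a sign of UTF-16 or padding
        "tabs": first_line.count(b"\t"),
        "commas": first_line.count(b","),
        "semicolons": first_line.count(b";"),
    }


report = describe_head(raw)
print(repr(raw[:60]))  # repr makes tabs, BOMs and line endings visible
print(report)
```

Whichever candidate separator dominates the header line is a reasonable first guess for `sep`, and a BOM or NUL bytes would point towards passing an explicit `encoding` to `read_csv`.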

One way to parse the date/time:

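f.seek(0)
# combine columns 0 and 1 (DATE and TIME) into a single datetime column;
# this nested-list form is deprecated in recent pandas releases
dj = pd.read_csv(f, sep=r'\s+', parse_dates=[[0,1]])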
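
Recent pandas releases deprecate combining columns through a nested list in `parse_dates`. A minimal alternative sketch is to read DATE and TIME as text and combine them afterwards with `pd.to_datetime`. The day/month/year format string is an assumption based on the sample rows, and the shortened `sample` data is only for illustration.

```python
import io

import pandas as pd

# Shortened, illustrative sample in the same layout as the logger data.
sample = """DATE    TIME    TEMPERATURE
15/07/19    14:56:25    1050
15/07/19    14:57:25    1054
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Join the two text columns and parse them with an explicit format
# (assumed here to be day/month/two-digit year).
df["DATE_TIME"] = pd.to_datetime(
    df["DATE"] + " " + df["TIME"], format="%d/%m/%y %H:%M:%S"
)
df = df.drop(columns=["DATE", "TIME"])
print(df)
```

Stating the format explicitly avoids any ambiguity between day-first and month-first dates such as 07/08/19.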

There are lots of other options; look through the pandas docs on CSV & text files.


In [40]: print(df.head().to_string())
       DATE      TIME  TEMPERATURE  PV-VOLTAGE  PV-CURRENT  BAT-VOLTAGE  BAT-CURRENT  LOAD-CURRENT
0  15/07/19  14:56:25         1050        49.9         8.2         49.9         -4.1          12.3
1  15/07/19  14:56:25         1050        49.9         8.2         49.9         -4.1          12.3
2  15/07/19  14:57:25         1054        49.2         3.8         49.2         -8.3          12.1
3  15/07/19  14:58:25         1075        49.7         7.9         49.7         -4.4          12.3
4  15/07/19  14:59:25         1088        49.2         3.6         49.2         -8.5          12.1

In [42]: print(dj.head().to_string())
            DATE_TIME  TEMPERATURE  PV-VOLTAGE  PV-CURRENT  BAT-VOLTAGE  BAT-CURRENT  LOAD-CURRENT
0 2019-07-15 14:56:25         1050        49.9         8.2         49.9         -4.1          12.3
1 2019-07-15 14:56:25         1050        49.9         8.2         49.9         -4.1          12.3
2 2019-07-15 14:57:25         1054        49.2         3.8         49.2         -8.3          12.1
3 2019-07-15 14:58:25         1075        49.7         7.9         49.7         -4.4          12.3
4 2019-07-15 14:59:25         1088        49.2         3.6         49.2         -8.5          12.1
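
For the batch part of the question, a minimal sketch is to loop over the files with `pathlib`, read each one with a whitespace separator, and write a comma-separated copy, so that Excel is not needed. This assumes the original logger files really are whitespace- or tab-delimited, which the question does not confirm. The `convert_folder` helper, the folder names, and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def convert_folder(src_dir, dst_dir, pattern="*.csv"):
    """Re-read each whitespace-delimited file in src_dir and write a
    comma-separated copy with the same name into dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src_dir.glob(pattern)):
        frame = pd.read_csv(path, sep=r"\s+")
        out = dst_dir / path.name
        frame.to_csv(out, index=False)  # originals are left untouched
        written.append(out)
    return written


# Demo on a temporary folder holding one made-up, tab-delimited file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw"
    src.mkdir()
    (src / "log1.csv").write_text(
        "DATE\tTIME\tTEMPERATURE\n15/07/19\t14:56:25\t1050\n"
    )
    outputs = convert_folder(src, Path(tmp) / "clean")
    cleaned = pd.read_csv(outputs[0])  # the copy now loads with the default sep
    print(cleaned)
```

Writing into a separate output folder keeps the raw logger files intact in case the separator guess turns out to be wrong.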
