Skipping 0xff bytes when using pandas read_csv

Question

I am trying to read some log files from a boiler, but they are badly formatted.

When I try to read the file with

import pandas

print(pandas.read_csv('./data/CM120102.CSV', delimiter=';'))

I get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 49: invalid start byte

For some reason, the CSV header ends with null bytes.

https://gist.github.com/Ession/6e5bf67392276048c7bd

http://mathiasjost.com/CM120102.CSV <== this should work (or not)

Is there any way to read these files with pandas without fixing them first?

Answer 1

I would read the file into a string, then do some munging in Python before passing it to pandas.read_csv. Example code follows.

import re
from io import StringIO

import pandas as pd

# Get the data as a Python string. On Python 3, latin-1 maps every byte to a
# character, so the 0xff bytes decode (as 'ÿ') instead of raising UnicodeDecodeError.
with open("CM120102.CSV", "r", encoding="latin-1") as myfile:
    data = myfile.read()

# Munge in Python: keep only the characters expected in the data. This removes
# the 0xff garbage, but note that it also drops spaces, '-' and '\r'.
data = re.sub(r'[^a-zA-Z0-9_.;:\n]', '', data)
data = data + '\n'  # the very last newline is missing?
data = re.sub(r';\n', r'\n', data)  # the trailing ';' separator on each line is problematic

# Now load the cleaned text into a pandas DataFrame
df = pd.read_csv(StringIO(data), index_col=None, header=0,
    skipinitialspace=True, sep=';', parse_dates=True)
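
As an alternative to the regex whitelist above, a narrower approach is to delete only the offending bytes before decoding, which leaves spaces, hyphens and other legitimate characters alone. The following is a minimal sketch, not a tested fix for this particular file: the helper name clean_csv_buffer and its junk parameter are hypothetical, and it assumes the only garbage is 0xff and null bytes.

```python
import io


def clean_csv_buffer(path, junk=b"\xff\x00"):
    """Return a text buffer for the file at `path` with the junk bytes removed.

    `junk` is an assumption: the byte values believed to be garbage.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    # bytes.translate(None, delete) drops every byte value listed in `delete`.
    cleaned = raw.translate(None, junk)
    # Any other undecodable byte becomes U+FFFD instead of raising an error.
    return io.StringIO(cleaned.decode("utf-8", errors="replace"))
```

The returned buffer can then be handed to pandas, for example pd.read_csv(clean_csv_buffer("CM120102.CSV"), sep=';'). pandas.read_csv also accepts an encoding argument; with encoding='latin-1' every byte decodes, so the file loads without an error, but the 0xff bytes then show up as 'ÿ' characters in the header and still need stripping.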

Answer 1
3, accepted, 2015-04-07 10:03:00