Reading CSV data with missing values into Python using pandas

Question

I have a CSV file that looks like this:

"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",

All quoted entries are strings. Non-quoted entries are numerical. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to work the way I want: it still treats both ,"", and ,, as NaN, which is wrong for the first one.

d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)

Can anybody help? Is it possible at all?
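
The behaviour described in the question can be reproduced with a small sketch. It is an illustration only: it reads three rows of the sample from an in-memory string instead of a file, and it leaves out the quoting argument, so it is not identical to the call above. The variable names are placeholders.

```python
import io

import pandas as pd

# Three rows taken from the sample: quoted empty, unquoted empty, quoted "5".
raw = '''"row ID","label","val"
"Row1","",6
"Row5",,9
"Row0","5",6
'''

d = pd.read_csv(io.StringIO(raw), sep=",", keep_default_na=False, na_values=[""])

# Quotes are stripped before the NaN check, so "" and an empty
# field both end up as the empty token and both become NaN.
print(d["label"].isna().tolist())
```

Both of the first two rows are reported as missing, which is exactly the loss of information the question is about.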

Answer 1

You can try numpy.genfromtxt and specify the missing_values parameter:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
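
A minimal sketch of this suggestion, restricted to the numeric val column. The last data row and its "NA" marker are hypothetical additions, included only so that missing_values has something to act on. genfromtxt splits on the delimiter without interpreting quotes, so the quoted columns would need converters and are skipped here via usecols.

```python
import io

import numpy as np

raw = '''"row ID","label","val"
"Row0","5",6
"Row5",,9
"Row6","nan",
"RowX","5",NA
'''  # "RowX" is a made-up row, not part of the original sample

val = np.genfromtxt(
    io.StringIO(raw),
    delimiter=",",
    skip_header=1,          # skip the header line
    usecols=2,              # only the unquoted numeric column
    missing_values="NA",    # extra marker; empty fields count as missing anyway
    filling_values=np.nan,  # value substituted for missing entries
)
print(val)
```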

Answer 2

Maybe something like:

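import csv

import numpy as np
import pandas as pd

d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting=csv.QUOTE_NONNUMERIC)
# Turn the literal string 'nan' in the label column into a real NaN.
mask = d['label'] == 'nan'
# Assign through .loc to avoid chained assignment on d.label[mask].
d.loc[mask, 'label'] = np.nan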
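
The snippet above maps the literal string 'nan' to NaN, but on its own it does not separate "" from an unquoted empty field. One possible workaround, not taken from any of the answers, is to tag quoted empty fields with a sentinel before pandas sees them and to turn the sentinel back into an empty string afterwards. The sentinel name and the regular expression are assumptions. The pattern only handles fields that consist of exactly "" and would also fire on that same character sequence inside a longer quoted field.

```python
import io
import re

import pandas as pd

SENTINEL = "__EMPTY_STRING__"  # hypothetical marker; must not occur in real data

raw = '''"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row5",,9
"Row6","nan",
'''

# Replace fields that are exactly "" with a quoted sentinel.
marked = re.sub(r'(?m)(^|,)""(?=,|$)', r'\1"' + SENTINEL + '"', raw)

# Only truly empty fields are treated as NaN; the string 'nan' is kept.
d = pd.read_csv(io.StringIO(marked), keep_default_na=False, na_values=[""])

# Restore the empty strings that were protected by the sentinel.
d.loc[d["label"] == SENTINEL, "label"] = ""
print(d)
```

With this pre-pass the label column keeps '5', '' and 'nan' as strings, while the unquoted empty field and the empty val entry become NaN.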

Answer 3

I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) to get it working. Comments on this piece of code are very welcome!

import re
import pandas as pd
import numpy as np

# clear quoting characters
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'

y = np.genfromtxt(
    file,
    delimiter=',',
    filling_values=np.nan,
    names=True,
    dtype=type(None),
    converters={'row_ID': filterTheField, 'label': filterTheField, 'val': float},
)

d = pd.DataFrame(y)

print(d)
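
The converter is the part of this answer that separates the two kinds of empty field, so it can help to look at it in isolation. The sketch below is a simplified variant, not the answer's exact function: the name strip_quotes is hypothetical, the pattern requires both quotes, and a decode step is added because, depending on the NumPy version and the encoding argument, genfromtxt may pass bytes rather than str to converters.

```python
import re

import numpy as np


def strip_quotes(field):
    """Return the text inside the quotes, or NaN if the field is not quoted."""
    if isinstance(field, bytes):
        # Assumption: the file is UTF-8 encoded.
        field = field.decode("utf-8")
    m = re.match(r'^"(.*)"$', field.strip())
    return m.group(1) if m else np.nan


print(repr(strip_quotes('"Row0"')))  # quoted text
print(repr(strip_quotes('""')))      # quoted empty field
print(repr(strip_quotes('')))        # unquoted empty field
```

A quoted empty field comes back as an empty string, while an unquoted empty field does not match the pattern and is returned as NaN. As in the answer above, any other unquoted value passed to this converter is also treated as missing.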

Answer 1
1 2014-12-01 14:16:25

Answer 2
0 2014-12-01 13:59:54

Answer 3
0 accepted 2014-12-03 09:29:17
