
Skip rows with missing values in read_csv

I have a very large CSV which I need to read in. To make this fast and to save RAM, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.
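To make the problem concrete, here is a minimal sketch of the failing call (file and column names are hypothetical):

import numpy as np
import pandas as pd

# Works only while 'user_id' has no blanks. As soon as a missing value
# appears, pandas raises "ValueError: Integer column has NA values",
# because NaN is a float and cannot be stored in a uint32 column.
df = pd.read_csv('big.csv', dtype={'user_id': np.uint32})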

  1. Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file, but that means I couldn't set the dtype until then, and so would use too much RAM.
  2. Is it possible to convert missing values to some other value of my choice while the data is being read?

It would be handy if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on pandas's GitHub is in order...

Using a converter function

However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:

import pandas as pd

def conv(val):
    # converters receive each field as a string, so a missing value
    # arrives as the empty string, never as np.nan (and note that
    # np.nan == np.nan is False, so that comparison can never match)
    if val == '':
        return 0  # or whatever else you want to represent your NaN with
    return int(val)

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)

Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
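If many columns are affected, one option is to build the converters dict programmatically rather than writing it out by hand. A sketch, with hypothetical column names:

import pandas as pd

def conv(val):
    # fields arrive as strings; an empty string marks a missing value
    return int(val) if val else 0

na_cols = ['user_id', 'clicks', 'views']  # hypothetical affected columns
df = pd.read_csv('big.csv', converters={c: conv for c in na_cols})

Since the converter already returns ints, these columns come back as int64; a later .astype(np.uint32) can shrink them further if needed.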

Also note that this might slow down your read_csv performance, depending on how the converters function is handled. Further, if you just have one column that needs NaNs handled during read, you can skip a proper function definition and use a lambda function instead:

df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else int(x)}, dtype=...)

Reading in chunks

You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:

import numpy as np
import pandas as pd

pieces = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk.dropna(axis=0, inplace=True)  # drop all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    pieces.append(chunk)
result = pd.concat(pieces, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
del pieces, chunk

Note that this method keeps duplication in check: the cleaned chunks and the final frame coexist only briefly, while pd.concat assembles the result, and by that point the rows with missing values have already been dropped and the column downcast to uint32, which is a fair bargain. This method may also work out to be faster than using a converter function.

There is no feature in Pandas that does that. You can implement it in regular Python like this:

import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""

    for record in records:
        for k, v in record.items():  # .iteritems() is Python 2 only
            if v == '':
                break
            record[k] = int(v)
        else:  # runs only when the loop did not break
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))

Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).

If you show some data, SO people could help.

pd.read_csv('FILE', keep_default_na=False)  # blank fields stay as '' instead of becoming NaN

For starters, try these:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

na_values : str or list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.

keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.

na_filter : boolean, default True
    Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
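Putting those options together, a sketch with hypothetical file and column names: with na_filter=False nothing is converted to NaN, empty fields arrive as plain strings, and you can drop those rows yourself before casting to the compact dtype.

import numpy as np
import pandas as pd

# na_filter=False: blanks stay as '' and no NaN detection is done,
# which can also speed up parsing of a large file
df = pd.read_csv('big.csv', na_filter=False, dtype={'user_id': str})
df = df[df['user_id'] != '']                     # skip rows with missing values
df['user_id'] = df['user_id'].astype(np.uint32)  # the cast is now safe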
