
Make pandas.read_csv skip lines that have fewer column delimiters than the main lines

Using pandas.read_csv with the on_bad_lines='warn' option works well for lines with too many column delimiters: the bad lines are not loaded, and their line numbers are reported on stderr:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    cat,M,66,
    caw,F,15
    dog,M,66,,
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn')

    # b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'

    df.head(10)
    #    nom  f  nb
    # 0  bat  F  52
    # 1  caw  F  15
    # 2  fly  F  61
    # 3  ant  F  21

But when a line has fewer delimiters (here sep=',') than the main lines, it is loaded anyway and the missing fields are filled with NaN:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    catM66,
    caw,F,15
    dog,M66
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
    df.head(10)

    #       nom    f   nb
    # 0     bat    F   52
    # 1  catM66  NaN  NaN            <==
    # 2     caw    F   15
    # 3     dog  M66  NaN            <==
    # 4     fly    F   61
    # 5     ant    F   21

Is there a way to make read_csv skip lines that have fewer column delimiters than the main lines?

Note: I am loading really big data files (e.g. hundreds of millions of lines), so the idea is not to propose any upfront grep/sed/awk processing but to benefit from the fast bulk load of read_csv.

pd.read_csv() is a very nice function that performs a well-defined computation, but you want a slightly different one: you wish to filter out all rows containing fewer than K fields.

"the idea is not to propose any upfront grep / sed / awk processing"

You have rather constrained the solution space. Apparently speed (elapsed time) or power efficiency (watts dissipated) is the motivating concern.

You correctly observe that grep is quite fast and would be a natural pre-processing stage. One could store its filtered output in a temp file and feed that to .read_csv(), potentially costing extra disk I/O. A better solution would be to pipe its output using the subprocess library.
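As a rough sketch of the piping idea, the snippet below streams grep's output straight into read_csv without a temp file. The helper name read_csv_via_grep, the min_fields parameter and the regular expression are illustrative assumptions, not part of the original post. It also assumes a POSIX grep on the PATH and a CSV with no quoted fields that contain commas.

```python
import os
import subprocess
import tempfile

import pandas as pd


def read_csv_via_grep(path, min_fields=3, **kwargs):
    """Hypothetical helper: let grep drop short lines, then parse the rest.

    Keeps only lines with at least (min_fields - 1) commas. This is a naive
    count that ignores CSV quoting.
    """
    pattern = r"^([^,]*,){%d}" % (min_fields - 1)
    proc = subprocess.Popen(["grep", "-E", pattern, path], stdout=subprocess.PIPE)
    try:
        # proc.stdout is a binary file-like object that read_csv can consume.
        return pd.read_csv(proc.stdout, **kwargs)
    finally:
        proc.stdout.close()
        proc.wait()


# Small demo using the data from the question, written to a temp file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("nom,f,nb\nbat,F,52\ncatM66,\ncaw,F,15\ndog,M66\nfly,F,61\nant,F,21\n")

try:
    df = read_csv_via_grep(tmp.name, dtype=str)
finally:
    os.remove(tmp.name)

print(df)
```

Lines with too many delimiters still pass this filter, so on_bad_lines='warn' can be passed through kwargs to handle them as before. Whether the extra child process is fast enough would need to be measured on the real files.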

The original post mentions no grep timing results, so it is unclear whether the overhead of an extra child process has been shown to be "too slow". There's no throughput specification of N rows / second, so it's unclear how this or any competing proposal should be evaluated.

Note that .read_csv() accepts a file-like object (one with a read() method). A Python generator that inspects each row and yields only suitable rows can serve as the source, provided it is wrapped in such an object.
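One way to sketch the generator approach is shown below. The names IterStream and rows_with_min_fields are made up for illustration. The delimiter count is naive (it ignores quoted fields), and the per-line Python loop may well be slower than read_csv itself on very large files.

```python
import io

import pandas as pd


class IterStream(io.RawIOBase):
    """Hypothetical adapter: expose an iterator of bytes as a readable stream."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the iterator until there is data or the input ends.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of file
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def rows_with_min_fields(lines, sep=b",", min_fields=3):
    """Yield only the lines that contain at least min_fields - 1 separators."""
    for line in lines:
        if line.count(sep) >= min_fields - 1:
            yield line


# Demo with the data from the question; a real file opened in "rb" mode
# could be used in place of BytesIO.
raw = io.BytesIO(
    b"nom,f,nb\nbat,F,52\ncatM66,\ncaw,F,15\ndog,M66\nfly,F,61\nant,F,21\n"
)
stream = io.BufferedReader(IterStream(rows_with_min_fields(raw)))
df = pd.read_csv(stream, dtype=str)
print(df)
```

The header line passes the filter because it has the full number of delimiters. Wrapping the raw stream in io.BufferedReader gives read_csv an ordinary binary file object.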

Given that you're gung ho on calling .read_csv(), a function which doesn't quite compute what you want, there seems to be little for it but to post-process its output and hope for the best.

Filtering out all NaNs might do, but that's a little on the drastic side. There is some buggy generating process that produces "short" rows with fewer than K fields. If you know the minimum number of fields it's guaranteed to produce, you could at least do appropriate column-wise filtering to discard short rows. Then you get to preserve true NaNs in the first several columns. Good luck!
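A minimal sketch of the column-wise filtering, using the data from the question. It assumes that a short row always leaves at least the last column empty. The variable names required and clean are illustrative.

```python
from io import StringIO

import pandas as pd

data = StringIO(
    "nom,f,nb\nbat,F,52\ncatM66,\ncaw,F,15\ndog,M66\nfly,F,61\nant,F,21\n"
)
df = pd.read_csv(data, sep=",", on_bad_lines="warn", dtype=str)

# Short rows are padded with NaN on the right, so checking only the
# trailing column(s) keeps genuine NaNs in the earlier columns.
required = [df.columns[-1]]
clean = df.dropna(subset=required).reset_index(drop=True)
print(clean)
```

A complete row whose last field is genuinely empty is dropped as well, so this only fits data where the trailing column is always populated.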
