Pandas skips lines in read_csv: can I record them to a variable or log file?

I've seen similar questions on here, but nothing that is quite what I want to do.

I'm reading in a tsv/csv file using:

        try:
            dataframe = pd.read_csv(
                filepath_or_buffer=filename_or_obj,
                sep='\t',
                encoding='utf-8',
                skip_blank_lines=True,
                error_bad_lines=False,
                warn_bad_lines=True,
                dtype=data_type_dict,
                engine='python',
                quoting=csv.QUOTE_NONE
            )
        except UnicodeDecodeError:
            dataframe = pd.read_csv(
                filepath_or_buffer=exception_filename_or_obj,
                sep='\t',
                encoding='latin-1',
                skip_blank_lines=True,
                error_bad_lines=False,
                warn_bad_lines=True,
                dtype=data_type_dict,
                engine='python',
                quoting=csv.QUOTE_NONE
            )

I have clearly defined headers within the file, but sometimes the file has unexpected additional columns and I get messages like the following in the console:

Skipping line 251643: Expected 20 fields in line 251643, saw 21

This is fine for my process. I would just like a way to record these messages or lines to either a dataframe or a log file, so that I know which lines have been skipped. Because the files can be submitted by anyone and this is a formatting issue, I'm not interested in fixing the lines, just in recording the line numbers that fail.

Massive thanks in advance :)

Edit: included the try/except clause.

To reproduce the issue, I used the following CSV file (dummy.csv):

F1,F2,F3
11,A,10.54
18,B,0.12,low
24,A,19.00
10,C,7.01,low
22,D,39.11,high
49,E,12.12

Note that some lines have an extra field.

Since we are using error_bad_lines=False, no exception is raised for these lines, so try-except will not catch them. pandas writes the "Skipping line" messages to stderr, so we need to redirect stderr:

from contextlib import redirect_stderr
import pandas as pd
# import io

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# newer versions use the on_bad_lines parameter instead.
with open('error_messages.log', 'w') as h:
    # Alternative: capture the messages in memory first
    # f = io.StringIO()
    # with redirect_stderr(f):
    with redirect_stderr(h):
        df = pd.read_csv(filepath_or_buffer='dummy.csv',
                sep=',',            # change it for your data
                encoding='latin-1',
                skip_blank_lines=True,
                error_bad_lines=False,
                # dtype=data_type_dict,
                engine='python',
                # quoting=csv.QUOTE_NONE
                )
        # h.write(f.getvalue())      # Write the error messages to log file

print(df)

The above code writes the messages to a log file.

Here is sample output from the log file:

Skipping line 3: Expected 3 fields in line 3, saw 4
Skipping line 5: Expected 3 fields in line 5, saw 4
Skipping line 6: Expected 3 fields in line 6, saw 4
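
The question also asks for the failing line numbers themselves. A minimal sketch of one way to get them is to parse the logged messages with a regular expression. The function name parse_skipped_lines and the pattern are illustrative assumptions based only on the message format shown above; other pandas versions may word the message differently.

```python
import re

# Assumed message format, taken from the sample log output above.
SKIP_PATTERN = re.compile(
    r"Skipping line (?P<line>\d+): "
    r"Expected (?P<expected>\d+) fields in line \d+, saw (?P<saw>\d+)"
)


def parse_skipped_lines(log_text):
    """Return one dict per 'Skipping line ...' message found in log_text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in SKIP_PATTERN.finditer(log_text)
    ]


sample_log = (
    "Skipping line 3: Expected 3 fields in line 3, saw 4\n"
    "Skipping line 5: Expected 3 fields in line 5, saw 4\n"
    "Skipping line 6: Expected 3 fields in line 6, saw 4\n"
)

skipped = parse_skipped_lines(sample_log)
print([row["line"] for row in skipped])  # [3, 5, 6]
```

The resulting list of dicts can be passed to pd.DataFrame(skipped) if a dataframe of skipped lines is preferred. In practice, log_text would be the contents of error_messages.log or the value of the io.StringIO buffer.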

Update

Modified the code based on a suggestion in the comments.
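
On newer pandas versions, where error_bad_lines is no longer available, a different approach may be simpler. The sketch below assumes pandas 1.4 or later, where on_bad_lines accepts a callable when engine='python'. The names record_bad_line and bad_rows are illustrative, and the CSV text is the dummy.csv content from above read from memory.

```python
import io

import pandas as pd

# Same content as dummy.csv above, held in memory for the example.
CSV_TEXT = """F1,F2,F3
11,A,10.54
18,B,0.12,low
24,A,19.00
10,C,7.01,low
22,D,39.11,high
49,E,12.12
"""

bad_rows = []  # collects the fields of every line pandas considers malformed


def record_bad_line(fields):
    """Record the bad line, then return None so pandas skips it."""
    bad_rows.append(fields)
    return None


df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    sep=",",
    engine="python",               # callables require the python engine
    on_bad_lines=record_bad_line,  # assumes pandas >= 1.4
)

print(df)
print(bad_rows)
```

The callable receives the split fields of the bad line as a list of strings, not the line number, so this records the content of skipped lines rather than their positions. If line numbers are required, the log-based approach is still needed.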
