
Read CSV into a DataFrame, ignoring "LF" and "CR LF" within field enclosure quotes (" ")

I have a data file in which fields are enclosed in double quotes and separated by a custom field separator, as shown below:

field enclosure = "<field_value>"
sep = ||@@##

Some of the field values contain quoted text that includes 'LF' and 'CR LF' line separators. These cause the rest of the field to continue on a new line, which may be misinterpreted as a new record when in reality it is part of the same record, as it would be had the lines not been broken.

example:

3||@@##14||@@##"2016-01-13 19:59:27"||@@##"2016-01-15 23:09:19"||@@##1162||@@##822||@@##1237||@@##\N||@@##"VHiujdfYshv"||@@##"---<LF>
...<LF>
"||@@##\N||@@##"2016-01-15 23:09:18"||@@##0||@@##1||@@##0||@@##0||@@##3||@@##1788||@@##\N||@@##205||@@##\N||@@##0||@@##\N||@@##\N||@@##\N||@@##\N||@@##\N||@@##\N||@@##1||@@##\N||@@##"251 Bgegf BHVcvytd Street<CR LF>
JHbsdbfh, RF 35214<CR LF>
<CR LF>
xyz@gmail.dhg.com<CR LF>
<CR LF>
@@##1788<LF>
4||@@##14||@@##"2016-01-25 22:08:53"||@@##"2016-02-15 20:32:08"||@@##1097||@@##933||@@##1262||@@##\N||@@##"VHiujdfYshv"||@@##"--- <LF>
...<LF>

Please note that the LF and CR LF actually appear without the angle brackets, which is probably a given, but I am mentioning it for absolute clarity. Below is a snip of how that looks in Notepad++. Also note that my data uses '||@@##' as the field separator and '\N' for the na_values.

Below is how I am reading this file so far. I tried to use the 'quotechar' and 'quoting' params of pd.read_csv, but those seem to rely on the C parser, while my multi-character separator requires the Python parser, so the Python parser takes over. How do I read this file? Should I process it before reading it as a CSV, or use some regex while reading it? Please help.

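import pandas as pd

# z and filename are defined elsewhere (z is presumably an open zipfile.ZipFile)
df = pd.read_csv(z.open(filename),
                 encoding='utf8',
                 header=None,
                 sep=r'\|\|@@##',   # raw string: a multi-character separator is treated as a regex
                 na_values='\\N',   # a literal backslash followed by N marks a missing value
                 engine='python')   # regex separators require the Python parser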
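
One possible way to do the "process it before reading" route is sketched below. It is not a confirmed solution for this file, only an illustration under stated assumptions: the multi-character separator is swapped for a single character that is assumed never to occur in the data (the ASCII unit separator, `\x1f`), and the standard-library `csv` module then handles the quoted fields, including the LF and CR LF inside them. The names `parse_records`, `SEP`, `UNIT_SEP`, and the sample text are made up for this example.

```python
import csv
import io

SEP = "||@@##"       # field separator described in the question
UNIT_SEP = "\x1f"    # single-character stand-in; ASSUMED absent from the data


def parse_records(text):
    """Split raw text into rows, keeping newlines that sit inside quoted fields."""
    # A one-character delimiter lets the csv module apply its quote handling.
    swapped = text.replace(SEP, UNIT_SEP)
    # newline="" keeps embedded LF / CR LF inside quoted fields untouched.
    reader = csv.reader(io.StringIO(swapped, newline=""),
                        delimiter=UNIT_SEP, quotechar='"')
    rows = []
    for row in reader:
        cleaned = []
        for field in row:
            # Restore any separator text that happened to sit inside quotes.
            field = field.replace(UNIT_SEP, SEP)
            # Treat the literal \N marker as a missing value.
            cleaned.append(None if field == r"\N" else field)
        rows.append(cleaned)
    return rows


# Hypothetical two-record sample: the third field of record 3 spans three lines.
sample = (
    '3||@@##"a"||@@##"line1\nline2\r\nline3"||@@##\\N\n'
    '4||@@##"b"||@@##"x"||@@##7\n'
)
rows = parse_records(sample)
print(len(rows))  # number of logical records, not physical lines
```

The resulting list of rows could then be handed to `pd.DataFrame(rows)`. Alternatively, the swapped text itself might be passed to `pd.read_csv` with `sep='\x1f'`, `quotechar='"'`, `header=None` and `na_values='\\N'`, since a single-character separator no longer forces the Python parser; that variant is untested against the real file.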

[Screenshot: the file as displayed in Notepad++]
