I have a data file in which fields are enclosed in double quotes and separated by a multi-character field separator, as shown below:
field enclosure = "<field_value>"
sep = ||@@##
Some of the field values contain quoted text with 'LF' and 'CR LF' line breaks, which cause the rest of the record to continue on a new line. That continuation may be misinterpreted as a new record when, in reality, it is part of the same record and would have stayed on one line had the line breaks not been there.
example:
3||@@##14||@@##"2016-01-13 19:59:27"||@@##"2016-01-15 23:09:19"||@@##1162||@@##822||@@##1237||@@##\N||@@##"VHiujdfYshv"||@@##"---<LF>
...<LF>
"||@@##\N||@@##"2016-01-15 23:09:18"||@@##0||@@##1||@@##0||@@##0||@@##3||@@##1788||@@##\N||@@##205||@@##\N||@@##0||@@##\N||@@##\N||@@##\N||@@##\N||@@##\N||@@##\N||@@##1||@@##\N||@@##"251 Bgegf BHVcvytd Street<CR LF>
JHbsdbfh, RF 35214<CR LF>
<CR LF>
xyz@gmail.dhg.com<CR LF>
<CR LF>
@@##1788<LF>
4||@@##14||@@##"2016-01-25 22:08:53"||@@##"2016-02-15 20:32:08"||@@##1097||@@##933||@@##1262||@@##\N||@@##"VHiujdfYshv"||@@##"--- <LF>
...<LF>
Please note that the LF and CR LF markers actually show up without the angle brackets, which is probably a given, but I am mentioning it for absolute clarity. That is how they appear when the file is opened in Notepad++. Also, note that my data uses '||@@##' as the field separator, with '\N' for the na_values.
Below is how I am reading this file so far. I tried the 'quotechar' and 'quoting' parameters of pd.read_csv, but they seem to rely on the C parser, whereas my multi-character separator requires the Python parser, so the Python parser takes over. How do I read this file: should I preprocess it before reading it as a CSV, or use some regex while reading it? Please help.
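import pandas as pd

# z is presumably a zipfile.ZipFile opened earlier; filename is the member to read
df = pd.read_csv(z.open(filename),
                 encoding='utf8',
                 header=None,
                 sep=r'\|\|@@##',   # regex separator (raw string), needs the Python engine
                 na_values='\\N',
                 engine='python')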
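One possible way to do the "process it before reading" option is to split the raw text into records by hand, treating any separator or line break inside double quotes as data. The sketch below is only an illustration under some assumptions: the function name split_records and its parameters are made up for this example, a double quote is assumed never to occur inside a quoted value, and any field whose text is exactly \N is turned into None (even if it was quoted).

```python
SEP = "||@@##"   # field separator from the question
NA = r"\N"       # null marker from the question


def split_records(text, sep=SEP, quote='"', na=NA):
    """Return a list of records, each a list of field values.

    Separators and line breaks inside a quoted field are kept as data.
    Assumption: a quote character never appears inside a quoted value.
    """
    records, fields, buf = [], [], []
    in_quotes = False

    def close_field():
        # Finish the current field and map the null marker to None.
        value = "".join(buf)
        buf.clear()
        fields.append(None if value == na else value)

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == quote:
            in_quotes = not in_quotes      # the quote itself is dropped
            i += 1
        elif not in_quotes and text.startswith(sep, i):
            close_field()
            i += len(sep)
        elif not in_quotes and ch in "\r\n":
            # Outside quotes a line break ends the record; CR LF counts once.
            if ch == "\r" and text.startswith("\n", i + 1):
                i += 1
            close_field()
            if fields != [""]:             # skip blank lines
                records.append(fields)
            fields = []
            i += 1
        else:
            buf.append(ch)                 # ordinary data, or anything inside quotes
            i += 1
    if buf or fields:                      # last record without a trailing newline
        close_field()
        records.append(fields)
    return records
```

With the setup from the question, the text could be obtained with z.open(filename).read().decode('utf8') and the result passed to pd.DataFrame(records). A character-by-character loop like this is slow on very large files, so it is meant as a starting point rather than a tuned solution.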