I am facing an issue with escaping the delimiter inside a value. My code reads a PSV file. Lately the delimiter | has been appearing (prefixed with the escape character \) inside one of the column values, and because of this, records are being dropped. Please see the issue below.
Records
abcd|1234|222\|3344|count|33
abcd|1234|111\|5566|count|44
In this file the delimiter is |, and the valid values for the 3rd column are 222|3344 and 111|5566 respectively.
I am using the following syntax to read the file.
df_input = spark.read.format("csv").option("delimiter", "|").option("escape", "\\").load(var_files_path + "/*.psv", schema=input_schema)
When I read the file, a few records are skipped because of the delimiter inside the value. Can you please guide me on how to solve this issue? TIA.
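To illustrate the parsing the question is after, here is a minimal pure-Python sketch (no Spark) that splits only on pipes that are not preceded by the escape character; `split_unescaped_pipes` is a hypothetical helper name, not part of any library:

```python
import re

def split_unescaped_pipes(line):
    # Split only on pipes NOT preceded by a backslash,
    # then unescape the remaining "\|" sequences inside each field.
    fields = re.split(r"(?<!\\)\|", line)
    return [f.replace("\\|", "|") for f in fields]

print(split_unescaped_pipes(r"abcd|1234|222\|3344|count|33"))
# ['abcd', '1234', '222|3344', 'count', '33']
```

The negative lookbehind `(?<!\\)` is what keeps the escaped pipe inside the third field intact.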
Assuming PySpark's CSV reader follows the same conventions as Python's csv module, the default quotechar is ", which gives a clue about how Excel defined quoting in CSV: surround the whole value string with the quote character. Quoting is not an escape sequence that prefixes a single character.
Try this in a Python console:
>>> import csv
>>> import io
>>> i = io.StringIO('abcd|1234|"222|3344"|count|33')
>>> r = csv.reader(i, delimiter='|')
>>> r.__next__()
['abcd', '1234', '222|3344', 'count', '33']
>>> i = io.StringIO(r'abcd|1234|\222|3344\|count|33')
>>> r = csv.reader(i, delimiter='|', quotechar='\\')
>>> r.__next__()
['abcd', '1234', '222|3344', 'count', '33']
The PSV format is typically used for cases where the pipe character would not appear in the data, so no quoting would be needed. Maybe tab-separated values (TSV) would be easier in your case.
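To see what properly quoted PSV looks like, `csv.writer` with the default `QUOTE_MINIMAL` quoting will surround only the fields that contain the delimiter:

```python
import csv
import io

buf = io.StringIO()
# delimiter='|' makes this PSV; quoting defaults to csv.QUOTE_MINIMAL,
# so only fields containing '|' (or the quote char) get quoted.
w = csv.writer(buf, delimiter='|')
w.writerow(['abcd', '1234', '222|3344', 'count', '33'])
print(buf.getvalue())
# abcd|1234|"222|3344"|count|33
```

A file produced this way round-trips cleanly through the `csv.reader` call shown above.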
This solution uses an RDD:
rdd1 = rdd.map(lambda x: x.replace("\\|", ""))  # note: this removes the pipe as well as the backslash
I myself have used an RDD with a Python regex:
import re
raw_string = r"(\\\|)"  # matches a literal backslash followed by a pipe
rdd_cleaned = rdd.map(lambda x: re.sub(raw_string, "", x))  # same caveat: the escaped pipe is dropped
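Both snippets above strip the escaped pipe entirely (222\|3344 becomes 2223344). If the literal pipe should be kept inside the field, one variant is to swap the escaped pipe for a placeholder before splitting, then restore it; this is a sketch, assuming the placeholder byte never occurs in the real data, and `clean_line` is a hypothetical helper you would pass to `rdd.map`:

```python
import re

PLACEHOLDER = "\x00"  # assumed to never appear in the data

def clean_line(line):
    # Protect escaped pipes so the split on the real delimiter is safe,
    # then restore the literal pipe inside each field.
    protected = re.sub(r"\\\|", PLACEHOLDER, line)
    return [f.replace(PLACEHOLDER, "|") for f in protected.split("|")]

rows = [clean_line(l) for l in [r"abcd|1234|222\|3344|count|33",
                                r"abcd|1234|111\|5566|count|44"]]
# In Spark this would be applied as rdd.map(clean_line)
print(rows[0])
# ['abcd', '1234', '222|3344', 'count', '33']
```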