How to deal with multi-value lineterminators in pandas

Question

The CSV file I'm trying to parse uses the two-character sequence \x02\n as its line terminator. However, pandas only accepts a single character for lineterminator. For example:

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\x02")
>>> data.loc[100].tolist()
['\n1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1']

Or:

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\n")
>>> data.loc[100].tolist()
['1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1\x02']

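In both cases part of the terminator is left behind: a leading \n in the first field with lineterminator="\x02", or a trailing \x02 in the last field with lineterminator="\n". What is the best way to read this CSV file in pandas with the above line terminator?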

Answer 1

As of v0.23, pandas does not support multi-character line-terminators. Your code currently returns:

from io import StringIO  # pd.compat.StringIO no longer exists in current pandas
import pandas as pd

s = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02"
df = pd.read_csv(StringIO(s), sep="\x01", lineterminator="\x02", header=None)

df
        0   1      2
0    this  is   test
1  \nthis  is  test2

Since read_csv itself cannot consume the second character, the simplest fix (as of now) is to remove the leading whitespace from the first column after parsing. You can do this with str.lstrip.

df.iloc[:, 0] = df.iloc[:, 0].str.lstrip()
# Alternatively,
# df.iloc[:, 0] = [s.lstrip() for s in df.iloc[:, 0]]

df

      0   1      2
0  this  is   test
1  this  is  test2

If you need to strip several kinds of leftover terminator characters (not just the newline), pass them to lstrip as a single string:

line_terminators = ['\n', ...]
df.iloc[:, 0] = df.iloc[:, 0].str.lstrip(''.join(line_terminators))
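Another possibility, not part of the approach above, is to normalize the text before pandas sees it, so that no cleanup of the first column is needed afterwards. This is only a minimal sketch: the helper name read_with_two_char_terminator is made up for illustration, and it assumes the whole file fits in memory and that the two-character sequence never appears inside a field.

```python
from io import StringIO

import pandas as pd


def read_with_two_char_terminator(text, sep="\x01", terminator="\x02\n"):
    """Hypothetical helper: parse text whose rows end in a two-character terminator."""
    # Collapse the terminator to its first character, which read_csv
    # accepts as a single-character lineterminator.
    single = terminator[0]
    normalized = text.replace(terminator, single)
    return pd.read_csv(
        StringIO(normalized), sep=sep, lineterminator=single, header=None
    )


# Same sample data as above, with every row ending in \x02\n.
sample = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02\n"
df = read_with_two_char_terminator(sample)
print(df)
```

When reading from disk, opening the file with open(path, newline="") keeps Python from translating the line endings before the replace step runs.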
