
How to deal with multi-value lineterminators in pandas

I have \x02\n as the line terminator in a CSV file I'm trying to parse. However, pandas only accepts a single character for lineterminator, so I cannot pass both. For example:

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\x02")
>>> data.loc[100].tolist()
['\n1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1']

Or:

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\n")
>>> data.loc[100].tolist()
['1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1\x02']

In both cases part of the terminator is left in the data: a leading \n on the first field, or a trailing \x02 on the last. What would be the best way to read this CSV file in pandas with the two-character line terminator above?

As of v0.23, pandas does not support multi-character line-terminators. Your code currently returns:

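import io
import pandas as pd
# io.StringIO stands in for pd.compat.StringIO, which newer pandas releases no longer provide
s = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02"
df = pd.read_csv(
    io.StringIO(s), sep="\x01", lineterminator="\x02", header=None)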

df
        0   1      2
0    this  is   test
1  \nthis  is  test2

Within read_csv itself, your only option (as of now) is to parse with the single-character terminator and then remove the leading whitespace (here, the stray \n) from the first column. You can do this with str.lstrip.

df.iloc[:, 0] = df.iloc[:, 0].str.lstrip()
# Alternatively,
# df.iloc[:, 0] = [s.lstrip() for s in df.iloc[:, 0]]

df

      0   1      2
0  this  is   test
1  this  is  test2

If you have to handle stripping of multiple other kinds of line-terminators (besides just the newline), you can pass a string of them:

line_terminators = ['\n', ...]
df.iloc[:, 0] = df.iloc[:, 0].str.lstrip(''.join(line_terminators))
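If the file fits in memory, another route is to collapse the two-character terminator to a single character before pandas sees it, so no stripping is needed afterwards. The following is only a minimal sketch under some assumptions: the helper names normalize_terminator and read_multichar_csv are hypothetical (they are not part of pandas), the file is assumed to be UTF-8, and \x02 is assumed never to occur inside a field.

```python
import io


def normalize_terminator(raw: str, terminator: str = "\x02\n", single: str = "\x02") -> str:
    """Collapse a multi-character line terminator to a single character.

    Hypothetical helper: assumes `single` never appears inside a field.
    """
    return raw.replace(terminator, single)


def read_multichar_csv(path, sep="\x01", terminator="\x02\n", single="\x02", **kwargs):
    """Read a file whose rows end in `terminator` (hypothetical wrapper).

    Loads the whole file into memory, so it only suits files that fit in RAM.
    """
    # Imported here so the pure-string helper above can be used without pandas.
    import pandas as pd

    # newline="" keeps the raw line endings; the encoding is an assumption.
    with open(path, newline="", encoding="utf-8") as fh:
        raw = fh.read()
    cleaned = normalize_terminator(raw, terminator, single)
    return pd.read_csv(io.StringIO(cleaned), sep=sep, lineterminator=single, **kwargs)
```

Calling read_multichar_csv(file, header=None) would then stand in for the read_csv call from the question. Collapsing to \x02 rather than to \n is a deliberate choice in this sketch: it avoids introducing row breaks at any newline that happens to sit inside a field.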

