My code:
import pandas as pd
df = pd.read_csv('problematic.csv', sep='|',quotechar='"')
pandas_df = df.replace({r'\\r': ''}, regex=True)
pandas_df = pandas_df.replace({r'\\n': ''}, regex=True)
print(pandas_df.head())
My input:
ID | NAME | VILLAGE | PENSION -----HEADER
001 | XYZ | RAMG | 1500 -----ROW1
002 | DINAL
SHAMSUDH
DHON
| SHIWA | 2090
EXPECTED OUTPUT
ID | NAME | VILLAGE | PENSION
001 | XYZ | RAMG | 1500
002 | DINAL SHAMSUDH DHON | SHIWA | 2090
I suggest removing the redundant newline characters before reading the CSV into pandas. You can do so by opening the file and reading it with readlines(), which returns a list of lines. Then strip the newline from every item in the list that does not contain three | characters:
from io import StringIO
import pandas as pd

with open('problematic.csv') as f:
    text = f.readlines()
# A complete row has three '|' delimiters; lines with fewer are fragments, so drop their newline.
text = ' '.join([i.replace('\n', '').strip() if i.count('|') < 3 else i for i in text])
df = pd.read_csv(StringIO(text), sep='|', quotechar='"')
Output:
| | ID | NAME | VILLAGE | PENSION |
|---|---|---|---|---|
| 0 | 1 | XYZ | RAMG | 1500 |
| 1 | 2 | DINAL SHAMSUDH DHON | SHIWA | 2090 |
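One caveat with the join above: the last fragment of a broken record (here `| SHIWA | 2090`) also has fewer than three | characters, so its newline is removed as well, and any record that follows it would land on the same line. Below is a minimal sketch of a variant that buffers fragments until a full record has been collected. The names `merge_broken_rows` and `n_delims` are made up for illustration, the extra `003` row is invented test data, and the sketch assumes | never occurs inside a field and that the last field of a record is never split across lines.

```python
from io import StringIO

import pandas as pd


def merge_broken_rows(lines, n_delims=3, sep="|"):
    """Join physical lines until each logical record holds n_delims separators."""
    records, buffer = [], ""
    for line in lines:
        fragment = line.strip()
        if not fragment:
            continue  # skip blank lines
        # Glue the fragment onto whatever has been collected so far.
        buffer = f"{buffer} {fragment}" if buffer else fragment
        if buffer.count(sep) >= n_delims:
            records.append(buffer)
            buffer = ""
    if buffer:
        records.append(buffer)  # keep an incomplete trailing record rather than drop it
    return records


# Sample from the question plus one hypothetical extra row after the broken record.
sample = """ID | NAME | VILLAGE | PENSION
001 | XYZ | RAMG | 1500
002 | DINAL
SHAMSUDH
DHON
| SHIWA | 2090
003 | ABC | RAMG | 1200
"""

rows = merge_broken_rows(sample.splitlines())
# dtype=str keeps the leading zeros in ID, as in the expected output.
df = pd.read_csv(StringIO("\n".join(rows)), sep="|", dtype=str)
# Strip the padding around the delimiters from headers and values.
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())
print(df)
```

Reading with `dtype=str` is a choice made here to preserve IDs such as 001; drop it if numeric columns are preferred.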
Note that this example assumes that the -----HEADER annotation is not part of the CSV file. If it is, you can filter it out with replace().
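If the trailing annotations do appear in the file, the clean-up could be sketched as follows. This assumes every marker is a run of five dashes followed by a single word at the end of a line, as in the sample. A regular expression is used here instead of one replace() call per marker so that -----ROW1, -----ROW2 and so on are covered too. The helper name `strip_markers` is hypothetical.

```python
import re

# Assumed marker shape: optional spaces, five dashes, one word, end of line.
MARKER = re.compile(r"\s*-----\w+\s*$")


def strip_markers(line):
    """Remove a trailing annotation such as '-----HEADER' from one line."""
    return MARKER.sub("", line)


lines = [
    "ID | NAME | VILLAGE | PENSION -----HEADER",
    "001 | XYZ | RAMG | 1500 -----ROW1",
    "002 | DINAL",
]
cleaned = [strip_markers(line) for line in lines]
print(cleaned)
```

Lines without a marker pass through unchanged, so the helper can be applied to every line before the fragments are merged.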