I have a formatted text file and I can't figure out how to get read_csv in pandas to read it correctly. The regex works when used directly, but not in pandas.read_csv.
I think this should work with the default quoting=0 and without a regex:
import pandas as pd
from io import StringIO
s = " \"Random Text\" 1234.00 5678.00 9876.00 1 Z5 2 0 1 1.500 35.3 1.00 389 0.096000 10.00 15000.0 0.102 0.199 0.040 1 0 0 2900 N/A N/A N/A\n"
print(s)
df = pd.read_csv(StringIO(s), engine='python', header=None, delim_whitespace=True, quoting=0)
display(df)
but this produces "Random
and Text"
in separate columns.
Attempt 2 with regex:
sep_regex = r'\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)' # raw-string regex: match whitespace except within quotes
df = pd.read_csv(StringIO(s), header=None, sep=sep_regex, engine='python', warn_bad_lines=True)
display(df)
This correctly keeps the quoted text together but puts a NaN column between each pair of real columns.
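A likely explanation for the extra NaN columns, assuming the Python engine splits each line the way re.split does: the pattern contains a capturing group, and re.split returns the captured groups alongside the pieces. The sketch below uses only the standard library; the shortened sample line and the variable names are illustrative and not taken from the question.

```python
import re

# Shortened, illustrative sample line (not the full row from the question).
line = '"Random Text" 1234.00 5678.00 N/A'

# The pattern from the question: the inner (...) is a capturing group.
capturing = r'\s+(?=([^"]*"[^"]*")*[^"]*$)'
# Same pattern with a non-capturing group (?:...).
non_capturing = r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)'

# re.split inserts each group's value (None here, since the group never
# participates in a match) between the split pieces.
with_groups = re.split(capturing, line)
# With a non-capturing group only the fields themselves come back.
without_groups = re.split(non_capturing, line)

print(with_groups)
print(without_groups)
```

If this is indeed the cause, switching the group to (?:...) in sep_regex may remove the NaN columns, although the quote characters would then still be part of the first field.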
This should work:
df = pd.read_csv(StringIO(s), header=None, sep=r'\s+', quotechar='"')
print(df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0 Random Text 1234.0 5678.0 9876.0 1 Z5 2 0 1 1.5 35.3 1.0 389 0.096 10.0 15000.0 0.102 0.199 0.04 1 0 0 2900 NaN NaN NaN
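A quick way to check this result programmatically rather than by eye. This is a minimal sketch on a shortened, illustrative row; it assumes the default parser engine and default NA handling, and the variable names are made up for the example.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative row: one quoted field, numbers, a code, and N/A.
s = ' "Random Text" 1234.00 5678.00 9876.00 1 Z5 N/A\n'

# Same call as in the answer: whitespace separator plus an explicit quotechar.
df = pd.read_csv(StringIO(s), header=None, sep=r'\s+', quotechar='"')

first_field = df.iloc[0, 0]       # the quoted text, expected as one field
n_columns = df.shape[1]           # number of parsed columns
last_is_missing = pd.isna(df.iloc[0, n_columns - 1])  # N/A parsed as missing

print(first_field, n_columns, last_is_missing)
```

The same three checks can be run on the full row from the question, where the output above shows 26 columns.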
This worked for me:
df = pd.read_csv(StringIO(s), sep=None, engine='python',
header=None, quoting=0, skipinitialspace=True)
Output:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 Random Text 1234.0 5678.0 9876.0 1 Z5 2 0 1 1.5 ... 0.102 0.199 0.04 1 0 0 2900 NaN NaN NaN
[1 rows x 26 columns]