
How do I use pandas to read a CSV file whose last field contains the delimiter?

I have raw data in the following format:

JobID,Publish,Expire,TitleAndDetail
7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....
7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....

As you can see, the delimiter is a comma, but the last column is a block of free text that itself contains commas. I am using pandas' read_csv function to read this CSV file, but in the resulting DataFrame everything after the 4th comma on each line is lost.

raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       names=['JobID', 'Publish', 'Expire', 'TitleAndDetail'],
                       header=None
                       )

With Python's str.split() I can pass a maxsplit argument, which keeps all the content of the last column intact no matter how many commas it contains. Is there similar functionality in pandas?

So here's a bit of a hack you could try:

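raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       sep="\a"  # a character that never occurs in the data, so each line stays one field
                       ).squeeze("columns")  # the squeeze=True keyword was removed in pandas 2.0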

This should give you a Series in which each line of the file is a single string, because the commas are no longer treated as separators.

Then you can do:

df = raw_data.str.split(",", n=3, expand=True)  # n=3 splits at most 3 times, giving 4 columns
df.columns = ['JobID', 'Publish', 'Expire', 'TitleAndDetail']

That splits each line into 4 columns and renames them.
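
To make the idea concrete, here is a minimal sketch of the same read-one-column-then-split approach, run against the two sample rows from the question. The helper name read_with_free_text_tail, the in-memory SAMPLE string and the quoting=csv.QUOTE_NONE setting are assumptions added for illustration and are not part of the original answer. It also assumes the bell character "\a" never appears in the data.

```python
import csv
import io

import pandas as pd

# The sample rows from the question, held in memory instead of a file.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher      The job is,.....,.....\n"
)


def read_with_free_text_tail(source, n_columns=4):
    """Hypothetical helper: read each line whole, then split n_columns - 1 times."""
    lines = pd.read_csv(
        source,
        sep="\a",                # assumed never to occur in the data
        header=None,             # treat the header line as data for now
        names=["line"],
        quoting=csv.QUOTE_NONE,  # assumption: leave any quote characters untouched
    )["line"]
    parts = lines.str.split(",", n=n_columns - 1, expand=True)
    body = parts.iloc[1:].reset_index(drop=True)
    body.columns = list(parts.iloc[0])  # promote the first line to column names
    return body


df = read_with_free_text_tail(io.StringIO(SAMPLE))
print(df["TitleAndDetail"].tolist())
```

Here the header line is read as ordinary data and promoted afterwards, so the same split limit applies to every line of the file.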

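You can do it this way:

import csv
import pandas as pd

with open("file.csv", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=",")
    # keep the first three fields and re-join everything after them into one
    rows = [x[:3] + [','.join(x[3:])] for x in reader]

df = pd.DataFrame(rows)
df.columns = df.iloc[0]            # promote the first row to the header
df = df.reindex(df.index.drop(0))  # and drop that row from the data
print(df)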
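
As a rough illustration of the slice-and-rejoin step on its own, the sketch below applies it to the sample rows from the question using only the standard library. The function name rows_with_merged_tail and the in-memory SAMPLE string are made up for this example. One caveat worth keeping in mind: csv.reader strips quote characters from quoted fields, so re-joining with "," does not necessarily reproduce the original text if the last field contains quoted parts.

```python
import csv
import io

# The sample rows from the question, held in memory instead of a file.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher      The job is,.....,.....\n"
)


def rows_with_merged_tail(lines, n_columns=4):
    """Hypothetical helper: parse CSV lines, merging surplus fields into the last column."""
    keep = n_columns - 1
    reader = csv.reader(lines, delimiter=",")
    # The first `keep` fields stay as they are; the rest are glued back together.
    return [row[:keep] + [",".join(row[keep:])] for row in reader]


rows = rows_with_merged_tail(io.StringIO(SAMPLE))
header, records = rows[0], rows[1:]
print(header)
print(records[0][3])
```

With the header separated like this, pd.DataFrame(records, columns=header) would build the frame directly, without promoting a row afterwards.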

Read the file manually and then create the dataframe:

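import pandas as pd

rows = []

with open('somefile.csv') as f:
    keys = next(f).rstrip('\n').split(',')  # header row; strip the newline so the last key is clean
    for line in f:
        # maxsplit=3 leaves any commas in the last field untouched
        rows.append(dict(zip(keys, line.rstrip('\n').split(',', 3))))

df = pd.DataFrame(rows)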

str.split takes an optional maxsplit parameter that limits how many times it splits on the delimiter. Passing 3 means the commas in your last field are left alone (here s is one data line from the file):

>>> s.split(',', 3)
['7428',
 '17/12/2006 2:00:00 PM',
 '28/01/2007 2:00:00 PM',
 'Project Engineer - Mechanical      Looking,.....,....']

Next, we create a dictionary with the keys from the header row and the values from the data rows:

>>> f = 'JobID,Publish,Expire,TitleAndDetail'.split(',')
>>> dict(zip(f, s.split(',', 3)))
{'JobID': '7428',
 'Publish': '17/12/2006 2:00:00 PM',
 'Expire': '28/01/2007 2:00:00 PM',
 'TitleAndDetail': 'Project Engineer - Mechanical      Looking,.....,....'}

Finally, we collect these dictionaries in a list (rows) and pass it to pd.DataFrame to create the data frame.
