How to use pandas to read a CSV file whose last field contains the delimiter?

Question

I have raw data in the following format:

JobID,Publish,Expire,TitleAndDetail
7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....
7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....

As you can see, the delimiter is a comma, but the last column is a chunk of text that itself contains commas. I am using pandas' read_csv function to read this CSV file. However, in the resulting DataFrame the text after the 4th comma in each line is lost.

raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       names=['JobID', 'Publish', 'Expire', 'TitleAndDetail'],
                       header=None
                       )

With the string split() method, I can pass a maxsplit argument, which keeps all the content of the last column together no matter how many commas it contains. Is there similar functionality in pandas?

Answer 1

So here's a bit of a hack you could try:

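raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       sep="\a"  # bell character: not expected in the data, so each line stays whole
                       ).squeeze("columns")  # the squeeze=True keyword was removed in pandas 2.0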

Because the "\a" separator never occurs in the file, the commas are ignored and each line is read as one string, so this should give you a Series.

Then you can do:

df = raw_data.str.split(",", n=3, expand=True)  # n is the number of splits: 3 splits give 4 columns
df.columns = ['JobID', 'Publish', 'Expire', 'TitleAndDetail']

That should split each line into 4 columns and rename them.
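
For anyone who wants to try the read-then-split idea without the original file, here is a minimal sketch. It assumes the two sample rows from the question, fed in through io.StringIO in place of JobDetails.csv, and it reads the header as an ordinary line (header=None) so that it can be promoted to column names afterwards. The variable names are only illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the real file.
sample = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

# "\a" (the bell character) does not occur in the data, so every line lands in one column.
lines = pd.read_csv(io.StringIO(sample), sep="\a", header=None)[0]

# Split on at most 3 commas, so the fourth piece keeps its own commas.
parts = lines.str.split(",", n=3, expand=True)

# The first row is the header line; promote it to column names.
df = parts.iloc[1:].reset_index(drop=True)
df.columns = parts.iloc[0].tolist()

print(df)
```

Note that every column comes back as a string here, so JobID and the two date columns would still need converting if you want numeric or datetime values.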

Answer 2

You can do it this way:

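import csv
import pandas as pd

with open("file.csv", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=",")
    # keep the first 3 fields; re-join everything after them into one field
    rows = [x[:3] + [','.join(x[3:])] for x in reader]

df = pd.DataFrame(rows)
df.columns = df.iloc[0]            # the first row is the header
df = df.reindex(df.index.drop(0))  # drop the header row from the data
print(df)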
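
A self-contained variant of the same csv.reader idea, as a rough sketch: it assumes the sample rows from the question (passed through io.StringIO instead of file.csv) and pulls the header off the reader first, so it can be handed to the DataFrame constructor directly.

```python
import csv
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the real file.
sample = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=",")
header = next(reader)  # first line holds the column names

# Keep the first three fields and glue everything after them back together.
rows = [row[:3] + [",".join(row[3:])] for row in reader]

df = pd.DataFrame(rows, columns=header)
print(df)
```

Taking the header with next(reader) avoids having to overwrite the columns and drop row 0 afterwards. One caveat: csv.reader interprets quote characters, so quoted text in the last field will not be re-joined byte for byte.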

Answer 3

Read the file manually and then create the dataframe:

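import pandas as pd

rows = []

with open('somefile.csv') as f:
    keys = next(f).rstrip('\n').split(',')  # header row, without the trailing newline
    for line in f:
        # split on the first 3 commas only; strip the newline from the last field
        rows.append(dict(zip(keys, line.rstrip('\n').split(',', 3))))

df = pd.DataFrame(rows)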

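str.split takes an optional maxsplit parameter that limits the number of splits. Passing 3 means only the first three commas act as delimiters, so the commas in your last field are left alone:

>>> s = '7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....'
>>> s.split(',', 3)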
['7428',
 '17/12/2006 2:00:00 PM',
 '28/01/2007 2:00:00 PM',
 'Project Engineer - Mechanical      Looking,.....,....']

Next, we create a dictionary with the keys from the header row and the values from a data row:

>>> keys = 'JobID,Publish,Expire,TitleAndDetail'.split(',')
>>> dict(zip(keys, s.split(',', 3)))
{'JobID': '7428',
 'Publish': '17/12/2006 2:00:00 PM',
 'Expire': '28/01/2007 2:00:00 PM',
 'TitleAndDetail': 'Project Engineer - Mechanical      Looking,.....,....'}

Finally, we make a list of these dictionaries (in rows) and pass it to pd.DataFrame to create our data frame object.
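
Putting those steps together, here is a minimal sketch wrapped in a function. The name read_last_field_greedy is made up for illustration, and deriving the split limit from the header (instead of hard-coding 3) is my own assumption about how you might generalize it. The sample rows are the ones from the question.

```python
import io

import pandas as pd


def read_last_field_greedy(f):
    """Hypothetical helper: build a DataFrame from a file-like object,
    letting the last column absorb any extra commas."""
    keys = next(f).rstrip("\r\n").split(",")
    max_splits = len(keys) - 1  # 3 for the four-column header in the question
    rows = []
    for line in f:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        rows.append(dict(zip(keys, line.split(",", max_splits))))
    return pd.DataFrame(rows, columns=keys)


# Sample rows copied from the question; io.StringIO stands in for the real file.
sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

df = read_last_field_greedy(sample)
print(df)
```

With a real file you would call it as read_last_field_greedy(open(path)) inside a with block. Stripping the line ending before splitting matters: otherwise the last header name and every last field carry a trailing newline.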

3 answers

solution1 0 2018-09-26 06:43:45
solution2 0 2018-09-26 06:53:29
solution3 0 ACCEPTED 2018-09-26 06:56:01