I have raw data in the following format:
JobID,Publish,Expire,TitleAndDetail
7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical Looking,.....,....
7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher The job is,.....,.....
As you can see, the delimiter is a comma, but the last column is a chunk of free text that itself contains commas. I am using pandas' read_csv function to read this CSV file. However, in the resulting DataFrame the text after the 4th comma in each line is lost.
raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       names=['JobID', 'Publish', 'Expire', 'TitleAndDetail'],
                       header=None
                       )
If I were using string.split(), I could specify a maxsplit parameter, which would let me keep all the content in the last column even when it contains many commas. Is there similar functionality in pandas?
So here's a bit of a hack you could try:
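# read each line as a single field by using a separator that never occurs in the data
raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       sep="\a"
                       ).squeeze("columns")  # the squeeze=True argument was removed in pandas 2.0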
This should give you a Series in which each line is a single string, because the commas are no longer treated as separators.
Then you can do:
df = raw_data.str.split(",", n=3, expand=True)  # n=3 means at most 3 splits, i.e. 4 columns
df.columns = ['JobID', 'Publish', 'Expire', 'TitleAndDetail']
That should split each line into 4 columns and rename them.
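To make the idea above easy to try without the original file, here is a minimal sketch that runs on the two sample rows from the question held in memory. The helper name read_keep_trailing_commas and its n_columns parameter are made up for illustration, and the sketch assumes the "\a" character never appears in the data. It reads the header line as ordinary data and promotes it to column names afterwards, and it passes quoting=csv.QUOTE_NONE so that any quote characters in the free text are treated as plain text.

```python
import csv
import io

import pandas as pd

# The two sample rows from the question, held in memory instead of a file.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher The job is,.....,.....\n"
)


def read_keep_trailing_commas(source, n_columns=4):
    """Hypothetical helper: read whole lines, then split on the first n_columns - 1 commas."""
    lines = pd.read_csv(
        source,
        sep="\a",                # assumed never to appear in the data
        header=None,             # keep the header line as ordinary data for now
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )[0]
    parts = lines.str.split(",", n=n_columns - 1, expand=True)
    df = parts.iloc[1:].reset_index(drop=True)  # data rows
    df.columns = parts.iloc[0].tolist()         # first line supplies the column names
    return df


df = read_keep_trailing_commas(io.StringIO(SAMPLE))
print(df["TitleAndDetail"].tolist())
```

All columns come back as strings, so JobID and the two date columns would still need converting if you want numeric or datetime types.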
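You can do it this way:
import csv
import pandas as pd

with open("file.csv", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=",")
    # keep the first 3 fields and glue everything after them back together
    rows = [x[:3] + [','.join(x[3:])] for x in reader]
df = pd.DataFrame(rows)
df.columns = df.iloc[0]            # the first row holds the header
df = df.reindex(df.index.drop(0))  # drop the header row from the data
print(df)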
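A variant of the same csv.reader idea, sketched as a small function so it can be tried on in-memory data. The name read_merging_extra_fields and the n_columns parameter are hypothetical. It pulls the header off the reader first and hands it to the DataFrame constructor, which avoids the reindex step. Note that csv.reader applies its usual quoting rules, so quote characters in the free text may not survive the split-and-rejoin unchanged.

```python
import csv
import io

import pandas as pd


def read_merging_extra_fields(fp, n_columns=4):
    """Hypothetical helper: fold every field past the last column back into it."""
    reader = csv.reader(fp, delimiter=",")
    header = next(reader)  # first line holds the column names
    keep = n_columns - 1
    # "if row" skips blank lines, which csv.reader yields as empty lists
    rows = [row[:keep] + [",".join(row[keep:])] for row in reader if row]
    return pd.DataFrame(rows, columns=header)


sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
)
df = read_merging_extra_fields(sample)
print(df)
```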
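Read the file manually and then create the dataframe:
import pandas as pd

rows = []
with open('somefile.csv') as f:
    # strip the trailing newline so the last key is 'TitleAndDetail', not 'TitleAndDetail\n'
    keys = next(f).rstrip('\n').split(',')
    for line in f:
        # split on the first 3 commas only; the rest of the line is the last field
        rows.append(dict(zip(keys, line.rstrip('\n').split(',', 3))))
df = pd.DataFrame(rows)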
.split takes an optional parameter that limits the number of times it splits on the delimiter. Passing 3 means it splits only on the first three commas, so the commas in your last field are left alone:
>>> s.split(',', 3)
['7428',
'17/12/2006 2:00:00 PM',
'28/01/2007 2:00:00 PM',
'Project Engineer - Mechanical Looking,.....,....']
Next, we create a dictionary with the keys from the header row and the values from the data rows:
>>> f = 'JobID,Publish,Expire,TitleAndDetail'.split(',')
>>> dict(zip(f, s.split(',', 3)))
{'JobID': '7428',
'Publish': '17/12/2006 2:00:00 PM',
'Expire': '28/01/2007 2:00:00 PM',
'TitleAndDetail': 'Project Engineer - Mechanical Looking,.....,....'}
Finally, we make a list of these dictionaries (in rows) and pass it to pd.DataFrame to create our data frame object.
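Putting the steps above together, here is a minimal sketch that runs on the sample rows from the question. The function name read_with_maxsplit is made up for illustration. Instead of hard-coding 3, it derives the split limit from the number of header fields, and it strips line endings so neither the last key nor the last value carries a trailing newline.

```python
import io

import pandas as pd


def read_with_maxsplit(f):
    """Hypothetical helper: split each line at most len(keys) - 1 times."""
    keys = next(f).rstrip("\r\n").split(",")
    rows = []
    for line in f:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        rows.append(dict(zip(keys, line.split(",", len(keys) - 1))))
    return pd.DataFrame(rows, columns=keys)


sample = io.StringIO(
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,"
    "Project Engineer - Mechanical Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,"
    "Supply Teacher The job is,.....,.....\n"
)
df = read_with_maxsplit(sample)
print(df)
```

With a real file, the same function can be called inside a with open(...) block, since it only needs an object that yields lines.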