I have the following data in a csv file:
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd
the_data = """
ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}
DEF,2016-6-17 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}
DEF,2016-6-18 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}
DEF,2016-6-19 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}
DEF,2016-6-20 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}
"""
I read the data into a Pandas data frame, as follows:
df = pd.read_csv(StringIO(the_data), sep=',', header=None)
The 'Company' and 'Date' fields will never change.
However, the 'keys' inside the curly braces (e.g. "//PurpleCar", "//YellowCar", "//BlueCar", "//WhiteCar", "//BlackCar" and "NPO-GreenCar") are not static. They can (and will) change frequently.
(note: another script that I have outputs a dictionary and 'creates' this text file, hence this data structure)
I'd like to get the data frame to appear as follows so that I can use Matplotlib to create visualizations:
   Company       Date  Purple  Yellow  Blue  White-XYZ  Black  Pink  NPO-Green
0      ABC   2016-6-9     115     403    16          0      0     0          0
1      ABC  2016-6-10     219     381    90          0      0     0          0
2      ABC  2016-6-11     817      21    31          0      0     0          0
3      ABC  2016-6-12      80    2011  8888          0      0     0          0
4      ABC  2016-6-13      32      15     4          0      0     0          0
5      DEF  2016-6-16      32       0     0          0     15     4          3
6      DEF  2016-6-17      32       0     0          0     15     4          0
7      DEF  2016-6-18      32       0     0          0     15     4          7
8      DEF  2016-6-19      32       0     0          0     15     4         14
9      DEF  2016-6-20      32       0     0          0     15     4         21
The problems that I'm facing are:
a) moving the 'key' values up to the column headers
b) allowing the 'key' values to be dynamic (again, they can and will change)
c) removing the square brackets ('[' and ']')
d) removing the double slashes ('//')
e) removing the "L" following the numerical value
Points 'c', 'd' and 'e' above can be addressed with the approach from a related question.
It's points 'a' and 'b' that I'm struggling with.
Does anyone see a way to address these?
Thanks!
* UPDATE *
The data originally posted had a small mistake. Here is the corrected data:
the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""
The difference between this data and the original data is the double quotes (") before the opening curly brace ("{") and after the closing curly brace ("}").
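The quoting matters because the dictionary text itself contains commas. A minimal sketch of the difference, using two rows copied from the corrected data and assuming Python 3 with `io.StringIO`; the names `quoted`, `unquoted`, `df_quoted` and `df_unquoted` are only for illustration:

```python
from io import StringIO

import pandas as pd

# Two rows copied from the corrected data above (note the double quotes).
quoted = '''ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
'''

# The same rows with the double quotes stripped, as in the original post.
unquoted = quoted.replace('"', '')

df_quoted = pd.read_csv(StringIO(quoted), header=None)
df_unquoted = pd.read_csv(StringIO(unquoted), header=None)

# With quotes, the whole dictionary stays in a single cell.
print(df_quoted.shape)
# Without quotes, the commas inside the braces split the dictionary into extra fields.
print(df_unquoted.shape)
```

With the quotes in place, the fourth column holds the full dictionary string for each row, which is what makes it possible to parse that column on its own afterwards.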
I really don't think pandas can do much for you here. Your data is quite awkward and seems to me to be best dealt with using regular expressions. Here's my solution:
import re

import pandas as pd

static_cols = []
dynamic_cols = []
for line in the_data.splitlines():
    if line == '':
        continue
    # deal with static columns
    x = line.split(',')
    company, date, other = x[0:3]
    keys = ['Company', 'Date', 'Other']
    values = [company, date, other]
    d = {i: j for i, j in zip(keys, values)}
    static_cols.append(d)
    # deal with dynamic columns
    keys = re.findall(r'(?<=//)[^\']*', line)
    values = re.findall(r'\d+(?=L)', line)
    d = {i: j for i, j in zip(keys, values)}
    dynamic_cols.append(d)

df1 = pd.DataFrame(static_cols)
df2 = pd.DataFrame(dynamic_cols)
df = pd.concat([df1, df2], axis=1)
And the output:
Also, your data had an extra column after the date that I wasn't sure how to deal with, so I just called it 'Other'. It wasn't included in your desired output, so you can easily remove it if you want.
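One thing to keep in mind with the regex approach is that every captured count is a string, and a key that is absent from a row comes through as a missing value. A minimal sketch of dropping 'Other' and converting the counts, assuming a small stand-in frame (the values are copied from the first ABC and first DEF rows; the name `count_cols` is only for illustration):

```python
import pandas as pd

# Stand-in for the frame built by the regex approach: counts are strings,
# and keys missing from a row are empty (None here).
df = pd.DataFrame({
    'Company': ['ABC', 'DEF'],
    'Date': ['2016-6-9 0:00', '2016-6-16 0:00'],
    'Other': ['95', '0'],
    'Purple': ['115', '32'],
    'Yellow': ['403', None],
    'Black': [None, '15'],
})

# Remove the extra column that is not part of the desired output.
df = df.drop(columns=['Other'])

# Convert every remaining non-static column from strings to numbers.
count_cols = df.columns.difference(['Company', 'Date'])
for col in count_cols:
    df[col] = pd.to_numeric(df[col])

print(df)
```

Columns that still contain missing values end up as floating point; whether to fill them with zeros is a separate choice.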
Consider converting the dictionary column values to Python dictionaries using ast.literal_eval(), then casting each one to its own dataframe for a final merge with the original dataframe:
from io import StringIO
import pandas as pd
import ast
...
df = pd.read_csv(StringIO(the_data), header=None,
                 names=['Company', 'Date', 'Value', 'Dicts'])
dfList = []
for i in df['Dicts'].tolist():
    # strip the Python 2 long suffix so literal_eval can parse the string
    result = ast.literal_eval(i.replace('L]', ']'))
    # drop the leading double slashes from each key
    result = {k.replace('//', ''): v for k, v in result.items()}
    # each value is a one-element list, so this builds a one-row frame
    temp = pd.DataFrame(result)
    dfList.append(temp)
dictdf = pd.concat(dfList).reset_index(drop=True)
df = pd.merge(df, dictdf, left_index=True, right_index=True).drop(['Dicts'], axis=1)
print(df)
#    Company            Date  Value  Black    Blue  NPO-Green  Pink  Purple  White-XYZ  Yellow
# 0      ABC   2016-6-9 0:00     95    NaN    16.0        NaN   NaN     115        0.0   403.0
# 1      ABC  2016-6-10 0:00      0    NaN    90.0        NaN   NaN     219        0.0   381.0
# 2      ABC  2016-6-11 0:00      0    NaN    31.0        NaN   NaN     817        0.0    21.0
# 3      ABC  2016-6-12 0:00      0    NaN  8888.0        NaN   NaN      80        0.0  2011.0
# 4      ABC  2016-6-13 0:00      0    NaN     4.0        NaN   NaN      32        0.0    15.0
# 5      DEF  2016-6-16 0:00      0   15.0     NaN        3.0   4.0      32        NaN     NaN
# 6      DEF  2016-6-17 0:00      0   15.0     NaN        0.0   4.0      32        NaN     NaN
# 7      DEF  2016-6-18 0:00      0   15.0     NaN        7.0   4.0      32        NaN     NaN
# 8      DEF  2016-6-19 0:00      0   15.0     NaN       14.0   4.0      32        NaN     NaN
# 9      DEF  2016-6-20 0:00      0   15.0     NaN       21.0   4.0      32        NaN     NaN
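The printed result still differs from the layout requested in the question: missing keys are NaN rather than 0, the counts are floats, and the date keeps its time part. A possible follow-up, sketched on a small stand-in frame with values taken from rows 0 and 5 of the output above; the name `count_cols` and the choice to keep the date as a string are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in for part of the merged frame printed above (rows 0 and 5).
df = pd.DataFrame({
    'Company': ['ABC', 'DEF'],
    'Date': ['2016-6-9 0:00', '2016-6-16 0:00'],
    'Value': [95, 0],
    'Black': [np.nan, 15.0],
    'Blue': [16.0, np.nan],
    'Purple': [115, 32],
})

# Treat every column other than the static ones as a count column,
# so the code does not depend on which keys happen to be present.
count_cols = df.columns.difference(['Company', 'Date', 'Value'])

# Replace missing counts with 0, then cast the counts back to integers.
df = df.fillna({c: 0 for c in count_cols})
df = df.astype({c: int for c in count_cols})

# Keep only the date part of the 'Date' string (drops the trailing ' 0:00').
df['Date'] = df['Date'].str.split().str[0]

print(df)
```

Filling with 0 assumes that an absent key really means a count of zero, which matches the desired output shown in the question.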