Wrangling a data frame in Pandas (Python)

I have the following data in a csv file:

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}
DEF,2016-6-17 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}
DEF,2016-6-18 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}
DEF,2016-6-19 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}
DEF,2016-6-20 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}
"""

I read the data into a Pandas data frame, as follows:

df = pd.read_csv(StringIO(the_data), sep=',', header=None)

The 'Company' and 'Date' fields will never change.

However, the 'keys' inside the curly braces (e.g. "//Purple", "//Yellow", "//Blue", "//White-XYZ", "//Black", "//Pink" and "//NPO-Green") are not static. They can (and will) change frequently.

(note: another script that I have outputs a dictionary and 'creates' this text file, hence this data structure)

I'd like to get the data frame to appear as follows so that I can use Matplotlib to create visualizations:

   Company  Date       Purple   Yellow   Blue    White-XYZ   Black  Pink   NPO-Green  

0  ABC     2016-6-9    115      403      16      0            0     0      0
1  ABC     2016-6-10   219      381      90      0            0     0      0
2  ABC     2016-6-11   817      21       31      0            0     0      0
3  ABC     2016-6-12   80       2011     8888    0            0     0      0
4  ABC     2016-6-13   32       15       4       0            0     0      0
5  DEF     2016-6-16   32       0        0       0            15    4      3
6  DEF     2016-6-17   32       0        0       0            15    4      0
7  DEF     2016-6-18   32       0        0       0            15    4      7
8  DEF     2016-6-19   32       0        0       0            15    4      14
9  DEF     2016-6-20   32       0        0       0            15    4      21

The problems that I'm facing are:

a) moving the 'key' values up to the column headers

b) allowing the 'key' values to be dynamic (again, they can and will change)

c) removing the square braces ( '[' and ']' )

d) removing the double slashes ( '//' )

e) removing the "L" following the numerical value

Points 'c', 'd' and 'e' above can be handled with the approach from the following related question:

How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)

It's points 'a' and 'b' that are the ones I'm struggling with.

Does anyone see a way to address these?

Thanks!

* UPDATE *

The data originally posted had a small mistake. Here is the corrected data:

the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""

The difference between this data and the original data is the double quotation marks (") before the opening curly brace ( "{" ) and after the closing curly brace ( "}" ).
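To illustrate why those quotation marks matter, here is a small sketch. It uses a shortened, made-up two-key version of one row (the variable names are only for illustration) and compares how pd.read_csv splits the line with and without the quotes:

```python
from io import StringIO

import pandas as pd

# Shortened sample rows (two keys only), with and without the surrounding quotes.
unquoted = "ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L]}\n"
quoted = "ABC,2016-6-9 0:00,95,\"{'//Purple': [115L], '//Yellow': [403L]}\"\n"

df_unquoted = pd.read_csv(StringIO(unquoted), header=None)
df_quoted = pd.read_csv(StringIO(quoted), header=None)

# Without quotes, the comma inside the dictionary text is treated as a
# field separator, so the dictionary is broken across several columns.
n_unquoted = df_unquoted.shape[1]

# With quotes, the whole dictionary text stays in a single column.
n_quoted = df_quoted.shape[1]

print(n_unquoted, n_quoted)
```

With the quoted form, the dictionary text arrives intact in the fourth column and can be parsed as one unit.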

I really don't think pandas can do much for you here. Your data is very awkwardly structured and seems to me to be best dealt with using regular expressions. Here's my solution:

import re

static_cols = []
dynamic_cols = []
for line in the_data.splitlines():
    if line == '':
        continue

    # deal with static columns
    x = line.split(',')
    company, date, other = x[0:3]
    keys = ['Company', 'Date', 'Other']
    values = [company, date, other]
    d = {i: j for i, j in zip(keys, values)}
    static_cols.append(d)

    # deal with dynamic columns
    keys = re.findall(r'(?<=//)[^\']*', line)
    values = re.findall(r'\d+(?=L)', line)
    d = {i: j for i, j in zip(keys, values)}
    dynamic_cols.append(d)

df1 = pd.DataFrame(static_cols)
df2 = pd.DataFrame(dynamic_cols)
df = pd.concat([df1, df2], axis=1)

And the output:

(image of the resulting data frame, not reproduced here)

Also, your data had an extra column after the date that I wasn't sure how to deal with, so I just called it 'Other'. It wasn't included in your desired output, so you can easily drop it if you want.
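As a worked example of the two regular expressions used above, here is what they extract from a single row. The int() conversion is an addition that the code above does not perform (it keeps the values as strings), and the pairing of keys with values assumes every key has exactly one number after it:

```python
import re

line = ("DEF,2016-6-16 0:00,0,\"{'//Purple': [32L], '//Black': [15L], "
        "'//Pink': [4L], '//NPO-Green': [3L]}\"")

# Everything after a '//' up to the next apostrophe is a key.
keys = re.findall(r"(?<=//)[^']*", line)

# Every run of digits immediately followed by 'L' is a value.
# Converting to int is an addition for illustration.
values = [int(v) for v in re.findall(r"\d+(?=L)", line)]

# Pair keys and values by position.
row = dict(zip(keys, values))
print(row)
```

Because the keys are discovered by the pattern rather than hard-coded, new key names in later rows simply become new columns when the list of dictionaries is passed to pd.DataFrame.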

Consider converting the values in the dictionary column to Python dictionaries with ast.literal_eval(), turning each one into its own data frame, and then merging the result with the original data frame:

from io import StringIO
import pandas as pd

import ast
...

df = pd.read_csv(StringIO(the_data), header=None, 
                 names=['Company', 'Date', 'Value', 'Dicts'])

dfList = []
for i in df['Dicts'].tolist():
    result = ast.literal_eval(i.replace('L]', ']'))            
    result = {k.replace('//',''):v for k,v in result.items()}
    temp = pd.DataFrame(result)
    dfList.append(temp)

dictdf = pd.concat(dfList).reset_index(drop=True)
df = pd.merge(df, dictdf, left_index=True, right_index=True).drop(['Dicts'], axis=1)
print(df)

#   Company            Date  Value  Black    Blue  NPO-Green  Pink  Purple  White-XYZ  Yellow
# 0     ABC   2016-6-9 0:00     95    NaN    16.0        NaN   NaN     115        0.0   403.0
# 1     ABC  2016-6-10 0:00      0    NaN    90.0        NaN   NaN     219        0.0   381.0
# 2     ABC  2016-6-11 0:00      0    NaN    31.0        NaN   NaN     817        0.0    21.0
# 3     ABC  2016-6-12 0:00      0    NaN  8888.0        NaN   NaN      80        0.0  2011.0
# 4     ABC  2016-6-13 0:00      0    NaN     4.0        NaN   NaN      32        0.0    15.0
# 5     DEF  2016-6-16 0:00      0   15.0     NaN        3.0   4.0      32        NaN     NaN
# 6     DEF  2016-6-17 0:00      0   15.0     NaN        0.0   4.0      32        NaN     NaN
# 7     DEF  2016-6-18 0:00      0   15.0     NaN        7.0   4.0      32        NaN     NaN
# 8     DEF  2016-6-19 0:00      0   15.0     NaN       14.0   4.0      32        NaN     NaN
# 9     DEF  2016-6-20 0:00      0   15.0     NaN       21.0   4.0      32        NaN     NaN
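The output above leaves NaN where a key is absent, while the question asks for 0 and whole numbers. A possible variation is sketched below, assuming each dictionary value is a one-element list as in the sample data. The helper name parse_counts and the shortened two-row sample are hypothetical, and the regular expression that strips the Python 2 'L' suffix is a swapped-in alternative to the replace('L]', ']') call used above:

```python
import ast
import re
from io import StringIO

import pandas as pd

# Shortened two-row sample in the same layout as the corrected data.
sample = '''ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L]}"
'''

df = pd.read_csv(StringIO(sample), header=None,
                 names=['Company', 'Date', 'Value', 'Dicts'])


def parse_counts(text):
    """Turn "{'//Purple': [115L]}" into {'Purple': 115} (hypothetical helper)."""
    # Drop the Python 2 long-integer suffix so literal_eval accepts the text.
    cleaned = re.sub(r'(\d+)L\b', r'\1', text)
    parsed = ast.literal_eval(cleaned)
    # Assumes each value is a one-element list; keep only that element.
    return {key.replace('//', ''): value[0] for key, value in parsed.items()}


# One dictionary per row; pandas takes the union of all keys as columns.
counts = pd.DataFrame([parse_counts(t) for t in df['Dicts']], index=df.index)

# Keys missing from a row become NaN; replace them with 0 and restore ints.
counts = counts.fillna(0).astype(int)

result = df.drop(columns='Dicts').join(counts)
print(result)
```

Building one data frame from a list of dictionaries avoids creating and concatenating a separate data frame per row, and the columns still adapt to whatever keys appear in the data.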
