
Wrangling a data frame in Pandas (Python)

I have the following data in a CSV file:

from io import StringIO  # Python 3 location of StringIO (the Python 2 module was StringIO)
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}
DEF,2016-6-17 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}
DEF,2016-6-18 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}
DEF,2016-6-19 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}
DEF,2016-6-20 0:00,0,{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}
"""

I read the data into a Pandas data frame as follows:

df = pd.read_csv(StringIO(the_data), sep=',', header=None)

The 'Company' and 'Date' fields will never change.

However, the 'keys' inside the curly braces (e.g. "//Purple", "//Yellow", "//Blue", "//White-XYZ", "//Black", "//Pink" and "//NPO-Green") are not static. They can (and will) change frequently.

(Note: another script of mine outputs a dictionary and creates this text file, hence this data structure.)

I'd like the data frame to look as follows so that I can use Matplotlib to create visualizations:

   Company  Date       Purple  Yellow  Blue  White-XYZ  Black  Pink  NPO-Green
0  ABC      2016-6-9   115     403     16    0          0      0     0
1  ABC      2016-6-10  219     381     90    0          0      0     0
2  ABC      2016-6-11  817     21      31    0          0      0     0
3  ABC      2016-6-12  80      2011    8888  0          0      0     0
4  ABC      2016-6-13  32      15      4     0          0      0     0
5  DEF      2016-6-16  32      0       0     0          15     4     3
6  DEF      2016-6-17  32      0       0     0          15     4     0
7  DEF      2016-6-18  32      0       0     0          15     4     7
8  DEF      2016-6-19  32      0       0     0          15     4     14
9  DEF      2016-6-20  32      0       0     0          15     4     21

The problems that I'm facing are:

a) moving the 'key' values up to the column headers

b) allowing the 'key' values to be dynamic (again, they can and will change)

c) removing the square brackets ( '[' and ']' )

d) removing the double slashes ( '//' )

e) removing the "L" following the numerical value

Points 'c', 'd' and 'e' above are addressed in the following related question:

How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)
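As a rough illustration of points c, d and e on a single cell (this is only one possible sketch, not the approach from the linked question), the string can be cleaned and then parsed as a Python literal. The helper name parse_cell is hypothetical, and the sketch assumes every list holds exactly one number.

```python
import ast
import re


def parse_cell(cell):
    """Turn "{'//Purple': [115L], ...}" into {'Purple': 115, ...}."""
    # Point e: drop the Python 2 long suffix, i.e. an "L" right after a digit.
    cleaned = re.sub(r'(\d)L\b', r'\1', cell)
    parsed = ast.literal_eval(cleaned)
    # Point d: strip the leading slashes from each key.
    # Point c: unwrap the one-element list so only the number remains.
    return {key.lstrip('/'): values[0] for key, values in parsed.items()}


example = "{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
print(parse_cell(example))
# {'Purple': 115, 'Yellow': 403, 'Blue': 16, 'White-XYZ': 0}
```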

Points 'a' and 'b' are the ones I'm struggling with.

Does anyone see a way to address these?

Thanks!

* UPDATE *

The data originally posted had a small mistake. Here is the corrected data:

the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""

The difference between this data and the original data is the double quotes (") before the opening curly brace ( "{" ) and after the closing curly brace ( "}" ).
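To see why the added quotes matter, here is a small sketch using Python's standard csv module, which treats the double quote as the quote character, as pandas.read_csv also does by default. The two sample lines are shortened versions of the rows above, and the variable names are only illustrative.

```python
import csv
from io import StringIO

unquoted = "ABC,2016-6-9 0:00,95,{'//Purple': [115L], '//Yellow': [403L]}"
quoted = "ABC,2016-6-9 0:00,95,\"{'//Purple': [115L], '//Yellow': [403L]}\""

# Without quotes, the comma inside the braces is taken as a field separator.
unquoted_fields = next(csv.reader(StringIO(unquoted)))
# With quotes, the whole dictionary text stays together as one field.
quoted_fields = next(csv.reader(StringIO(quoted)))

print(len(unquoted_fields))  # 5
print(len(quoted_fields))    # 4
print(quoted_fields[3])      # {'//Purple': [115L], '//Yellow': [403L]}
```

With the quoted form, the dictionary text arrives as a single column, which is what makes it possible to parse it per row afterwards.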

I really don't think pandas can do much for you here. Your data is very obtuse and seems to me to be best dealt with using regular expressions. Here's my solution:

import re

static_cols = []
dynamic_cols = []
for line in the_data.splitlines():
    # skip the empty lines at the start and end of the string
    if line == '':
        continue

    # deal with static columns: only the first three comma-separated fields are used
    x = line.split(',')
    company, date, other = x[0:3]
    keys = ['Company', 'Date', 'Other']
    values = [company, date, other]
    static_cols.append(dict(zip(keys, values)))

    # deal with dynamic columns: key names follow '//', values are the digits before 'L'
    keys = re.findall(r'(?<=//)[^\']*', line)
    values = re.findall(r'\d+(?=L)', line)
    dynamic_cols.append(dict(zip(keys, values)))

# one row per input line; a key that a row lacks becomes NaN in that row
df1 = pd.DataFrame(static_cols)
df2 = pd.DataFrame(dynamic_cols)
df = pd.concat([df1, df2], axis=1)

And the output:

[image: the resulting data frame]

Also, your data had an extra column after the date that I wasn't sure how to deal with, so I just called it 'Other'. It wasn't included in your desired output, so you can easily remove it if you want.
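A minimal sketch of removing 'Other' and turning the regex matches, which are strings, into numbers might look like this. The small frame below is only a stand-in for the one built above, and the names static_cols, value_cols and result are illustrative.

```python
import pandas as pd

# Stand-in for the frame built by the regex approach: every value is a string,
# and keys that a row did not contain show up as missing values.
df = pd.DataFrame({
    'Company': ['ABC', 'DEF'],
    'Date': ['2016-6-9 0:00', '2016-6-16 0:00'],
    'Other': ['95', '0'],
    'Purple': ['115', '32'],
    'Yellow': ['403', None],
    'Black': [None, '15'],
})

# Drop the extra column that is not part of the desired output.
df = df.drop(columns=['Other'])

# Convert every non-static column from digit strings to numbers.
static_cols = ['Company', 'Date']
value_cols = [c for c in df.columns if c not in static_cols]
numeric = df[value_cols].apply(pd.to_numeric)
result = pd.concat([df[static_cols], numeric], axis=1)
print(result)
```

Missing keys are still missing after this step; whether to fill them with zeros is a separate decision.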

Consider converting the dictionary column values to Python dictionaries using ast.literal_eval(), then casting each one to its own data frame for a final merge with the original data frame:

from io import StringIO
import pandas as pd

import ast
...

df = pd.read_csv(StringIO(the_data), header=None, 
                 names=['Company', 'Date', 'Value', 'Dicts'])

dfList = []
for i in df['Dicts'].tolist():
    result = ast.literal_eval(i.replace('L]', ']'))            
    result = {k.replace('//',''):v for k,v in result.items()}
    temp = pd.DataFrame(result)
    dfList.append(temp)

dictdf = pd.concat(dfList).reset_index(drop=True)
df = pd.merge(df, dictdf, left_index=True, right_index=True).drop(['Dicts'], axis=1)
print(df)

#   Company            Date  Value  Black    Blue  NPO-Green  Pink  Purple  White-XYZ  Yellow
# 0     ABC   2016-6-9 0:00     95    NaN    16.0        NaN   NaN     115        0.0   403.0
# 1     ABC  2016-6-10 0:00      0    NaN    90.0        NaN   NaN     219        0.0   381.0
# 2     ABC  2016-6-11 0:00      0    NaN    31.0        NaN   NaN     817        0.0    21.0
# 3     ABC  2016-6-12 0:00      0    NaN  8888.0        NaN   NaN      80        0.0  2011.0
# 4     ABC  2016-6-13 0:00      0    NaN     4.0        NaN   NaN      32        0.0    15.0
# 5     DEF  2016-6-16 0:00      0   15.0     NaN        3.0   4.0      32        NaN     NaN
# 6     DEF  2016-6-17 0:00      0   15.0     NaN        0.0   4.0      32        NaN     NaN
# 7     DEF  2016-6-18 0:00      0   15.0     NaN        7.0   4.0      32        NaN     NaN
# 8     DEF  2016-6-19 0:00      0   15.0     NaN       14.0   4.0      32        NaN     NaN
# 9     DEF  2016-6-20 0:00      0   15.0     NaN       21.0   4.0      32        NaN     NaN
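The output above has NaN where a row lacked a key, which also turns those columns into floats. If a missing key should count as zero, as in the desired layout from the question (that is an assumption about the data, not something pandas decides), a possible follow-up is sketched below on a small stand-in frame. It also parses the date strings, which tends to help when plotting.

```python
import pandas as pd

# Stand-in for the merged frame printed above (two rows, three value columns).
df = pd.DataFrame({
    'Company': ['ABC', 'DEF'],
    'Date': ['2016-6-9 0:00', '2016-6-16 0:00'],
    'Value': [95, 0],
    'Black': [float('nan'), 15.0],
    'Purple': [115, 32],
    'Yellow': [403.0, float('nan')],
})

static_cols = ['Company', 'Date', 'Value']
value_cols = [c for c in df.columns if c not in static_cols]

# Treat an absent key as a count of zero, then store the counts as integers.
for col in value_cols:
    df[col] = df[col].fillna(0).astype(int)

# Parse the date strings into timestamps.
df['Date'] = pd.to_datetime(df['Date'])
print(df)
```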
