
Removing square brackets and apostrophes from a dataframe

I am a surgeon trying to analyse some patient data. I have a dataframe of patients (271x15) who have had multiple operations. It was derived from a larger (4010x71) dataframe of single operations, with much help from @Arne: essentially (see the original post) by using a pivot table and then looking for patients with multiple (>=2) operations. This is great. I am interested in the first two operations and their dates, so that I can get the number of days between them and see how long an implant lasted. The head of the dataframe is shown here, with the patient ID and the codes (V011 and V014) for the insertion and removal of the implant.

                                 OPERTN_01      OPDATE_01
ID      
11                              [V011, V014]    [2016-06-21, 2017-02-27]
13                              [V011, V014]    [2016-07-14, 2016-01-14]
14                              [V014, V011]    [2014-02-25, 2014-07-01]
15                              [V014, V011]    [2014-06-26, 2015-04-16]

I was hoping to subtract the dates of the two operations by

  1. Removing the square brackets
  2. Splitting the lists (tuples?) into two columns
  3. Ensuring the dates are pandas datetimes
  4. Subtracting the two dates.

I am stuck at removing the brackets. I have tried df.replace("[", ""), which has no effect on the dataframe or on the series OPERTN_01. Ideally I would like to remove the square brackets throughout the dataframe rather than column by column.

The lists in this dataframe (thanks @Arne) have given great descriptive statistics but are difficult for me to manipulate.

I also have the problem that the dates in OPDATE_01 are not sorted, so the difference between the dates is often negative. It could be that I am trying to do too much at once, of course.

Are you looking for something like this?

from io import StringIO
import ast
import pandas as pd

# ------ create sample data ------
s = """ID;OPERTN_01;OPDATE_01
11;["V011", "V014"];["2016-06-21", "2017-02-27"]
13;["V011", "V014"];["2016-07-14", "2016-01-14"]
14;["V014", "V011"];["2014-02-25", "2014-07-01"]
15;["V014", "V011"];["2014-06-26", "2015-04-16"]"""

df = pd.read_csv(StringIO(s), sep=';')
df['OPERTN_01'] = df['OPERTN_01'].apply(ast.literal_eval)
df['OPDATE_01'] = df['OPDATE_01'].apply(ast.literal_eval)
df = df.set_index('ID')

# ------ end sample data ------

# list comprehension to sort and convert str to datetime
df['OPDATE_01'] = [sorted([pd.to_datetime(x[0]), pd.to_datetime(x[1])]) for x in df['OPDATE_01']]

# if your values in the list are already datetime then ignore what is above and do
# df['OPDATE_01'] = df['OPDATE_01'].apply(sorted)

# apply pd.Series to explode your list into columns and then rename col if you want
date = df['OPDATE_01'].apply(pd.Series).rename(columns={0:'OPDATE_01_0', 1:'OPDATE_01_1'})
# calculate the difference between dates
date.diff(axis=1)

   OPDATE_01_0 OPDATE_01_1
ID                        
11         NaT    251 days
13         NaT    182 days
14         NaT    126 days
15         NaT    294 days
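
On the brackets themselves: judging by the sample output, each cell most likely holds a Python list rather than a string, which would explain why df.replace("[", "") finds nothing to change. The brackets and commas are only how pandas displays a list. The sketch below is a minimal illustration on a two-row toy frame (the frame and the names cell_types, first_op and second_op are made up for the example); it checks the cell type and pulls the elements out by position.

```python
import pandas as pd

# Toy frame mirroring the sample data: every cell is a Python list
df = pd.DataFrame(
    {"OPERTN_01": [["V011", "V014"], ["V014", "V011"]]},
    index=pd.Index([11, 14], name="ID"),
)

# The brackets are not characters stored in the data;
# they are just how a list object is displayed
cell_types = df["OPERTN_01"].map(type).unique().tolist()
print(cell_types)

# Selecting elements by position yields plain values with no brackets
first_op = df["OPERTN_01"].str[0]
second_op = df["OPERTN_01"].str[1]
print(first_op.tolist(), second_op.tolist())
```

If the cells turned out to be strings that merely look like lists, ast.literal_eval (as used for the sample data above) would convert them to real lists first.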

Or, to keep the OPERTN_01 column alongside the dates:

# list comprehension to sort and convert list to datetime
df['OPDATE_01'] = [sorted([pd.to_datetime(x[0]), pd.to_datetime(x[1])]) for x in df['OPDATE_01']]

# if your values in the list are already datetime then ignore what is above and do
# df['OPDATE_01'] = df['OPDATE_01'].apply(sorted)

# apply pd.Series to explode your list into columns and then rename col if you want
date = df['OPDATE_01'].apply(pd.Series).rename(columns={0:'OPDATE_01_0', 1:'OPDATE_01_1'})
# merge two frames on ID to maintain all columns
m = df['OPERTN_01'].to_frame().merge(date, left_index=True, right_index=True)
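# calc diff and assign to new column
# (subtract the two date columns directly; m also holds the list column OPERTN_01,
#  so a row-wise diff over the whole frame is not reliable)
m['diff'] = m['OPDATE_01_1'] - m['OPDATE_01_0']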

       OPERTN_01 OPDATE_01_0 OPDATE_01_1     diff
ID                                               
11  [V011, V014]  2016-06-21  2017-02-27 251 days
13  [V011, V014]  2016-01-14  2016-07-14 182 days
14  [V014, V011]  2014-02-25  2014-07-01 126 days
15  [V014, V011]  2014-06-26  2015-04-16 294 days
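
One caveat worth checking: sorting only the dates leaves the codes in OPERTN_01 in their original order, so for ID 13 the code/date pairing may no longer line up. The sketch below assumes that position i in OPERTN_01 belongs to position i in OPDATE_01 (an assumption about how the lists were built, not something confirmed here) and sorts each code together with its date. The column names first_op, first_date, second_op, second_date and days_between are hypothetical.

```python
import pandas as pd

# Two rows from the sample data; ID 13 has its dates out of order
df = pd.DataFrame(
    {
        "OPERTN_01": [["V011", "V014"], ["V011", "V014"]],
        "OPDATE_01": [["2016-06-21", "2017-02-27"], ["2016-07-14", "2016-01-14"]],
    },
    index=pd.Index([11, 13], name="ID"),
)

records = []
for codes, dates in zip(df["OPERTN_01"], df["OPDATE_01"]):
    # Pair each date with its code, then sort the pairs by date
    pairs = sorted(zip(pd.to_datetime(dates), codes))
    (first_date, first_op), (second_date, second_op) = pairs[0], pairs[1]
    records.append(
        {
            "first_op": first_op,
            "first_date": first_date,
            "second_op": second_op,
            "second_date": second_date,
        }
    )

paired = pd.DataFrame(records, index=df.index)

# Whole days between the first and second operation
paired["days_between"] = (paired["second_date"] - paired["first_date"]).dt.days
print(paired)
```

With this pairing, ID 13 comes out as V014 followed by V011, which matters if the interval is meant to run from insertion to removal.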

Per your comment:

# just changing variable name to match your comment
df_implants = m

# convert OPERTN_01 to a string
s = df_implants['OPERTN_01'].apply(str)

# boolean indexing to filter df_implants where OPERTN_01 is equal to ['V011', 'V014']
v011v014 = df_implants[(s == "['V011', 'V014']")]

# boolean indexing to filter df_implants where OPERTN_01 is equal to ['V014', 'V011']
v014v011 = df_implants[(s == "['V014', 'V011']")]

v011v014

       OPERTN_01 OPDATE_01_0 OPDATE_01_1     diff
ID                                               
11  [V011, V014]  2016-06-21  2017-02-27 251 days
13  [V011, V014]  2016-01-14  2016-07-14 182 days

v014v011

       OPERTN_01 OPDATE_01_0 OPDATE_01_1     diff
ID                                               
14  [V014, V011]  2014-02-25  2014-07-01 126 days
15  [V014, V011]  2014-06-26  2015-04-16 294 days
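
Comparing against the string form of the list works, but it depends on the exact quoting and spacing of that string. A possible alternative, sketched here on a cut-down frame, is to compare the lists themselves. The helper has_sequence is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Cut-down stand-in for df_implants with only the list column
df_implants = pd.DataFrame(
    {
        "OPERTN_01": [
            ["V011", "V014"],
            ["V011", "V014"],
            ["V014", "V011"],
            ["V014", "V011"],
        ]
    },
    index=pd.Index([11, 13, 14, 15], name="ID"),
)


def has_sequence(codes, wanted):
    """True when the cell holds exactly the wanted codes, in the same order."""
    return list(codes) == list(wanted)


# Build one boolean mask per code order, then filter with it
is_v011_v014 = df_implants["OPERTN_01"].map(lambda c: has_sequence(c, ["V011", "V014"]))
is_v014_v011 = df_implants["OPERTN_01"].map(lambda c: has_sequence(c, ["V014", "V011"]))

v011v014 = df_implants[is_v011_v014]
v014v011 = df_implants[is_v014_v011]
print(v011v014.index.tolist(), v014v011.index.tolist())
```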
