How do I extract merged data from a cell into its row in a python data frame?

I have a data frame that looks like this:

+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
|   | Date                                                                                 | Professional  | Description                                |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 0 | 2019-12-19 00:00:00                                                                  | Katie Cool    | Travel to Space ...                        |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 1 | 2019-12-20 00:00:00                                                                  | Jenn Blossoms | Review stuff; prepare cancellations of ... |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 2 | 2019-12-27 00:00:00                                                                  | Jenn Blossoms | Review lots of stuff/o...                  |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 3 | 2019-12-27 00:00:00                                                                  | Jenn Blossoms | Draft email to world leader...             |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 4 | 2019-12-30 00:00:00                                                                  | Jenn Blossoms | Review this thing.                         |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 5 | 12-30-2019 Jenn Blossoms Telephone   Call   to   A.   Bell   return   her   multiple | NaN           | NaN                                        |
|   | voicemails.                                                                          |               |                                            |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+

I would like for it to look like this:

+---+---------------------+---------------+-------------------------------------------------------------+
|   | Date                | Professional  | Description                                                 |
+---+---------------------+---------------+-------------------------------------------------------------+
| 0 | 2019-12-19 00:00:00 | Katie Cool    | Travel to Space ...                                         |
+---+---------------------+---------------+-------------------------------------------------------------+
| 1 | 2019-12-20 00:00:00 | Jenn Blossoms | Review stuff; prepare cancellations of ...                  |
+---+---------------------+---------------+-------------------------------------------------------------+
| 2 | 2019-12-27 00:00:00 | Jenn Blossoms | Review lots of stuff/o...                                   |
+---+---------------------+---------------+-------------------------------------------------------------+
| 3 | 2019-12-27 00:00:00 | Jenn Blossoms | Draft email to world leader...                              |
+---+---------------------+---------------+-------------------------------------------------------------+
| 4 | 2019-12-30 00:00:00 | Jenn Blossoms | Review this thing.                                          |
+---+---------------------+---------------+-------------------------------------------------------------+
| 5 | 12-30-2019          | Jenn Blossoms | Telephone   Call   to   A.   Bell   return   her   multiple |
|   |                     |               | voicemails.                                                 |
+---+---------------------+---------------+-------------------------------------------------------------+

@Datanovice provided a great answer when my question was less specific and needed revision.

I have since edited my question and have also tried to edit his code:

s = pd.to_datetime(dftopdata['Date'],errors='coerce').isna() 
# gives us the error rows to filter.

# split out our datetime column so we can extract the values.
date_err = (
    dftopdata[s]["Date"]
    .str.extract("\d{2}-\d{2}-\d{4}\s+(\w+.*)")[0]
    .str.split("\s", expand=True)
)

# set your values with `.loc` 
dftopdata.loc[s,'Professional'] = date_err[0] + date_err[1]
dftopdata.loc[s,'Description'] = date_err[2]  

But when I run the above code, I get a data frame that looks like this:

+---+---------------------+---------------+--------------------------------------------+
|   | Date                | Professional  | Description                                |
+---+---------------------+---------------+--------------------------------------------+
| 0 | 2019-12-19 00:00:00 | Katie Cool    | Travel to Space ...                        |
+---+---------------------+---------------+--------------------------------------------+
| 1 | 2019-12-20 00:00:00 | Jenn Blossoms | Review stuff; prepare cancellations of ... |
+---+---------------------+---------------+--------------------------------------------+
| 2 | 2019-12-27 00:00:00 | Jenn Blossoms | Review lots of stuff/o...                  |
+---+---------------------+---------------+--------------------------------------------+
| 3 | 2019-12-27 00:00:00 | Jenn Blossoms | Draft email to world leader...             |
+---+---------------------+---------------+--------------------------------------------+
| 4 | 2019-12-30 00:00:00 | Jenn Blossoms | Review this thing.                         |
+---+---------------------+---------------+--------------------------------------------+
| 5 | 12-30-2019          | JennBlossoms  |                                            |
+---+---------------------+---------------+--------------------------------------------+

I also get this error: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Okay, since your errors are consistent, we can use regex and .loc filtering to extract your values.

Unfortunately, I don't see a way to shorten the code here (short of writing functions, but I'm lazy).

import numpy as np
import pandas as pd

# Rows whose 'Date' fails to parse are the merged (error) rows.
s = pd.to_datetime(df['Date'], errors='coerce').isna()

# Take everything after the leading date and split it on whitespace.
date_err = (
    df.loc[s, "Date"]
    .str.extract(r"\d{2}-\d{2}-\d{4}\s+(\w+.*)")[0]
    .str.split(r"\s", expand=True)
)

# Set the values with `.loc`.
df.loc[s, 'Professional'] = date_err[0]
df.loc[s, 'Description'] = date_err[1]

# Extract the date itself.
date = df.loc[s, 'Date'].str.extract(r'(\d{2}-\d{2}-\d{4})')[0]
df.loc[s, 'Date'] = date
# Convert the column to datetime.
df['Date'] = pd.to_datetime(df['Date'])

# Pull the leading non-letter run out of column '3' and split it into fields.
three_err = (
    df.loc[s, "3"]
    .str.extract(r"([^\[A-Za-z]+)")[0]
    .str.strip()
    .str.split(r"\s", expand=True)
)

# Set the values and replace '3' with NaN.
df.loc[s, 'Hours'] = three_err[0]
df.loc[s, 'Rate'] = three_err[1]
df.loc[s, 'Amount'] = three_err[2]
df.loc[s, '3'] = np.nan

print(df)

         Date Professional                      Description    1    2 Hours  \
1  2019-12-19           KL                 Sib ad upoketewm  NaN  NaN   1.9   
3  2019-12-20           JB    Mo wywcig tjovwip pwos es kib  NaN  NaN   0.8   
5  2019-12-27           JB  sop tupherr eq NGINX geflar, ic  NaN  NaN   0.2   
7  2019-12-27           JB   zvsyhebig bytwav xip jfiv cuoj  NaN  NaN   0.1   
9  2019-12-30           JB       Bwijjykg iq kwic pyu febig  NaN  NaN   0.1   
11 2019-12-30           JB                        Telephone  NaN  NaN  0.10   

      3    4    Rate Amount  
1   NaN  NaN     200    380  
3   NaN  NaN     210    168  
5   NaN  NaN     210     42  
7   NaN  NaN     210     21  
9   NaN  NaN     210     21  
11  NaN  NaN  210.00  21.00  
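
Note that the split-based approach above assigns one whitespace-separated token per column, which is why the last row's Description holds only "Telephone". As a rough sketch (the sample string is a shortened, hypothetical copy of the merged cell, and it assumes the name is always exactly two words), limiting the number of splits with the `n` argument of `str.split` keeps the rest of the text together:

```python
import pandas as pd

# Hypothetical sample mirroring the merged cell from the question (shortened).
merged = pd.Series(
    ["12-30-2019 Jenn Blossoms Telephone   Call   to   A.   Bell"]
)

# Drop the leading date, keeping everything after it.
rest = merged.str.extract(r"\d{2}-\d{2}-\d{4}\s+(\w+.*)")[0]

# Splitting on every whitespace character gives one token per column,
# so column 2 holds only the first word of the description.
tokens = rest.str.split(r"\s", expand=True)
first_word = tokens.loc[0, 2]

# With no pattern and n=2, pandas splits on runs of whitespace at most
# twice, so column 2 keeps the remainder of the text intact.
limited = rest.str.split(n=2, expand=True)
description = limited.loc[0, 2]

# Assumption: the name is the first two words; join them with a space.
name = limited[0] + " " + limited[1]
```

Here `first_word` is just the first token, while `description` carries the full remainder, and joining with `" "` avoids the "JennBlossoms" result that comes from adding the two columns directly.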

EDIT:

# Run the regex once: column 0 = date, column 1 = name, column 2 = description.
# The separating whitespace sits outside the groups so the name has no leading space.
parts = df['Date'].str.extract(r'(\d{2}-\d{2}-\d{4})\s(\w+\s\w+)\s(\w+.*)')

# Rows whose 'Date' does not parse as a date are the merged ones.
bad = pd.to_datetime(df['Date'], errors='coerce').isnull()

df.loc[bad, 'Professional'] = parts[1]
df.loc[bad, 'Description'] = parts[2]
df.loc[bad, 'Date'] = parts[0]

print(df)


     Date    Professional  \
1   2019-12-19 00:00:00      Katie Cool   
3   2019-12-20 00:00:00   Jenn Blossoms   
5   2019-12-27 00:00:00   Jenn Blossoms   
7   2019-12-27 00:00:00   Jenn Blossoms   
9   2019-12-30 00:00:00   Jenn Blossoms   
11           12-30-2019   Jenn Blossoms   

                                          Description  
1                                 Travel to Space ...  
3          Review stuff; prepare cancellations of ...  
5                           Review lots of stuff/o...  
7                      Draft email to world leader...  
9                                  Review this thing.  
11  Telephone   Call   to   A.   Bell   return   h...  
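
For reference, the steps above can be put together into a self-contained sketch. The sample frame and the helper name `split_merged_rows` are hypothetical, and the regex assumes the name is exactly two words. Working on an explicit `.copy()` is the usual way to avoid the "A value is trying to be set on a copy of a slice" warning, which typically appears when the frame being modified was itself produced by slicing another DataFrame.

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame (shortened).
df = pd.DataFrame(
    {
        "Date": [
            "2019-12-19 00:00:00",
            "2019-12-30 00:00:00",
            "12-30-2019 Jenn Blossoms Telephone   Call   to   A.   Bell",
        ],
        "Professional": ["Katie Cool", "Jenn Blossoms", None],
        "Description": ["Travel to Space ...", "Review this thing.", None],
    }
)


def split_merged_rows(frame):
    """Move name and description out of merged 'Date' cells (sketch)."""
    out = frame.copy()  # modify a copy, not a slice of another frame

    # Column 0 = date, 1 = two-word name (assumption), 2 = remaining text.
    parts = out["Date"].str.extract(
        r"^(\d{2}-\d{2}-\d{4})\s+(\w+\s+\w+)\s+(.+)$"
    )
    merged = parts[0].notna()  # True only where the pattern matched

    out.loc[merged, "Professional"] = parts.loc[merged, 1]
    out.loc[merged, "Description"] = parts.loc[merged, 2]

    # Parse the two date formats separately, then combine them.
    dates = pd.to_datetime(out["Date"].where(~merged), errors="coerce")
    dates.loc[merged] = pd.to_datetime(parts.loc[merged, 0], format="%m-%d-%Y")
    out["Date"] = dates
    return out


result = split_merged_rows(df)
print(result)
```

Matching on the regex rather than on a failed date parse keeps the row filter and the extraction tied to the same pattern; either mask should select the same rows for data shaped like the example.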
