I have a data frame that looks like this:
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| | Date | Professional | Description |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 0 | 2019-12-19 00:00:00 | Katie Cool | Travel to Space ... |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 1 | 2019-12-20 00:00:00 | Jenn Blossoms | Review stuff; prepare cancellations of ... |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 2 | 2019-12-27 00:00:00 | Jenn Blossoms | Review lots of stuff/o... |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 3 | 2019-12-27 00:00:00 | Jenn Blossoms | Draft email to world leader... |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 4 | 2019-12-30 00:00:00 | Jenn Blossoms | Review this thing. |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
| 5 | 12-30-2019 Jenn Blossoms Telephone Call to A. Bell return her multiple | NaN | NaN |
| | voicemails. | | |
+---+--------------------------------------------------------------------------------------+---------------+--------------------------------------------+
I would like for it to look like this:
+---+---------------------+---------------+-------------------------------------------------------------+
| | Date | Professional | Description |
+---+---------------------+---------------+-------------------------------------------------------------+
| 0 | 2019-12-19 00:00:00 | Katie Cool | Travel to Space ... |
+---+---------------------+---------------+-------------------------------------------------------------+
| 1 | 2019-12-20 00:00:00 | Jenn Blossoms | Review stuff; prepare cancellations of ... |
+---+---------------------+---------------+-------------------------------------------------------------+
| 2 | 2019-12-27 00:00:00 | Jenn Blossoms | Review lots of stuff/o... |
+---+---------------------+---------------+-------------------------------------------------------------+
| 3 | 2019-12-27 00:00:00 | Jenn Blossoms | Draft email to world leader... |
+---+---------------------+---------------+-------------------------------------------------------------+
| 4 | 2019-12-30 00:00:00 | Jenn Blossoms | Review this thing. |
+---+---------------------+---------------+-------------------------------------------------------------+
| 5 | 12-30-2019 | Jenn Blossoms | Telephone Call to A. Bell return her multiple |
| | | | voicemails. |
+---+---------------------+---------------+-------------------------------------------------------------+
@Datanovice provided a great answer when my question was less specific and needed revision.
I have since edited my question and have also tried to edit his code:
s = pd.to_datetime(dftopdata['Date'],errors='coerce').isna()
# gives us the error rows to filter.
# split out our datetime column so we can extract the values.
date_err = (
dftopdata[s]["Date"]
.str.extract("\d{2}-\d{2}-\d{4}\s+(\w+.*)")[0]
.str.split("\s", expand=True)
)
# set your values with `.loc`
dftopdata.loc[s,'Professional'] = date_err[0] + date_err[1]
dftopdata.loc[s,'Description'] = date_err[2]
But when I run the above code, I get a data frame that looks like this:
+---+---------------------+---------------+--------------------------------------------+
| | Date | Professional | Description |
+---+---------------------+---------------+--------------------------------------------+
| 0 | 2019-12-19 00:00:00 | Katie Cool | Travel to Space ... |
+---+---------------------+---------------+--------------------------------------------+
| 1 | 2019-12-20 00:00:00 | Jenn Blossoms | Review stuff; prepare cancellations of ... |
+---+---------------------+---------------+--------------------------------------------+
| 2 | 2019-12-27 00:00:00 | Jenn Blossoms | Review lots of stuff/o... |
+---+---------------------+---------------+--------------------------------------------+
| 3 | 2019-12-27 00:00:00 | Jenn Blossoms | Draft email to world leader... |
+---+---------------------+---------------+--------------------------------------------+
| 4 | 2019-12-30 00:00:00 | Jenn Blossoms | Review this thing. |
+---+---------------------+---------------+--------------------------------------------+
| 5 | 12-30-2019 | JennBlossoms | |
+---+---------------------+---------------+--------------------------------------------+
I also get this warning (pandas' SettingWithCopyWarning): A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Okay, since your errors are consistent, we can use regex and .loc filtering to extract your values.
Unfortunately, I don't see a way to shorten the code here (short of writing functions, but I'm lazy).
import numpy as np
import pandas as pd

# Rows whose 'Date' cannot be parsed are the malformed rows to fix.
s = pd.to_datetime(df['Date'], errors='coerce').isna()
# Split out the text after the date so we can extract the values.
date_err = (
    df[s]["Date"]
    .str.extract(r"\d{2}-\d{2}-\d{4}\s+(\w+.*)")[0]
    .str.split(r"\s", expand=True)
)
# Set your values with `.loc`.
df.loc[s, 'Professional'] = date_err[0]
df.loc[s, 'Description'] = date_err[1]
# Extract the date.
date = df[s]['Date'].str.extract(r'(\d{2}-\d{2}-\d{4})')[0]
df.loc[s, 'Date'] = date
# Set the datetime column.
df['Date'] = pd.to_datetime(df['Date'])
three_err = (
    df[s]["3"].str.extract(r"([^\[A-Za-z]+)")[0].str.strip().str.split(r"\s", expand=True)
)
# Set values and replace '3' with NaN.
df.loc[s, 'Hours'] = three_err[0]
df.loc[s, 'Rate'] = three_err[1]
df.loc[s, 'Amount'] = three_err[2]
df.loc[s, '3'] = np.nan
print(df)
Date Professional Description 1 2 Hours \
1 2019-12-19 KL Sib ad upoketewm NaN NaN 1.9
3 2019-12-20 JB Mo wywcig tjovwip pwos es kib NaN NaN 0.8
5 2019-12-27 JB sop tupherr eq NGINX geflar, ic NaN NaN 0.2
7 2019-12-27 JB zvsyhebig bytwav xip jfiv cuoj NaN NaN 0.1
9 2019-12-30 JB Bwijjykg iq kwic pyu febig NaN NaN 0.1
11 2019-12-30 JB Telephone NaN NaN 0.10
3 4 Rate Amount
1 NaN NaN 200 380
3 NaN NaN 210 168
5 NaN NaN 210 42
7 NaN NaN 210 21
9 NaN NaN 210 21
11 NaN NaN 210.00 21.00
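To make the filtering step concrete, here is a minimal sketch of how the boolean mask picks out the malformed rows. The two-row frame and the name `bad` are made up for illustration and are not the asker's actual data.

```python
import pandas as pd

# Hypothetical sample: the second row has date, name and description fused into 'Date'.
df = pd.DataFrame({
    "Date": [
        "2019-12-19 00:00:00",
        "12-30-2019 Jenn Blossoms Telephone Call to A. Bell",
    ],
    "Professional": ["Katie Cool", None],
    "Description": ["Travel to Space", None],
})

# errors="coerce" turns unparseable values into NaT instead of raising,
# so isna() is True exactly for the rows that need repair.
bad = pd.to_datetime(df["Date"], errors="coerce").isna()
print(bad)
```

Only the rows where the mask is True are touched by the later `.loc` assignments, so rows that already parse as dates are left alone.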
EDIT:
# Run the extraction once; the three groups are date, name and description.
parts = df['Date'].str.extract(r'(\d{2}-\d{2}-\d{4})(\s\w+\s\w+)\s(\w+.*)')
date, name, description = parts[0], parts[1], parts[2]
# Build the mask once, before 'Date' is overwritten.
mask = pd.to_datetime(df['Date'], errors='coerce').isnull()
df.loc[mask, 'Professional'] = name
df.loc[mask, 'Description'] = description
df.loc[mask, 'Date'] = date
print(df)
Date Professional \
1 2019-12-19 00:00:00 Katie Cool
3 2019-12-20 00:00:00 Jenn Blossoms
5 2019-12-27 00:00:00 Jenn Blossoms
7 2019-12-27 00:00:00 Jenn Blossoms
9 2019-12-30 00:00:00 Jenn Blossoms
11 12-30-2019 Jenn Blossoms
Description
1 Travel to Space ...
3 Review stuff; prepare cancellations of ...
5 Review lots of stuff/o...
7 Draft email to world leader...
9 Review this thing.
11 Telephone Call to A. Bell return h...
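For reference, the same idea as a self-contained sketch with named capture groups, so each extracted column is labelled. The sample frame, the names `bad`, `pattern` and `parts`, and the assumption that the professional's name is always exactly two words are mine, not something the question guarantees.

```python
import pandas as pd

# Hypothetical sample mirroring the question: the last row is fused into 'Date'.
df = pd.DataFrame({
    "Date": [
        "2019-12-30 00:00:00",
        "12-30-2019 Jenn Blossoms Telephone Call to A. Bell return her multiple voicemails.",
    ],
    "Professional": ["Jenn Blossoms", None],
    "Description": ["Review this thing.", None],
})

# Rows whose 'Date' does not parse are the ones to repair.
bad = pd.to_datetime(df["Date"], errors="coerce").isna()

# Assumption: the name is exactly two words; everything after it is the description.
pattern = (
    r"(?P<Date>\d{2}-\d{2}-\d{4})\s+"
    r"(?P<Professional>\w+\s\w+)\s+"
    r"(?P<Description>.+)"
)
parts = df.loc[bad, "Date"].str.extract(pattern)

# Assign column by column; each Series aligns on the row index.
for col in parts.columns:
    df.loc[bad, col] = parts[col]

print(df)
```

Names with more or fewer than two words would need a different pattern. If the frame was itself produced by slicing another frame, taking an explicit `.copy()` first is the usual way to avoid the SettingWithCopyWarning quoted in the question; whether that applies here is an assumption, since the question does not show how `dftopdata` was built.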