I have a dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "Qty": [1,2,2,4,5,4,3],
        "Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-22'],
        "Item": ['22-A', 'R-22-A', '33-CDE', 'R-33-CDE', '55-A', '22-AB', '55-AB'],
        "Price": [1.1, 2.2, 2.2, 4.4, 5.5, 4.4, 3.3]
    })
I'm trying to duplicate each row where the Item suffix has 2 or more characters, and then change the value of the Item. For example, the row containing '22-AB' will become two rows. In the first row the Item will be '22-A', and in the 2nd it will be '22-B'. All this should be done only if the item number (without suffix) is in a 'clean' list.
Here is the pseudocode for what I'm trying to achieve:
Clean list of items = ['11', '22', '33']
For each row, check if substring of df["Item"] is in clean list.
    if no:
        skip row and leave it as it is
    if yes:
        check if len(suffix) >= 2
        if no:
            skip row and leave it as it is
        if yes:
            separate the item (11, 22, or 33) and the suffix
            for char in suffix:
                newitem = concat item + char
                duplicate the row, replacing the old item with newitem
                if number started with R-, prepend the R- again
The desired output:
df2 = pd.DataFrame(
    {
        "Qty": [1,2,2,2,2,4,4,4,5,4,4,3,3],
        "Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-18', '2020-12-18', '2020-12-19', '2020-12-19', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-21', '2020-12-22', '2020-12-22'],
        "Item": ['22-A', 'R-22-A', '33-C', '33-D', '33-E', 'R-33-C', 'R-33-D', 'R-33-E', '55-A', '22-A', '22-B', '55-A', '55-B'],
        "Price": [1.1, 2.2, 2.2, 2.2, 2.2, 4.4, 4.4, 4.4, 5.5, 4.4, 4.4, 3.3, 3.3]
    })
What I have come up with so far:
import re

mains = ['11', '22', '33']
iptrn = re.compile(r'\d{2}')           # two-digit item number
optrn = re.compile(r'(?<=[0-9]-).*')   # suffix that follows "<digit>-"
for i in df["Item"]:
    item = iptrn.search(i).group(0)
    option = optrn.search(i).group(0)
    if item in mains:
        for o in option:
            combo = item + "-" + o
            print(combo)
I can't figure out the last step of actually duplicating the row. I've tried this: df = df.loc[df.index.repeat(1)].assign(Item=combo, num=len(option)-1).reset_index(drop=True), but it doesn't replace the Item correctly.
You can use pandas operations to do the work here.
It seems like the first step is to separate the two parts of the item code with pandas string methods (here, use extract with expand=True):
>>> item_code = df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True)
>>> item_code
    ic1  ic2
0    22    A
1  R-22    A
2    33  CDE
3  R-33  CDE
4    55    A
5    22   AB
6    55   AB
You can add these columns directly to df - I just included that snippet above to show you the output from the extract operation.
>>> df = df.join(df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True))
>>> df
   Qty        Date      Item  Price   ic1  ic2
0    1  2020-12-16      22-A    1.1    22    A
1    2  2020-12-17    R-22-A    2.2  R-22    A
2    2  2020-12-18    33-CDE    2.2    33  CDE
3    4  2020-12-19  R-33-CDE    4.4  R-33  CDE
4    5  2020-12-20      55-A    5.5    55    A
5    4  2020-12-21     22-AB    4.4    22   AB
6    3  2020-12-22     55-AB    3.3    55   AB
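If the 'clean' list from the question should also be enforced, one possible way to flag the rows that qualify for splitting is sketched here. The names `mains`, `number`, and `splittable` are illustrative only, and stripping a leading `R-` before the lookup is an assumption about how the item codes are built. Note that the desired output in the question splits 55-AB even though 55 is not in the list, so it is not clear which behavior is actually wanted.

```python
import pandas as pd

df = pd.DataFrame(
    {"Item": ['22-A', 'R-22-A', '33-CDE', 'R-33-CDE', '55-A', '22-AB', '55-AB']}
)
mains = ['11', '22', '33']  # the 'clean' list from the question

# Same split as above: ic1 is the (optionally R-prefixed) number, ic2 the suffix
parts = df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True)

# Assumption: the item number is ic1 with any leading "R-" removed
number = parts['ic1'].str.replace(r'^R-', '', regex=True)

# A row qualifies when its number is clean and its suffix has 2+ characters
splittable = number.isin(mains) & (parts['ic2'].str.len() >= 2)
print(df[splittable])
```

A mask like this could be used to send only the qualifying rows through the splitting step and leave the rest untouched.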
Next, I would build up a Python data structure and convert it to a dataframe at the end, rather than trying to insert rows or change existing rows.
data = []
for row in df.itertuples(index=False):
    # one output row per character of the suffix
    for character in row.ic2:
        data.append({
            'Date': row.Date,
            'Qty': row.Qty,
            'Price': row.Price,
            'Item': f'{row.ic1}-{character}'
        })
newdf = pd.DataFrame(data)
The new dataframe looks like this:
>>> newdf
          Date  Qty  Price    Item
0   2020-12-16    1    1.1    22-A
1   2020-12-17    2    2.2  R-22-A
2   2020-12-18    2    2.2    33-C
3   2020-12-18    2    2.2    33-D
4   2020-12-18    2    2.2    33-E
5   2020-12-19    4    4.4  R-33-C
6   2020-12-19    4    4.4  R-33-D
7   2020-12-19    4    4.4  R-33-E
8   2020-12-20    5    5.5    55-A
9   2020-12-21    4    4.4    22-A
10  2020-12-21    4    4.4    22-B
11  2020-12-22    3    3.3    55-A
12  2020-12-22    3    3.3    55-B
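As an alternative to the explicit loop, the same result can probably be reached with `DataFrame.explode` (available in pandas 0.25 and later). This is only a sketch under the assumption that every Item matches the pattern; `exploded` and `newdf2` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Qty": [1, 2, 2, 4, 5, 4, 3],
        "Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-19',
                 '2020-12-20', '2020-12-21', '2020-12-22'],
        "Item": ['22-A', 'R-22-A', '33-CDE', 'R-33-CDE', '55-A', '22-AB', '55-AB'],
        "Price": [1.1, 2.2, 2.2, 4.4, 5.5, 4.4, 3.3],
    }
)

parts = df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True)

# Turn each suffix into a list of characters, then give each character its own row
exploded = df.assign(ic1=parts['ic1'], ic2=parts['ic2'].apply(list)).explode('ic2')

# Rebuild the item code from the prefix and the single suffix character
exploded['Item'] = exploded['ic1'] + '-' + exploded['ic2']
newdf2 = exploded.drop(columns=['ic1', 'ic2']).reset_index(drop=True)
print(newdf2)
```

Unlike the loop, this keeps the original column order (Qty, Date, Item, Price) because the rows are repeated in place rather than rebuilt from dictionaries.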