I have a dataframe
df = pd.DataFrame([["A","a*k,x*k,z,c*m,r,s,f*f,e*d"], ["B","h*t,y,a,w*b,Z,c*b,i*t,f*f"]], columns=["id","c1"])
I want to split the column c1 separated by a comma on the below condition:
Example: for 1st row in c1,
Expected Output:
df_out = pd.DataFrame([["A","a*k,x*k,z,c*m,r,s,f*f,e*d","a,c,f","k,m,f"], ["B","h*t,y,a,w*b,Z,c*b,i*t,f*f","h,w,f","t,b,f"]], columns=["id","c1","c2","c3"])
How to do it?
You can use pd.Series.str.extractall with GroupBy.apply to drop duplicates and keep the first 3 strings:
out = df["c1"].str.extractall(r"(.)\*(.)").groupby(level=0)
df[["c2", "c3"]] = out.apply(
    lambda x: x.drop_duplicates(subset=1).head(3).agg(",".join)
)
# df
  id                          c1     c2     c3
0  A  a*k,x*k,z,c*m,r,s,f*f,e*d  a,c,f  k,m,f
1  B  h*t,y,a,w*b,Z,c*b,i*t,f*f  h,w,f  t,b,f
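To see what the extractall step hands to the groupby, it can help to inspect the intermediate frame for a single row. The following is only a sketch that rebuilds the question's df; the names pairs, first_row, kept, left and right are illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "a*k,x*k,z,c*m,r,s,f*f,e*d"],
     ["B", "h*t,y,a,w*b,Z,c*b,i*t,f*f"]],
    columns=["id", "c1"],
)

# One row per "x*y" match; columns 0 and 1 hold the two captured characters.
# The index is a MultiIndex: (original row label, match number).
pairs = df["c1"].str.extractall(r"(.)\*(.)")

# Matches that belong to the first source row (label 0).
first_row = pairs.loc[0]
print(first_row)

# Keep the first occurrence of each right-hand value, then the first three pairs.
kept = first_row.drop_duplicates(subset=[1]).head(3)
left, right = ",".join(kept[0]), ",".join(kept[1])
print(left, right)
```

Note that (.) captures exactly one character on each side of the asterisk. If the real data had longer tokens, a broader pattern such as ([^,*]+)\*([^,*]+) would probably be needed; that is an assumption, since the question only shows single-character tokens.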
First define a function to generate 2 new columns:
def newCols(lst):
    return pd.Series(filter(lambda tt: tt.find('*') >= 0, lst))\
        .str.split('*', expand=True)\
        .rename(columns={0: 'c2', 1: 'c3'})\
        .drop_duplicates(subset='c3').iloc[:3]\
        .apply(lambda col: ','.join(col))
Then generate the result as:
result = df.join(df.c1.str.split(',').apply(newCols))
The result is:
  id                          c1     c2     c3
0  A  a*k,x*k,z,c*m,r,s,f*f,e*d  a,c,f  k,m,f
1  B  h*t,y,a,w*b,Z,c*b,i*t,f*f  h,w,f  t,b,f
Steps of processing in newCols:
- pd.Series(filter(lambda tt: tt.find('*') >= 0, lst)) - Create a Series from the elements containing an asterisk.
- str.split('*', expand=True) - Convert it to a DataFrame.
- rename(columns={0: 'c2', 1: 'c3'}) - Rename the columns to 'c2' and 'c3'.
- drop_duplicates(subset='c3') - Remove duplicate rows (those with the same c3).
- iloc[:3] - Take only the first 3 rows.
- apply(lambda col: ','.join(col)) - Join each column into a string.
Try executing them as "increasingly expanding code", adding one step to the chain on each run, on:
lst = ['a*k', 'x*k', 'z', 'c*m', 'r', 's', 'f*f', 'e*d']
(the result of splitting c1 from the first source row on commas).
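A possible way to run the steps one at a time is sketched below. The intermediate names s1 to s6 are illustrative only, and the filter is wrapped in list() purely so the first step is easy to inspect; neither is part of the answer's function.

```python
import pandas as pd

lst = ['a*k', 'x*k', 'z', 'c*m', 'r', 's', 'f*f', 'e*d']

# Step 1: keep only the elements containing an asterisk.
s1 = pd.Series(list(filter(lambda tt: tt.find('*') >= 0, lst)))

# Step 2: split on '*' into a two-column DataFrame (columns 0 and 1).
s2 = s1.str.split('*', expand=True)

# Step 3: give the columns their final names.
s3 = s2.rename(columns={0: 'c2', 1: 'c3'})

# Step 4: drop rows whose c3 value was already seen (keeps the first one).
s4 = s3.drop_duplicates(subset='c3')

# Step 5: keep only the first three remaining rows.
s5 = s4.iloc[:3]

# Step 6: join each column into one comma-separated string.
s6 = s5.apply(lambda col: ','.join(col))

for name, step in [('s1', s1), ('s2', s2), ('s3', s3),
                   ('s4', s4), ('s5', s5), ('s6', s6)]:
    print(name, step, sep='\n', end='\n\n')
```

Printing each intermediate shows where 'x*k' disappears (step 4, because its c3 value 'k' duplicates that of 'a*k') and where 'e*d' is cut off (step 5).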