In Python or R, I want a more efficient way to split the text in a column into four columns

I have a column named BREADS with 5 rows. I want to split its values into 4 columns, namely B, REA, D and S.


BREADS
>2319-22-<21
>1513-16-<19
>1319-25-<22
>1617-21-<25
>1011-15-<17

Desired outcome


B, REA , D, S    ### column names
>23 , 19-22 , - , <21
>15 , 13-16 , - , <19
>13 , 19-25 , - , <22
>16 , 17-21 , - , <25
>10 , 11-15 , - , <17

# Key: > greater than and < less than, - hyphen in the column 'D'

My attempt

###### in python
# for column 'B'
df['B'] = df['BREADS'].astype(str).str[0:3]   # returns '>23','>15',.....,'>10'


#### in R 

library(stringr)
str_split_fixed(df$BREADS, "", 2)

An option with extract from tidyr in R

library(dplyr)
library(tidyr)
df1 %>% 
 extract(BREADS, into = c('B', 'REA', 'D', 'S'),
        '^(\\>..)(\\d{2}-\\d{2})(-)(.*)')

-output

#    B   REA D   S
#1 >23 19-22 - <21
#2 >15 13-16 - <19
#3 >13 19-25 - <22
#4 >16 17-21 - <25
#5 >10 11-15 - <17

data

df1 <- structure(list(BREADS = c(">2319-22-<21", ">1513-16-<19", ">1319-25-<22", 
">1617-21-<25", ">1011-15-<17")), class = "data.frame", row.names = c(NA, 
-5L))

For Python:

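# (start, stop) character offsets of each field within a BREADS value
d = {'B': (0, 3), 'REA': (3, 8), 'D': (8, 9), 'S': (9, 20)}
for i in d:
    # slice each string by the offsets for column i
    df[i] = df['BREADS'].apply(lambda x: x[d[i][0]:d[i][1]])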
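
The same fixed-width idea can be sketched with the vectorized `.str.slice` accessor instead of `apply`. This is only a minimal sketch: it rebuilds the sample data from the question, the name `bounds` is made up for illustration, and it assumes every value has the same fixed layout as the five rows shown.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"BREADS": [">2319-22-<21", ">1513-16-<19", ">1319-25-<22",
                              ">1617-21-<25", ">1011-15-<17"]})

# Hypothetical helper mapping: column name -> (start, stop) offsets.
# A stop of None means "slice to the end of the string".
bounds = {"B": (0, 3), "REA": (3, 8), "D": (8, 9), "S": (9, None)}

for col, (start, stop) in bounds.items():
    # .str.slice works on the whole column at once, no Python-level lambda
    df[col] = df["BREADS"].str.slice(start, stop)

print(df[["B", "REA", "D", "S"]])
```

If the field widths ever vary between rows, positional slicing silently produces wrong pieces, so a regex-based split is the safer choice in that case.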

You can use pandas str.extract to pull the data into separate columns; this assumes every row follows the same layout:

pattern = r"(?P<B>>.{2})(?P<REA>.{2}-.{2})(?P<D>-)(?P<S><.{2})"

df.BREADS.str.extract(pattern)

     B    REA  D    S
0  >23  19-22  -  <21
1  >15  13-16  -  <19
2  >13  19-25  -  <22
3  >16  17-21  -  <25
4  >10  11-15  -  <17
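
Since the approach above relies on uniform rows, it may help to check that assumption. `str.extract` returns NaN for rows that do not match, so those rows can be flagged afterwards. The sketch below is an illustration only: it uses a stricter, anchored variant of the pattern with `\d` instead of `.`, and the third value (`"bad value"`) is an invented row added to show the non-matching case.

```python
import pandas as pd

# Two rows from the question plus one invented malformed row.
s = pd.Series([">2319-22-<21", ">1513-16-<19", "bad value"], name="BREADS")

# Anchored pattern that only accepts digits in the numeric positions.
pattern = r"^(?P<B>>\d{2})(?P<REA>\d{2}-\d{2})(?P<D>-)(?P<S><\d{2})$"

parts = s.str.extract(pattern)

# Rows where the pattern did not match come back as NaN in every group.
unmatched = s[parts["B"].isna()]

print(parts)
print(unmatched)
```

Inspecting `unmatched` before joining `parts` back onto the original frame makes it easier to spot rows that need a different pattern.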
