
Separate character variable into two columns

I have scraped some data from a URL to analyse cycling results. Unfortunately, the name column contains both the rider's name and the team name in a single field. I would like to separate the two. Here's the code (the last part doesn't work):

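#load packages
library(rvest)  # read_html(), html_nodes(), html_text(); also provides %>%
library(tidyr)  # separate()

#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")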

#scrape table 
results_2020 <- stradebianchi_2020%>%
  html_nodes("td")%>%
  html_text()

#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = FALSE)))

#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")

#split rider from team

separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")

I think the best option is to take the value of the team column and remove that text from the 'name' column.

All suggestions are welcome!

I think the question is framed the wrong way round. You don't need to split name on a separator; you want to remove team from name.

That's how you should do it in my opinion:

results_stradebianchi_2020 %>% 
    mutate(name = stringr::str_remove(name, team))

Use this instead of your line with separate, and assign the result back if you want to keep it.

In this case separate is not a good fit because there is no clearly defined separator character between the rider and the team.

I would also advise removing the leading blanks from name with stringr::str_trim(name).
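One thing to watch for: str_remove treats its pattern as a regular expression, so a team name containing characters such as ( or . may not match literally. Below is a minimal sketch on a toy data frame that wraps the pattern in stringr::fixed() and trims in the same step. The assumption here is that the scraped name field is the rider followed directly by the team; the third row ("Rider Example", "Team A.B (Dev)") is made up purely to show the metacharacter case.

```r
library(stringr)

# Toy rows mimicking the assumed scraped layout; the third row is hypothetical
toy <- data.frame(
  name = c(" van Aert WoutTeam Jumbo-Visma",
           " Formolo DavideUAE-Team Emirates",
           " Rider ExampleTeam A.B (Dev)"),
  team = c("Team Jumbo-Visma", "UAE-Team Emirates", "Team A.B (Dev)"),
  stringsAsFactors = FALSE
)

# fixed() makes each team name match literally instead of as a regex;
# str_trim() then drops the leading/trailing blanks
toy$name <- str_trim(str_remove(toy$name, fixed(toy$team)))
toy$name
```

For the team names shown in this race the plain regex version works too; fixed() just removes the dependence on what characters a team name happens to contain.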

You could do this in base R with gsub, replacing the pattern taken from the team column with "" (i.e. nothing) in the name column. We use apply() with MARGIN=1 to go through the data frame row by row. Finally, we use trimws to strip the surrounding whitespace; setting whitespace="[\\h\\v]" matches any horizontal or vertical whitespace character, which catches more kinds of space than the default.

res <- transform(results_stradebianchi_2020,
                 name=trimws(apply(results_stradebianchi_2020, 1, function(x) 
                   gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
#   rank  X.                  name age                    team UCI.point PCS.points           time
# 1    1 201         van Aert Wout  25        Team Jumbo-Visma       300        200 4:58:564:58:56
# 2    2 234        Formolo Davide  27       UAE-Team Emirates       250        150       0:300:30
# 3    3  87 Schachmann Maximilian  26        BORA - hansgrohe       215        120       0:320:32
# 4    4 111       Bettiol Alberto  26          EF Pro Cycling       175        100       1:311:31
# 5    5  44        Fuglsang Jakob  35         Astana Pro Team       120         90       2:552:55
# 6    6   7         Štybar Zdenek  34 Deceuninck - Quick Step       115         80       3:593:59
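As a variation on the same idea, here is a small base R sketch that pairs each name with its team via mapply() instead of apply(), and passes fixed = TRUE so the team is matched as literal text. The helper name strip_team and the toy rows are assumptions for illustration (the last row is made up), and the layout of the name field (rider followed directly by team) is assumed as well.

```r
# Toy rows in the assumed scraped layout; the third row is hypothetical
toy <- data.frame(
  name = c(" Bettiol AlbertoEF Pro Cycling",
           " Schachmann MaximilianBORA - hansgrohe",
           " Rider ExampleTeam A.B (Dev)"),
  team = c("EF Pro Cycling", "BORA - hansgrohe", "Team A.B (Dev)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove the first literal occurrence of `team`
# from `name`, then trim horizontal/vertical whitespace at both ends
strip_team <- function(name, team) {
  trimws(sub(team, "", name, fixed = TRUE), whitespace = "[\\h\\v]")
}

# mapply() walks the two columns in parallel, one row at a time
toy$name <- mapply(strip_team, toy$name, toy$team, USE.NAMES = FALSE)
toy$name
```

Working on the two columns directly avoids converting the whole data frame to a character matrix, which is what apply() does internally.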
