What is the most efficient way of extracting some numbers from a data point in R? (Plus other specific steps!)
I've got quite a specific problem, for which I can just about find a very hacky solution, but I'm hoping somebody could outline a slightly more elegant method.
I have a CSV file consisting of one row per historical football match played. The fields I care about look something like this:
home_team <- c("Team A", "Team B", "Team B")
away_team <- c("Team C", "Team C", "Team D")
home_goals <- c(2, 0, 1)
away_goals <- c(1, 2, 0)
home_goal_mins <- c("5 60", "NA", "80")
away_goal_mins <- c("15", "20 40", "NA")
df <- data.frame(home_team, away_team, home_goals, away_goals, home_goal_mins, away_goal_mins,
                 stringsAsFactors = FALSE)
df
#>   home_team away_team home_goals away_goals home_goal_mins away_goal_mins
#> 1    Team A    Team C          2          1           5 60             15
#> 2    Team B    Team C          0          2             NA          20 40
#> 3    Team B    Team D          1          0             80             NA
Created on 2020-10-05 by the reprex package (v0.3.0)
My goal is to transform this dataframe so that there is one row per goal scored, per game.
The main challenges, as I see them:
1. The *_goal_mins fields are read in as strings containing both numbers and NAs.
2. Each match row has to be expanded into one row per goal scored.
With regards to (1), I've been using stringr::str_split(., " ") to extract the numbers, but then struggle to transform them into a numeric vector. Taking the first row of df as an example, I'm struggling to transform "5 60" into c(5, 60), and it gets harder when I try to combine the home team's "5 60" with the away team's "15" to get the full goal sequence of c(5, 15, 60).
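A minimal base R sketch of the string-to-numeric step described above. The helper name parse_mins is hypothetical (it is not part of the original post), and the sketch assumes that a match with no goals is stored either as the literal string "NA" or as a real missing value.

```r
# Hypothetical helper: turn a space-separated string of minutes into a
# numeric vector. Returns an empty vector for "NA" or a missing value.
parse_mins <- function(x) {
  if (is.na(x) || x == "NA") return(numeric(0))
  as.numeric(strsplit(x, " ", fixed = TRUE)[[1]])
}

home <- parse_mins("5 60")  # home team's goal minutes
away <- parse_mins("15")    # away team's goal minutes

# Combine both sides and sort to get the full goal sequence for the match
all_goals <- sort(c(home, away))
all_goals
#> [1]  5 15 60
```

strsplit() returns a list with one element per input string, which is why the [[1]] is needed before as.numeric(); stringr::str_split() behaves the same way.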
As for (2), my current approach is to calculate the total_goals_scored per match, and do the following:
expanded_df <- df[rep(seq_len(dim(df)[1]),
df$total_goals_scored), ]
but I sense that there may be a better method.
Any help or tips will be appreciated! Thanks
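For reference, the row-expansion approach from (2) can be sketched end to end as below. The post does not show how total_goals_scored is computed, so the sum of home_goals and away_goals used here is an assumption.

```r
# Same matches as in the question, reduced to the columns needed here
df <- data.frame(
  home_team  = c("Team A", "Team B", "Team B"),
  away_team  = c("Team C", "Team C", "Team D"),
  home_goals = c(2, 0, 1),
  away_goals = c(1, 2, 0),
  stringsAsFactors = FALSE
)

# Assumption: total goals per match is home goals plus away goals
df$total_goals_scored <- df$home_goals + df$away_goals

# Repeat each row index once per goal, then subset by those indices
expanded_df <- df[rep(seq_len(nrow(df)), times = df$total_goals_scored), ]
nrow(expanded_df)
#> [1] 6
```

tidyr::uncount(df, total_goals_scored, .remove = FALSE) should give the same expansion in one call, if tidyr is already loaded.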
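Using the dplyr and tidyr libraries you could do the following:
1. Get home_goal_mins and away_goal_mins in the same column using pivot_longer.
2. Split the values on spaces into separate rows using separate_rows.
3. Remove NA values.
4. arrange the data based on the goal minute.
5. Add a row number within each match and get the data back into wide format using pivot_wider.
library(dplyr)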
library(tidyr)
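df %>%
  # stack the two minute columns into name/value pairs
  pivot_longer(cols = c(home_goal_mins, away_goal_mins)) %>%
  # one row per goal; convert = TRUE turns the strings (including "NA") into integers/NA
  separate_rows(value, sep = ' ', convert = TRUE) %>%
  filter(!is.na(value)) %>%
  arrange(home_team, away_team, value) %>%
  group_by(home_team, away_team) %>%
  # number the goals within each match so pivot_wider keeps one row per goal
  mutate(row = row_number()) %>%
  pivot_wider()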
#   home_team away_team home_goals away_goals   row home_goal_mins away_goal_mins
#   <chr>     <chr>          <dbl>      <dbl> <int>          <int>          <int>
# 1 Team A    Team C             2          1     1              5             NA
# 2 Team A    Team C             2          1     2             NA             15
# 3 Team A    Team C             2          1     3             60             NA
# 4 Team B    Team C             0          2     1             NA             20
# 5 Team B    Team C             0          2     2             NA             40
# 6 Team B    Team D             1          0     1             80             NA
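If a long format with an explicit scorer is more convenient than the wide output above, the same pipeline can stop before pivot_wider. This is only a sketch of one possible variation: the column names side, minute and scoring_team are assumptions introduced here, not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  home_team      = c("Team A", "Team B", "Team B"),
  away_team      = c("Team C", "Team C", "Team D"),
  home_goal_mins = c("5 60", "NA", "80"),
  away_goal_mins = c("15", "20 40", "NA"),
  stringsAsFactors = FALSE
)

goals <- df %>%
  # keep the source column name so we know which side scored
  pivot_longer(cols = c(home_goal_mins, away_goal_mins),
               names_to = "side", values_to = "minute") %>%
  # one row per goal; "NA" strings become real NA values
  separate_rows(minute, sep = " ", convert = TRUE) %>%
  filter(!is.na(minute)) %>%
  # hypothetical column: name of the team that scored
  mutate(scoring_team = if_else(side == "home_goal_mins", home_team, away_team)) %>%
  arrange(home_team, away_team, minute)

goals
```

Each row of goals is then a single goal, with minute holding the time and scoring_team naming the team that scored it.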