
What is the most efficient way of extracting some numbers from a data point in R? (Plus other specific steps!)

I've got quite a specific problem, for which I can just about find a very hacky solution, but I'm hoping somebody could outline a slightly more elegant method.

I have a CSV file with one row per historical football match played. The fields I care about look something like this:

home_team <- c("Team A", "Team B", "Team B")
away_team <- c("Team C", "Team C", "Team D")
home_goals <- c(2, 0, 1)
away_goals <- c(1, 2, 0)
home_goal_mins <- c("5 60", "NA", "80")
away_goal_mins <- c("15", "20 40", "NA")

df <- data.frame(home_team, away_team, home_goals, away_goals, home_goal_mins, away_goal_mins,
                 stringsAsFactors = FALSE)

df
#>   home_team away_team home_goals away_goals home_goal_mins away_goal_mins
#> 1    Team A    Team C          2          1           5 60             15
#> 2    Team B    Team C          0          2             NA          20 40
#> 3    Team B    Team D          1          0             80             NA

Created on 2020-10-05 by the reprex package (v0.3.0)

My goal is to transform this dataframe so that there is one row per goal scored, per game.

[image: desired output, one row per goal per match]

The main challenges, as I see them:

  1. The *_goal_mins fields are read in as strings containing both numbers and NAs
  2. Replicating the rows such that each Home/Away team combination has the same number of rows as the total number of goals for that match

With regards to (1), I've been using stringr::str_split(., " ") to extract the numbers, but then struggle to transform them into a numeric vector. Taking the first row of df as an example, I'm struggling to transform "5 60" into c(5, 60), and it gets harder when I try to combine the home team's "5 60" with the away team's "15" to get the full goal sequence of c(5, 15, 60).

As for (2), my current approach is to calculate the total_goals_scored per match, and do the following:
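One possible way to handle the string-to-numeric step in base R is sketched below. The helper name `parse_mins` is hypothetical (it is not from the original post), and the sketch assumes the minutes are separated by single spaces and that "no goals" is stored as the literal string "NA", as in the sample data.

```r
# Hypothetical helper: parse a space-separated string of goal minutes
# into a numeric vector. A missing value or the literal string "NA"
# (no goals) yields an empty numeric vector.
parse_mins <- function(x) {
  if (is.na(x) || x == "NA") return(numeric(0))
  as.numeric(strsplit(x, " ", fixed = TRUE)[[1]])
}

home <- parse_mins("5 60")  # home team's goal minutes
away <- parse_mins("15")    # away team's goal minutes

# Merge both teams' minutes into one chronological sequence
goal_sequence <- sort(c(home, away))
goal_sequence
#> [1]  5 15 60
```

`strsplit()` returns a list with one element per input string, which is why `[[1]]` is needed before `as.numeric()`; `stringr::str_split()` behaves the same way in that respect.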

expanded_df <- df[rep(seq_len(dim(df)[1]),
                      df$total_goals_scored), ]

but I sense that there may be a better method.

Any help or tips will be appreciated! Thanks

Using the dplyr and tidyr libraries, you could:
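For reference, here is a self-contained sketch of the row-replication idea in base R. It assumes `total_goals_scored` is simply `home_goals + away_goals`, which the post implies but does not show, and it uses only the columns needed for the illustration.

```r
# Sample data reduced to the columns needed for this illustration
df <- data.frame(
  home_team  = c("Team A", "Team B", "Team B"),
  away_team  = c("Team C", "Team C", "Team D"),
  home_goals = c(2, 0, 1),
  away_goals = c(1, 2, 0),
  stringsAsFactors = FALSE
)

# Assumption: total goals per match is the sum of both teams' goals
df$total_goals_scored <- df$home_goals + df$away_goals

# Repeat each row index once per goal scored in that match
expanded_df <- df[rep(seq_len(nrow(df)), times = df$total_goals_scored), ]
rownames(expanded_df) <- NULL  # discard the duplicated row names

nrow(expanded_df)
```

If tidyr is available, `tidyr::uncount(df, total_goals_scored)` expresses the same replication in a single call; note that by default it drops the weights column from the result.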

  1. Bring home_goal_mins and away_goal_mins into the same column using pivot_longer.
  2. Split the data on whitespace and put each goal in its own row.
  3. Drop NA values.
  4. Arrange the data by goal minute.
  5. Get the data back in wide format.
library(dplyr)
library(tidyr)

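df %>%
  # stack both *_goal_mins columns into name/value pairs
  pivot_longer(cols = c(home_goal_mins, away_goal_mins)) %>%
  # one row per goal; convert = TRUE turns the strings into integers ("NA" becomes NA)
  separate_rows(value, sep = ' ', convert = TRUE) %>%
  # drop the rows for teams that did not score
  filter(!is.na(value)) %>%
  # order goals chronologically within each match
  arrange(home_team, away_team, value) %>%
  group_by(home_team, away_team) %>%
  # number the goals within each match so every goal keeps its own row
  mutate(row = row_number()) %>%
  # spread back to wide format (defaults: names_from = name, values_from = value)
  pivot_wider()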

#  home_team away_team home_goals away_goals   row home_goal_mins away_goal_mins
#  <chr>     <chr>          <dbl>      <dbl> <int>          <int>          <int>
#1 Team A    Team C             2          1     1              5             NA
#2 Team A    Team C             2          1     2             NA             15
#3 Team A    Team C             2          1     3             60             NA
#4 Team B    Team C             0          2     1             NA             20
#5 Team B    Team C             0          2     2             NA             40
#6 Team B    Team D             1          0     1             80             NA
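For readers without the tidyverse, the same one-row-per-goal idea can be sketched in base R. This is not the answerer's method, just an illustration under stated assumptions: the helper `parse_mins` and the output columns `scored_by` and `minute` are hypothetical names, and the result is kept in long format (one `minute` column) rather than the wide layout shown above.

```r
df <- data.frame(
  home_team      = c("Team A", "Team B", "Team B"),
  away_team      = c("Team C", "Team C", "Team D"),
  home_goal_mins = c("5 60", "NA", "80"),
  away_goal_mins = c("15", "20 40", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on spaces, convert to numeric,
# and drop the NA produced by the literal string "NA"
parse_mins <- function(x) {
  mins <- suppressWarnings(as.numeric(strsplit(x, " ", fixed = TRUE)[[1]]))
  mins[!is.na(mins)]
}

# Build one small data frame per match, then stack them
goals <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  home <- parse_mins(df$home_goal_mins[i])
  away <- parse_mins(df$away_goal_mins[i])
  n <- length(home) + length(away)
  out <- data.frame(
    home_team = rep(df$home_team[i], n),
    away_team = rep(df$away_team[i], n),
    scored_by = c(rep("home", length(home)), rep("away", length(away))),
    minute    = c(home, away),
    stringsAsFactors = FALSE
  )
  out[order(out$minute), ]  # chronological order within the match
}))
rownames(goals) <- NULL

goals
```

Each row of `goals` is a single goal, with `scored_by` recording which side scored, so the home/away split is kept in one column instead of two mostly-NA columns.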
