R: Spread a long single-column data frame into two columns, splitting by alpha and numeric, ignoring punctuation

I have a large dataset that contains runs of keywords, each run eventually followed by a value. I have managed to read the data in from a PDF, and am left with data that looks like the following:

  myData <- c("adjuster", "7", "hours", "rate", "oct 2 - 16," , "19", "hours", "rate", "_NA_NA_NA_NA_", "total", "gross", "pay", "6500", "_NA_NA_NA_table",  "NA_copy", "of", "9.16.19 to 9.30.19.xlsx_NA")

myDataDF <- as.data.frame(myData)

My goal is to 'spread' that single column of character data into two columns: one for the alpha values (the keywords), the second for the numeric value that follows them. I would like to carry the punctuation over, but not use it as a means of separating keywords from values, because some of the numeric values contain punctuation. The keywords should be collapsed (joined with a space) until a numeric value is found, which is then placed in the value column.

I have tried a number of things with this data in different formats (long strings and string splitting), but this format seems the cleanest and most conducive to reaching the end goal: data I can actually analyze and perform calculations on. I just don't know how to express "keep collapsing until you hit a number" in R.

Ultimately, it would be nice if the result looked like this:

+==========================================+============================+
|                 keyword                  |           value            |
+==========================================+============================+
| adjuster                                 | 7                          |
+------------------------------------------+----------------------------+
| hours rate oct 2 - 16                    | 19                         |
+------------------------------------------+----------------------------+
| hours rate _NA_NA_NA_NA_ total gross pay | 6500                       |
+------------------------------------------+----------------------------+
| _NA_NA_NA_table NA_copy of               | 9.16.19 to 9.30.19.xlsx_NA |
+------------------------------------------+----------------------------+

The pattern for the last row is not very clear. Based on the data, we could create a grouping column by detecting the elements of the 'myData' column that are either purely numeric or contain 'xlsx', and then summarise each group by pasting together every element except the last as the keyword and taking the last element as the value.

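library(dplyr)
library(stringr)
myDataDF %>% 
     # a group ends at a purely numeric element or one containing 'xlsx';
     # lag() shifts the running count so each value stays with its keywords
     group_by(grp = lag(cumsum(str_detect(myData, '^\\d+$|xlsx')), 
          default = 0)) %>% 
     # keyword: all but the last element of the group; value: the last element
     summarise(keyword = str_c(myData[-n()], collapse = ' '), 
               value = last(myData), .groups = 'drop') %>% 
     select(-grp)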

-output

# A tibble: 4 x 2
#  keyword                                  value                     
#  <chr>                                    <chr>                     
#1 adjuster                                 7                         
#2 hours rate oct 2 - 16,                   19                        
#3 hours rate _NA_NA_NA_NA_ total gross pay 6500                      
#4 _NA_NA_NA_table NA_copy of               9.16.19 to 9.30.19.xlsx_NA
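
To make the grouping step easier to follow, here is a minimal base R sketch of the same idea that exposes the intermediate vectors. It is an illustration rather than part of the original answer: the names `is_value`, `grp`, `pieces`, and `result` are made up for this example, and it assumes, as above, that a value is any element that is purely digits or contains 'xlsx'.

```r
# Same input as in the question.
myData <- c("adjuster", "7", "hours", "rate", "oct 2 - 16,", "19", "hours",
            "rate", "_NA_NA_NA_NA_", "total", "gross", "pay", "6500",
            "_NA_NA_NA_table", "NA_copy", "of", "9.16.19 to 9.30.19.xlsx_NA")

# TRUE where an element closes a group: purely digits, or contains "xlsx"
# (assumed rule, mirroring the regex used with str_detect above).
is_value <- grepl("^[0-9]+$|xlsx", myData)

# cumsum() counts the values seen so far; shifting the count by one position
# keeps each value in the same group as the keywords that precede it.
grp <- c(0, head(cumsum(is_value), -1))

# Within each group, the last element is the value and the rest are keywords.
pieces <- split(myData, grp)
result <- data.frame(
  keyword = vapply(pieces, function(x) paste(head(x, -1), collapse = " "),
                   character(1)),
  value   = vapply(pieces, function(x) tail(x, 1), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

Printing `grp` next to `myData` shows which elements end up in the same row. Note that "oct 2 - 16," is not matched as a value because it is not purely digits, so it stays with the keywords, trailing comma included.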
