Splitting a data frame into appropriate columns

Question

I have extracted data from a PDF and now have a data frame (my_data) of 3085 observations (characters) of 1 variable (stuff). Here are two rows:

2012 Q-2 1004001115648091 001 2011-12-02 10 000,00 $ 2 500,00 $ 10,00 $ 495,65 $ 13 005,65 $

2012 Q-2 r.19 1004001113343232 001 2009-11-05 50 000,00 $ 2 900,00 $ 10,00 $ 52 910,00 $

How do I separate this into 11 variables, as it was originally in the PDF, and fill the blanks with NAs? A good separation would look like this:

2012 / Q-2 / NA / 1004001115648091 / 001 / 2011-12-02 / 10 000,00 $ / 2 500,00 $ / 10,00 $ / 495,65 $ / 13 005,65 $

2012 / Q-2 / r.19 / 1004001113343232 / 001 / 2009-11-05 / 50 000,00 $ / 2 900,00 $ / 10,00 $ / 52 910,00 $

I am trying to find a way to do it with separate(), but I don't have a good grasp of regular expressions. The best I could achieve so far, based on an online blog, was this:

my_data %>% 
  separate(stuff, c("A","B", "C", "D", "E", "F", "G", "H", "I", "K", "L"), sep = "\\s")

This creates a separation at every whitespace character. That is problematic because it separates the $ from the amounts and splits a value such as 1 000 into two different columns. It also does not fill the blank with NA when a value is missing; instead, everything shifts left to fill the gap.
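The over-splitting is easy to reproduce on the two sample rows. The following is a minimal base R sketch, not part of the attempt above; the names `rows`, `pieces` and `n_pieces` are placeholders chosen for illustration. It counts the pieces produced by splitting at every whitespace character.

```r
# The two sample rows from above (placeholder name: rows)
rows <- c(
  "2012 Q-2 1004001115648091 001 2011-12-02 10 000,00 $ 2 500,00 $ 10,00 $ 495,65 $ 13 005,65 $",
  "2012 Q-2 r.19 1004001113343232 001 2009-11-05 50 000,00 $ 2 900,00 $ 10,00 $ 52 910,00 $")

# Split at every single whitespace character, using the same pattern
# that was passed to separate() as sep
pieces <- strsplit(rows, "\\s")

# Number of pieces per row; neither row yields the 11 fields wanted
n_pieces <- lengths(pieces)
print(n_pieces)
```

Each amount such as 10 000,00 $ is broken into several pieces, and the two rows end up with different numbers of pieces, so they cannot line up with 11 fixed columns.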

Answer 1

Try this. However, as is always the case with regular expressions built from a small sample, I'm not sure it covers all cases.

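# The two sample rows from the question
data <- c(
"2012 Q-2 1004001115648091 001 2011-12-02 10 000,00 $ 2 500,00 $ 10,00 $ 495,65 $ 13 005,65 $",
"2012 Q-2 r.19 1004001113343232 001 2009-11-05 50 000,00 $ 2 900,00 $ 10,00 $ 52 910,00 $")

# Capture groups: 4-digit year, quarter, an optional token without spaces
# (e.g. "r.19"), 16-digit number, 3-digit number, date, then three amounts
r <- regexec(paste0(
"(\\d{4}) (Q-\\d) (?:([^ ]+) )?(\\d{16}) (\\d{3}) (\\d{4}-\\d{2}-\\d{2}) ",
"(-?\\d{1,3}(?: \\d{3})*,\\d{2} \\$) (-?\\d{1,3}(?: \\d{3})*,\\d{2} \\$) ",
"(-?\\d{1,3}(?: \\d{3})*,\\d{2} \\$)"), data)

# Bind the matches row-wise and drop column 1, which holds the full match
do.call(rbind, regmatches(data, r))[,-1]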
#>      [,1]   [,2]  [,3]   [,4]               [,5]  [,6]         [,7]         
#> [1,] "2012" "Q-2" ""     "1004001115648091" "001" "2011-12-02" "10 000,00 $"
#> [2,] "2012" "Q-2" "r.19" "1004001113343232" "001" "2009-11-05" "50 000,00 $"
#>      [,8]         [,9]     
#> [1,] "2 500,00 $" "10,00 $"
#> [2,] "2 900,00 $" "10,00 $"
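
The pattern above stops after the third amount, so it returns 9 of the 11 columns, and a missing value appears as "" rather than NA. One possible extension is sketched below; it has only been reasoned through on the two sample rows, not on all 3085. It assumes that the last amount on each line is always present and that only the fourth amount can be absent, which is what the two sample rows suggest. The column names are placeholders invented for illustration, since the question does not give the real ones.

```r
data <- c(
  "2012 Q-2 1004001115648091 001 2011-12-02 10 000,00 $ 2 500,00 $ 10,00 $ 495,65 $ 13 005,65 $",
  "2012 Q-2 r.19 1004001113343232 001 2009-11-05 50 000,00 $ 2 900,00 $ 10,00 $ 52 910,00 $")

# One amount such as "10 000,00 $": optional minus sign, groups of three
# digits separated by spaces, two decimals, then " $"
amount <- "-?\\d{1,3}(?: \\d{3})*,\\d{2} \\$"

pattern <- paste0(
  "^(\\d{4}) (Q-\\d) (?:([^ ]+) )?(\\d{16}) (\\d{3}) (\\d{4}-\\d{2}-\\d{2}) ",
  "(", amount, ") (", amount, ") (", amount, ") ",
  "(?:(", amount, ") )?",  # ASSUMPTION: only the fourth amount may be missing
  "(", amount, ")$")       # ASSUMPTION: the last amount is always present

m <- regmatches(data, regexec(pattern, data))

# A matching line yields the full match plus 11 groups; a line that does
# not match yields character(0), so stop rather than drop it silently
stopifnot(lengths(m) == 12)

fields <- do.call(rbind, m)[, -1, drop = FALSE]
fields[fields == ""] <- NA  # unmatched optional groups become NA

# Placeholder column names, not taken from the original PDF
colnames(fields) <- c("year", "quarter", "ref", "id", "seq", "date",
                      "amt1", "amt2", "amt3", "amt4", "total")
out <- as.data.frame(fields, stringsAsFactors = FALSE)
print(out)
```

On the full data set, any line that does not fit the pattern makes the `stopifnot()` check fail, which is a cue to inspect those lines and adjust the pattern.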

(Answer 1, 2022-12-19 23:59:39)
