
Missing data when separating a column into multiple columns in R

I scraped a table from a PDF and everything went into one element of the data frame. I managed to split it into separate columns, but the values ended up under the wrong column names. The first column is "State" and should contain all the state names, but after separation it is blank. The second column is "State Drug Formulary," and after separation it incorrectly contains the state names. A lot of other information is missing as well. Any possible fixes?

For simplicity, I renamed the column "x".

library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session%3A42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"

out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
  rename(x = `out[[93]]`) %>% 
    mutate(x=strsplit(x, "\n")) %>%
    unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

What the table should look like: see page 82 of the original document if you visit the source.

I also tried this, which kept the column names but removed the data:

df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

Page 82 includes other content besides the table, such as the heading "21. Drug Product Selection Laws".

It is better to remove that first, for example:

dummy <- strsplit(df$`out[[93]]`, '\\n\n')

This splits the page into four parts; the table you are looking for is the second of them.
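As a minimal sketch of that splitting step, here is the same idea on a toy string. The `page` value is a hypothetical stand-in for one element of the `pdf_text()` output, not the real page, and it has three blocks rather than four.

```r
library(stringr)

# Hypothetical stand-in for one page of pdf_text() output:
# three blocks separated by blank lines (two consecutive newlines).
page <- "21. Drug Product Selection Laws\n\n  State      Formulary\n  Alabama    None\n  Alaska     None\n\nFootnotes"

# Split on a blank line; simplify = TRUE returns a character matrix
# with one column per block.
parts <- str_split(page, "\n\n", simplify = TRUE)

# In this toy page the table is the second block.
table_block <- parts[2]

# One element per line of the table block.
rows <- strsplit(table_block, "\n")[[1]]
rows
```

On the real page, the number of blocks and the position of the table may differ, so it is worth printing the parts before picking one.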

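df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  # keep only the second blank-line-separated block, which holds the table
  mutate(x = stringr::str_split(x, '\\n\n', simplify = TRUE)[2]) %>%
  # one row per line of text
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  # drop the first three rows
  .[-c(1:3), ]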

Now df2 holds the table contents. Splitting each row on runs of two or more whitespace characters,

df2 %>% separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

gives the result. 'a' is a dummy column: separate() produces a blank value at the front of each row, and this column catches it before being dropped. Here is part of the result.

  State   `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
   <chr>   <chr>            <chr>            <chr>            <chr>           
 1 Alabama None             Yes              P, BBB           A               
 2 Alaska  None             No               P                B               
 3 Arizona None             No               P                I               
 4 Arkans… None             No               P                B               
 5 Califo… None             No               P                EE              
 6 Colora… None             No               P                J               
 7 Connec… None             No               P                E, F            
 8 Delawa… None             No               P                E               
 9 Distri… Positive         No               P                B               
10 Florida Negative L       No               M                B   
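
To illustrate why the separator matters, here is a small sketch on made-up rows. The `toy` data and the shortened column names are assumptions for illustration only. The default `sep` of `separate()` is `"[^[:alnum:]]+"`, which splits on every run of non-alphanumeric characters, including single spaces and the leading whitespace, so the first piece is empty and multi-word values are broken up.

```r
library(tidyr)

# Hypothetical rows mimicking the PDF text: leading spaces, columns separated
# by two or more spaces, and single spaces inside some values.
toy <- data.frame(x = c("   Alabama                None          Yes",
                        "   District of Columbia   Positive      No",
                        "   Florida                Negative L    No"))

# Default separator: the leading whitespace yields an empty first piece, so
# "State" is blank and the state names shift into the next column.
# Extra pieces are dropped with a warning, suppressed here.
bad <- suppressWarnings(
  separate(toy, x, into = c("State", "Formulary", "Two_line"))
)

# Splitting on runs of 2+ whitespace characters keeps multi-word values intact.
# The dummy column "a" catches the empty leading piece and is then removed.
good <- separate(toy, x, into = c("a", "State", "Formulary", "Two_line"),
                 sep = "\\s{2,}")
good$a <- NULL

bad
good
```

This only works while columns are separated by at least two spaces and values contain at most single spaces, which appears to hold for the rows shown above.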

To do it in one pipeline starting from df:

df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

You may also try this, reading the page straight from the PDF:

# name the column x directly; the `out[[93]]` column name does not exist here
data.frame(x = pdf_text(url4)[[93]]) %>%
  mutate(x = stringr::str_split(x, '\\n\n', simplify = TRUE)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest(x) %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)
