
Most efficient way to separate character columns into rows and combine multiple columns into one column in R

UPDATED

I scraped a table from a web page, but it wasn't actually structured as a table. I managed to split the character strings into multiple rows, but for future reference I would like to know a more efficient way to do this for larger data sets.

I was also able to get everything into one column, but the code as a whole is wildly inefficient. Any suggestions for improvement?

library(rvest)
library(tidyverse)
library(dplyr)

url = "https://www.ncsl.org/research/health/state-laws-and-legislation-related-to-biologic-medications-and-substitution-of-biosimilars.aspx"
webpage=read_html(url)

mandatory_2014 = webpage %>% 
  html_element(css = "#dnn_ctr84472_HtmlModule_lblContent > div > table:nth-child(15)") %>% 
  html_table()
mandatory_2014 = data.frame(mandatory_2014)

df = mandatory_2014 %>% 
  mutate(X1=strsplit(X1, "\n\n\t\t\t")) %>% 
  unnest(X1) %>% 
  mutate(X2=strsplit(X2, "\n\n\t\t\t")) %>% 
  unnest(X3)%>% 
  mutate(X3=strsplit(X3, "\n\n\t\t\t")) %>% 
  unnest(X3)
df = df[-c(2)]
df = stack(df)
df = df[-c(2)]
df = data.frame(df[!duplicated(df),])
df = rename(df, States = df..duplicated.df....)

This can be done more easily in base R. Unlist the columns into a vector, replace each run of newlines followed by tabs (\n+\t+) with a single comma, and remove everything from the first ( onward. Then use either strsplit or scan to split the string into individual elements (with , as the delimiter), apply trimws to strip any remaining leading/trailing whitespace, and store the result as a data.frame column.

out <- data.frame(States = trimws(scan(text = sub("\\s+\\(.*", "",
   gsub("(\\n+\\t+)", ",", mandatory_2014)), what="", sep=",")))

-output

> out
           States
1         Florida
2          Kansas
3        Kentucky
4   Massachusetts
5       Minnesota
6     Mississippi
7          Nevada
8      New Jersey
9        New York
10   Pennsylvania
11    Puerto Rico
12   Rhode Island
13     Washington
14  West Virginia 
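
Because the one-liner above depends on a live page, here is a minimal sketch of the same steps written out one at a time on a small stand-in data frame. The object `mock` and its contents are hypothetical (made up for illustration, not the scraped data), and `out_mock` is just a name chosen here. One deliberate difference: the parenthetical notes are stripped after splitting rather than before, so a note in the middle of a cell does not take the names after it with it. That ordering is my assumption about what is wanted, not part of the answer above.

```r
# Hypothetical stand-in for the scraped table: one row, three columns,
# each cell holding several names separated by newline/tab runs.
mock <- data.frame(
  X1 = "Florida\n\n\t\t\tKansas (2017)\n\n\t\t\tKentucky",
  X2 = "Nevada\n\n\t\t\tNew Jersey",
  X3 = "Washington\n\n\t\t\tWest Virginia",
  stringsAsFactors = FALSE
)

# 1. Flatten all columns into a single character vector.
cells <- unlist(mock, use.names = FALSE)

# 2. Replace each run of newlines followed by tabs with a comma.
joined <- gsub("\\n+\\t+", ",", cells)

# 3. Split on commas and flatten to one element per name.
pieces <- unlist(strsplit(joined, ",", fixed = TRUE))

# 4. Drop any trailing parenthetical note, then trim whitespace.
states <- trimws(sub("\\s*\\(.*$", "", pieces))

# 5. Remove empty strings and duplicates, and store as one column.
out_mock <- data.frame(
  States = unique(states[nzchar(states)]),
  stringsAsFactors = FALSE
)
out_mock
```

All of these functions are vectorised over the whole character vector, so there is no row-by-row `unnest()` step and no intermediate cross product of rows to deduplicate afterwards.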
