
Building a dataframe by parsing character vectors in R

I'm new to R and struggling with the construction of a dataset out of a museum's collection.

After scraping their website, I have a list of character vectors (let's say the name is "characteristics") in which each element looks like this:

[[4729]]
[1] " Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto"

From these vectors, I want to make a data frame that looks like this:

     year    medium           dimensions    credit line    number
1   2002     Pencil on paper   etc...

However, I can't manage to extract the necessary data from the character vectors, as I'm struggling with the regexes needed to do this. The idea would be to fetch what comes after "Date://" and before "Medium://". To make matters more complicated, not every element in the list has the same characteristics in the same order (e.g. some elements only have "date" and "medium", while others include "edition://", "acquired through://", etc.).

A list of the years was pretty easy to compile by just saving the first 4 digits in each list element:

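library(stringr)  # provides str_extract()

year <- list()

for (p in seq_along(characteristics)) {
  # [[p]] pulls out the character vector itself rather than a one-element list
  string <- as.character(characteristics[[p]])
  # keep the first run of four digits, i.e. the year
  year <- c(year, str_extract(string, "\\d{4}"))
}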

This is probably not even the fastest way to do it, but it does the job well. However, I'm completely stuck on extracting the other variables out of the list.
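
For what it's worth, the year extraction can also be done without the loop. Below is a minimal base R sketch, assuming each list element holds a single string. The two-element `characteristics` list is made up for illustration, and elements without a four-digit run come back as `NA`.

```r
# Made-up sample standing in for the scraped list.
characteristics <- list(
  " Date://2002 Medium://Pencil on paper",
  " Medium://Oil on canvas"
)

# Flatten the list into a plain character vector, one string per element.
strings <- vapply(characteristics, function(el) as.character(el)[1], character(1))

# Locate the first run of four digits in every string at once.
m <- regexpr("[0-9]{4}", strings)

# regmatches() only returns the strings that matched, so fill the rest with NA.
year <- rep(NA_character_, length(strings))
year[m != -1] <- regmatches(strings, m)
```

This avoids growing `year` one element at a time, which is usually what makes such loops slow on long lists.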

Maybe good old read.table is an option, too:

txt <- c("Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto",
         "Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto")
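# Turn every "Label://" (plus the "Credit" of "Credit Line") into a tab, then read
# the result as tab-separated text; [A-Za-z] avoids the stray punctuation that [A-z] also matches.
read.table(text = gsub("( Credit)?\\s?[A-Za-z]+://", "\t", txt), sep = "\t", quote = "", col.names = letters[1:7])[-1]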
#      b               c                                 d                                                                           e      f                        g
# 1 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
# 2 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
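
The `read.table` route relies on every record having the same fields in the same order, which the question says is not the case. It also lets `read.table` guess the column types, which is why the number shows up as 1563.2, and it leaves "MoMA" at the end of the credit line. A possible alternative is sketched below, assuming the set of labels is known in advance. The `fields` vector, the `parse_record` helper, and the second sample record are all hypothetical and only there for illustration.

```r
# Hypothetical label list: extend it with every label the scrape produced.
fields <- c("Date", "Medium", "Dimensions", "Credit Line", "MoMA Number",
            "Copyright", "Edition", "Acquired through")

# Parse one "Label://value Label://value ..." string into a named character
# vector with one slot per known label (NA where the label is absent).
parse_record <- function(x, fields) {
  pattern <- paste0("(", paste(fields, collapse = "|"), ")://")
  m <- gregexpr(pattern, x, ignore.case = TRUE)
  # Labels found, in order of appearance, with the "://" suffix removed.
  found <- sub("://$", "", regmatches(x, m)[[1]])
  # Text between labels; the first piece precedes the first label, so drop it.
  values <- trimws(regmatches(x, m, invert = TRUE)[[1]][-1])
  out <- setNames(rep(NA_character_, length(fields)), fields)
  # Map each found label to its canonical spelling, ignoring case.
  out[fields[match(tolower(found), tolower(fields))]] <- values
  out
}

characteristics <- list(
  " Date://2002 Medium://Pencil on paper Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005",
  " medium://Oil on canvas Date://1950"  # made-up record: fewer fields, other order
)

# One named vector per record, stacked into a character matrix, then a data frame.
rows <- lapply(characteristics, function(el) parse_record(as.character(el), fields))
collection <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Every column stays character, so identifiers such as 1563.2005 are kept verbatim. Columns with spaces in their names can be reached with `collection[["Credit Line"]]`. Multi-word labels are the reason for the explicit `fields` list: without it there is no reliable way to tell where a value ends and the next label begins.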
