Building a dataframe by parsing character vectors in R
I'm new to R and struggling to construct a dataset from a museum's collection.
After scraping their website, I have a list of character vectors (let's say the name is "characteristics") in which each element looks like this:
[[4729]]
[1] " Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto"
From these vectors, I want to make a dataframe that looks like this:
year medium dimensions credit line number
1 2002 Pencil on paper etc...
However, I can't manage to extract the necessary data from the character vectors, as I'm struggling with the regexes needed to do this. The idea would be to fetch what comes after "Date://" and before "Medium://". To make matters more complicated, not every element in the list has the same characteristics in the same order (e.g. some elements only have "date" and "medium", while others include "edition://", "acquired through://", etc.).
A list of the years was pretty easy to compile by just saving the first 4 digits in each list element:
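The "after one marker, before the next" idea can be sketched in base R with a lazy capture group. This is only a minimal illustration: the helper name `between` and the shortened sample string are made up for the example, and the markers are used as regular expressions, so a label containing regex metacharacters would need escaping first.

```r
# Sample string shortened from the record shown above (illustration only)
x <- " Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm)"

# Hypothetical helper: return the text between two markers, or NA if the
# pair of markers does not occur in that order
between <- function(x, start, end) {
  # (.*?) captures as little as possible; \\s* swallows the space before `end`
  hits <- regmatches(x, regexec(paste0(start, "(.*?)\\s*", end), x, perl = TRUE))
  vapply(hits,
         function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1))
}

between(x, "Date://", "Medium://")
between(x, "Medium://", "Dimensions://")
between(x, "Edition://", "Medium://")   # marker absent, so NA
```

Because this relies on a fixed pair of neighbouring markers, it only helps when the fields appear in a known order; it does not by itself solve the problem of records with differing fields.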
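library(stringr)  # provides str_extract()

year <- list()
for (p in seq_along(characteristics)) {
  # take the p-th element itself, not a one-element sub-list
  string <- as.character(characteristics[[p]])
  # keep the first run of four digits
  year <- c(year, str_extract(string, "\\d{4}"))
}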
This is probably not the fastest way to do it, but it does the job. However, I'm completely stuck on extracting the other variables from the list.
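For what it's worth, the year extraction does not need a loop, since R's regex functions are vectorised. The sketch below uses base R only and anchors the match to "Date://" rather than taking the first four digits anywhere in the string. The three sample records are invented stand-ins for the real `characteristics` list, and the code assumes each list element is a single string.

```r
# Invented stand-in for the scraped list (one string per element)
characteristics <- list(
  " Date://2002 Medium://Pencil on paper",
  " Date://1987 Medium://Oil on canvas",
  " Medium://Bronze"                      # no date at all
)

strings <- unlist(characteristics)

# Four digits directly preceded by "Date://" (lookbehind needs perl = TRUE)
m <- regexpr("(?<=Date://)\\d{4}", strings, perl = TRUE)

# regmatches() returns only the hits, so fill them into an NA vector
year <- rep(NA_character_, length(strings))
year[m != -1] <- regmatches(strings, m)

year
```

Records without a date end up as `NA` instead of silently picking up digits from another field such as the dimensions or the copyright line.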
Maybe good old read.table is an option, too:
txt <- c("Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto",
"Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto")
# turn every "Label://" marker into a tab, then read the result as tab-separated text
read.table(text = gsub("( Credit)?\\s?[A-Za-z]+://", "\t", txt), sep = "\t", quote = "", col.names = letters[1:7])[-1]
# b c d e f g
# 1 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
# 2 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
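Two things in this output are worth noting: the object number is read as a number (which is why it prints as 1563.2), and "MoMA" stays attached to the credit line because only the last word of a label is matched. Passing `colClasses = "character"` to read.table should take care of the first point. For records whose fields differ or come in a different order, one possible alternative is to split on a known set of labels and fill missing fields with NA. The sketch below assumes such a label list is available; the `labels` vector, the `parse_record` helper and the two shortened sample records (the second one invented) are illustrative, not part of the original answer.

```r
# Assumed set of field labels; extend it with "Edition", "Acquired through", etc.
labels <- c("Date", "Medium", "Dimensions", "Credit Line", "MoMA Number", "Copyright")

# Hypothetical helper: split one record into a named character vector
parse_record <- function(x, labels) {
  out <- setNames(rep(NA_character_, length(labels)), labels)
  m <- gregexpr(paste0("(", paste(labels, collapse = "|"), ")://"), x,
                ignore.case = TRUE)[[1]]
  if (m[1] == -1) return(out)                        # no known label found
  starts <- as.integer(m)
  lens   <- attr(m, "match.length")
  keys   <- substring(x, starts, starts + lens - 4)  # label without "://"
  # each value runs from the end of its label to the start of the next one
  vals   <- trimws(substring(x, starts + lens, c(starts[-1] - 1, nchar(x))))
  out[match(tolower(keys), tolower(labels))] <- vals
  out
}

txt <- c(
  " Date://2002 Medium://Pencil on paper Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005",
  "Medium://Oil on canvas Date://1999"   # invented record, different order
)

df <- as.data.frame(do.call(rbind, lapply(txt, parse_record, labels = labels)),
                    stringsAsFactors = FALSE)
df
```

All columns stay character here, so the object number keeps its full text; columns such as the year can be converted afterwards with `as.integer()` if needed.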