
Building a dataframe by parsing character vectors in R

I'm new to R and struggling to build a dataset from a museum's collection.

After scraping their website, I have a list of character vectors (let's say it is named "characteristics") in which each element looks like this:

[[4729]]
[1] " Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto"

From these vectors, I want to make a dataframe that looks like this:

     year    medium           dimensions    credit line    number
1   2002     Pencil on paper   etc...

However, I can't manage to extract the necessary data from the character vectors, because I'm struggling with the regexes needed to do this. The idea would be to fetch what comes after "Date://" and before "Medium://". To make matters more complicated, not every element in the list has the same characteristics in the same order (e.g. some elements only have "date" and "medium" while others include "edition://", "acquired through://", etc.).

A list of the years was pretty easy to compile by just saving the first 4 digits in each list element:

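library(stringr)  # provides str_extract()

year <- character(0)

for (p in seq_along(characteristics)) {
  string <- as.character(characteristics[[p]])
  # keep the first run of four digits, which is taken to be the year
  year <- c(year, str_extract(string, "\\d{4}"))
}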

This is probably not the fastest way to do it, but it does the job well. However, I'm completely stuck on extracting the other variables from the list.
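For comparison, the year extraction can be written without a loop. This is only a sketch in base R: the two sample strings are made up for illustration, and it assumes each list element holds a single string and that the year is the first four-digit run after "Date://".

```r
# Hypothetical stand-in for the scraped list; the real `characteristics`
# comes from the museum's website.
characteristics <- list(
  " Date://2002 Medium://Pencil on paper",
  " Medium://Oil on canvas Date://c. 1913"
)

# Flatten the list to a character vector (assumes one string per element).
strings <- vapply(characteristics, function(x) as.character(x)[1], character(1))

# Capture the first four-digit run that follows "Date://".
pattern <- "Date://[^0-9]*([0-9]{4})"
m <- regmatches(strings, regexec(pattern, strings))

# Element 2 of each match is the captured group; NA when nothing matched.
year <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
               character(1))
```

Anchoring on "Date://" rather than taking the first four digits anywhere in the string avoids picking up digits from another field when the date is missing or not listed first.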

Maybe good old read.table is an option, too:

txt <- c("Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto",
         "Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto")
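# Replace every "Label://" marker with a tab, then read the result as tab-separated text.
# [A-Za-z] is used because [A-z] also matches the punctuation between "Z" and "a".
read.table(text = gsub("( Credit)?\\s?[A-Za-z]+://", "\t", txt), sep = "\t", quote = "", col.names = letters[1:7])[-1]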
#      b               c                                 d                                                                           e      f                        g
# 1 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
# 2 2002 Pencil on paper 22 1/2 x 30 1/8" (57.2 x 76.5 cm) The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA 1563.2 © 2015 Steve DiBenedetto
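Because the question says that records do not all carry the same fields in the same order, a label-driven parser may be more robust than fixed column positions. The following is a sketch, not a tested solution for the real data: the `labels` vector is an assumption pieced together from the labels mentioned in the question, `parse_record` is a hypothetical helper name, and the second sample record is made up.

```r
# Labels assumed to occur on the site; extend this as new ones turn up.
labels <- c("Date", "Medium", "Dimensions", "Credit Line", "MoMA Number",
            "Copyright", "Edition", "Acquired through")

parse_record <- function(x, labels) {
  out <- setNames(rep(NA_character_, length(labels)), labels)
  # Locate every "Label://" marker, wherever it sits in the string.
  pattern <- paste0("(", paste(labels, collapse = "|"), ")://")
  hits <- gregexpr(pattern, x, ignore.case = TRUE)[[1]]
  if (hits[1] == -1) return(out)
  starts <- as.integer(hits)
  lens <- attr(hits, "match.length")
  keys <- substring(x, starts, starts + lens - 4)  # drop the trailing "://"
  # Each value runs from the end of its marker to the start of the next one.
  vals <- trimws(substring(x, starts + lens, c(starts[-1] - 1, nchar(x))))
  out[match(tolower(keys), tolower(labels))] <- vals
  out
}

txt <- c(
  " Date://2002 Medium://Pencil on paper Dimensions://22 1/2 x 30 1/8\" (57.2 x 76.5 cm) Credit Line://The Judith Rothschild Foundation Contemporary Drawings Collection Gift MoMA Number://1563.2005 Copyright://© 2015 Steve DiBenedetto",
  " Medium://Oil on canvas Date://1913"  # made-up record: fewer fields, other order
)

# One named character vector per record, stacked into a data frame.
rows <- lapply(txt, parse_record, labels = labels)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Fields missing from a record come out as NA, and every column stays character, so a number such as 1563.2005 is kept as text rather than being read as a numeric value.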
