R - raw text to a data.frame
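I am working with raw text data from a scanned catalogue. I want to convert my character vector into a data.frame object. My vector consists of people, listed in alphabetical order, each of whom produced one or more works.
- Names are in capital letters.
- Each work is numbered.
- The work numbers are consecutive.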
AADFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida.
Expected result 1
Author Work
AADFDS 1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB 2 Nulla sollicitudin elit in purus egestas, in placerat velit
BBDDED 3 Nunc et eros eget turpis sollicitudin mollis id et
BBDDED 4 Mauris condimentum velit eu consequat feugiat.
BBDDED 5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF 6 Sed cursus augue in tempus scelerisque.
CCDDFSF 7 in commodo enim in laoreet gravida.
Expected result 2, with one column per work
Author | Work1 | Work2 | Work3 | Work(x)
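The data is imported into R with:
readLines("clipboard", encoding = "latin1")
I am able to identify the lines containing artist names in capital letters using various regular expressions,
for example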
^[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
I am able to identify the lines containing works with
^[0-9]+[\\s]
Any help would be greatly appreciated.
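As a quick check of the classification idea in the question, the sketch below applies two patterns to a few of the sample lines with `grepl`. Both patterns are simplified stand-ins and are assumptions, not the asker's exact expressions: `author_pattern` is an ASCII-only version of the accented-letter pattern, and `work_pattern` spells the whitespace test with the POSIX class `[[:space:]]`.

```r
# Sample lines copied from the question
lines <- c("AADFDS",
           "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
           "BBDDED",
           "3 Nunc et eros eget turpis sollicitudin mollis id et mi.",
           "4 Mauris condimentum velit eu consequat feugiat.")

# Assumption: ASCII-only stand-in for the longer author pattern
author_pattern <- "^[A-Z][A-Z |']"
# Assumption: POSIX class [[:space:]] used for the whitespace test
work_pattern <- "^[0-9]+[[:space:]]"

is_author <- grepl(author_pattern, lines)
is_work <- grepl(work_pattern, lines)

# Show the first 20 characters of each line next to its flags
data.frame(line = substr(lines, 1, 20), is_author, is_work)
```

Every line should be flagged by exactly one of the two patterns; a line flagged by neither or by both would point to a scanning artefact worth inspecting by hand.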
This gives the correct result for your sample data.
txt="
AADFDS
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.
AB
2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.
BBDDED
3 Nunc et eros eget turpis sollicitudin mollis id et mi.
4 Mauris condimentum velit eu consequat feugiat.
5 Suspendisse sit amet metus vitae est eleifend tincidunt.
CCDDFSF
6 Sed cursus augue in tempus scelerisque.
7 in commodo enim in laoreet gravida."
last_author=""
author_count=0
#the first scan splits the data by line, i.e., sep="\n"
#then for each line, we split by whitespace, i.e., sep=" "
#if the first element is numeric we increase the
#respective author's work counter "author_count" and
#we return the work in a data.frame
#if the first element is non-numeric, we have
#encountered a new author
#we store the new author name in "last_author"
#(and remove trailing whitespaces at the end)
result1=do.call("rbind",
  lapply(as.list(scan(text=txt,
                      what="character",
                      sep="\n",
                      quiet=TRUE)),
         function(x) {
           tmp=scan(text=x,what="character",sep=" ",quiet=TRUE)
           if (grepl("[0-9]",tmp[1])) {
             author_count<<-author_count+1
             data.frame(Author=last_author,N=author_count,Work=x)
           } else {
             last_author<<-gsub("\\s*$","",x)
             author_count<<-0
             NULL
           }
         }))
#we pivot the data; rows correspond to authors, columns to works
result2=reshape2::dcast(result1,Author~N,value.var = "Work")
#just renaming the columns
colnames(result2)[-1]=paste0("Work",1:(ncol(result2)-1))
result2
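The same long table can probably also be built without the global counters `last_author` and `author_count`. The following is a minimal sketch of one alternative, not the answer's method: it assumes the first line is an author line and uses `cumsum` to assign every line to the most recent author. The names `lines`, `group`, `authors` and `long` are chosen here for illustration.

```r
lines <- c("AADFDS",
           "1 Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
           "AB",
           "2 Nulla sollicitudin elit in purus egestas, in placerat velit iaculis.",
           "BBDDED",
           "3 Nunc et eros eget turpis sollicitudin mollis id et mi.",
           "4 Mauris condimentum velit eu consequat feugiat.",
           "5 Suspendisse sit amet metus vitae est eleifend tincidunt.")

# TRUE for work lines (they start with a digit), FALSE for author lines
is_work <- grepl("^[0-9]", lines)

# Each author line opens a new group, so the running count of author
# lines gives the index of the author that a line belongs to
group <- cumsum(!is_work)
authors <- trimws(lines[!is_work])

long <- data.frame(Author = authors[group[is_work]],
                   Work = lines[is_work],
                   stringsAsFactors = FALSE)

# Number the works 1, 2, ... within each author
long$N <- ave(seq_along(long$Author), long$Author, FUN = seq_along)
long
```

The resulting `long` has the same three columns as `result1`, so it could be passed to the same `dcast` call to obtain the wide layout.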
toydata<- readLines("clipboard")
#find lines beginning with a digit; TRUE flags work lines, FALSE flags author lines
work_id <- grepl("^[0-9]" , toydata)
#rle finds subsequent runs of an element within a vector
RLE <- rle(work_id)
#work_id filters out the lines with author names
#rep(toydata[!work_id],RLE$lengths[RLE$values]) repeats the ...
#... author name (times = number of author's works)
df_toydata <- data.frame(work = toydata[work_id],
                         Author = rep(toydata[!work_id],
                                      RLE$lengths[RLE$values]),
                         stringsAsFactors=FALSE)
#we have to order the data.frame by author just in case
#some author appears again
df_toydata=df_toydata[order(df_toydata$Author),]
#we can now add a column with a numbering of each author's works
df_toydata$N=sequence(rle(df_toydata$Author)$lengths)
#reshape from long to wide format
#we pivot the data; rows correspond to authors, columns to works
df2=reshape2::dcast(df_toydata,Author~N,value.var = "work")
colnames(df2)[-1]=paste0("Work",1:(ncol(df2)-1))
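To make the `rle` step in this answer easier to follow, here is a small sketch on a hand-written logical vector that mirrors the first eight sample lines. The vectors `work_id` and `authors` are typed in by hand for illustration rather than read from the clipboard.

```r
# TRUE marks a work line, FALSE an author line (same meaning as work_id above)
work_id <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
authors <- c("AADFDS", "AB", "BBDDED")

# rle() compresses the vector into runs of identical values
runs <- rle(work_id)

# The lengths of the TRUE runs are the number of works per author
works_per_author <- runs$lengths[runs$values]
works_per_author

# Repeat each author name once per work so it lines up with the work lines
author_col <- rep(authors, works_per_author)
author_col
```

This only lines up if every author line is followed by at least one work line. If two author lines were adjacent, there would be fewer TRUE runs than authors and `rep` would fail on the mismatched lengths.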