Creating a data frame from regmatches in R
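I have searched a lot, but I can't figure out how to turn the output of regmatches into anything I can export. I hope this question is not so specific that it is of no value to the community. I came across a similar question at the following link:
However, I can't work out how to save, export, or build a data frame from the list that regmatches returns. Ideally, each hashtag in a tweet would be saved in its own column. But whatever I try, I get output like this: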
[[6267]]
character(0)
[[6268]]
[1] "#ASCO15"
[[6269]]
[1] "#FDA" "#Fast" "#Track" "#AML" "#Pancreatic"
If I try to export the regmatches result, I get:
Error in data.frame(character(0), character(0), character(0), character(0), :
arguments imply differing number of rows: 0, 8, 2, 3, 5, 1, 4, 7, 6, 9
Thanks
EDIT: Sorry, I probably did a poor job of explaining myself.
dput(hi)
structure(list(text = c("Hooray ! #Wimbledon2Day has plugged its brain back in at last ! No more sub- Top Gear telly #propertenniscoverage",
"gone but never forgotten #TopGear ", "The final episode of 'Top Gear' with Jeremy Clarkson is going to break records http://brbr.co/1JCeJYc\312"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), .Names = "text")
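From this data I want to extract the hashtags (the # and the word that follows it) and assign them to columns. The code from the link above does the first part.
test <- regmatches(hi$text, gregexpr("#(\\d|\\w)+", hi$text))
which gives me: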
[[1]]
[1] "#Wimbledon2Day" "#propertenniscoverage"
[[2]]
[1] "#TopGear"
[[3]]
character(0)
But when I try to inspect or export it, I get:
Error in data.frame(c("#Wimbledon2Day", "#propertenniscoverage"), "#TopGear", :
arguments imply differing number of rows: 2, 1, 0
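The error appears to come from data.frame() treating each list element as a separate column, which requires all elements to have the same length. A minimal check, assuming a shortened stand-in vector `txt` (a hypothetical name, not part of the original data) in place of hi$text:

```r
# Hypothetical, shortened stand-in for hi$text
txt <- c("Hooray ! #Wimbledon2Day has plugged its brain back in #propertenniscoverage",
         "gone but never forgotten #TopGear ",
         "The final episode of 'Top Gear' with Jeremy Clarkson")

test <- regmatches(txt, gregexpr("#(\\d|\\w)+", txt))

# Number of hashtags per tweet; data.frame() needs these to be equal
n_tags <- lengths(test)
print(n_tags)
```

The counts differ per tweet (2, 1 and 0 here), which matches the "differing number of rows" in the error message.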
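If you have a large number of tweets and unique hashtags, you should consider using a sparse matrix. The arules package provides one such sparse matrix object, itemMatrix. You can coerce the list directly into this sparse matrix, without having to write out the unique and sapply steps from the sapply-based answer (which is a nice base R solution, and I gave it +1).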
foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")
ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo)) # extract hashtags from tweet (from other post)
library(arules)
im <- as(ms, "itemMatrix")
# you can retrieve the rows like this
as(im, "matrix")
#   #london2012 #London2012 #MullingarShuffle #NBC #Olympics #tech
# 1           0           1                 1    0         0     0
# 2           1           0                 0    0         0     0
# 3           1           0                 0    1         1     1
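If installing arules is not an option, a long-format data frame is another compact representation that stores only the hashtags that actually occur. This is a base R sketch, not part of the answer above; the `tweets` vector is a shortened, hypothetical stand-in for `foo`, and the column names `tweet` and `hashtag` are assumptions:

```r
# Shortened, hypothetical stand-ins for the tweets in `foo`
tweets <- c("What a day at #London2012 and here in mgar #MullingarShuffle",
            "Atos completes delivery of key IT systems #london2012",
            "The #Olympics sets a ratings record for #NBC #london2012 #tech")

ms <- regmatches(tweets, gregexpr("#(\\d|\\w)+", tweets))

# One row per (tweet, hashtag) pair; tweets without hashtags add no rows
long <- data.frame(
  tweet   = rep(seq_along(ms), lengths(ms)),
  hashtag = unlist(ms),
  stringsAsFactors = FALSE
)
print(long)
```

Each row pairs a tweet index with one hashtag, so the table grows with the number of matches rather than with tweets times unique hashtags.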
Using the example from the linked post,
foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")
ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo)) # extract hashtags from tweet (from other post)
cols <- unique(unlist(ms)) # get unique hashtags
setNames(data.frame(t(sapply(ms, function(i) cols %in% i))), cols)
#   #London2012 #MullingarShuffle #london2012 #Olympics  #NBC #tech
# 1        TRUE              TRUE       FALSE     FALSE FALSE FALSE
# 2       FALSE             FALSE        TRUE     FALSE FALSE FALSE
# 3       FALSE             FALSE        TRUE      TRUE  TRUE  TRUE
The rows correspond to the tweets.
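If the goal is instead one hashtag per column, as the question describes, one possible approach is to pad every list element with NA up to the length of the longest one before binding the rows. This is only a sketch under assumptions: `txt` is a shortened stand-in for hi$text, and the column names `hashtag1`, `hashtag2` and the output file name are placeholders.

```r
# Hypothetical, shortened stand-in for hi$text
txt <- c("Hooray ! #Wimbledon2Day has plugged its brain back in #propertenniscoverage",
         "gone but never forgotten #TopGear ",
         "The final episode of 'Top Gear' with Jeremy Clarkson")

ms <- regmatches(txt, gregexpr("#(\\d|\\w)+", txt))

# Pad each element with NA so that every tweet has the same number of entries
width  <- max(lengths(ms))
padded <- lapply(ms, function(tags) {
  c(tags, rep(NA_character_, width - length(tags)))
})

# Bind the padded vectors as rows: one row per tweet, one column per hashtag slot
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("hashtag", seq_len(width))
print(wide)

# Export to CSV; the path is a placeholder in the session's temporary directory
out_file <- file.path(tempdir(), "hashtags.csv")
write.csv(wide, out_file, row.names = FALSE)
```

Tweets without hashtags become rows of NA, so the row order still lines up with the original tweets.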