Creating a data frame from regmatches in R

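I have searched a lot, but I cannot work out how to convert the output of regmatches into anything I can export. I hope this question is not so specific that it is of no value to the community. I came across a similar question at the following link: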

Extracting hashtags from several tweets using R

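However, I cannot figure out how to save, export, or build a data frame from the list that regmatches returns. Ideally, each hashtag in a tweet would be saved in its own column. But whatever I try, I get output like this: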

[[6267]]
character(0)

[[6268]]
[1] "#ASCO15"

[[6269]]
[1] "#FDA"        "#Fast"       "#Track"      "#AML"        "#Pancreatic"    

If I try to export the result of regmatches, I get:

Error in data.frame(character(0), character(0), character(0), character(0),  : 
  arguments imply differing number of rows: 0, 8, 2, 3, 5, 1, 4, 7, 6, 9 

Thanks

Edit: Sorry, I probably did a poor job of explaining myself.

dput(hi)
structure(list(text = c("Hooray ! #Wimbledon2Day has plugged its brain back in at last ! No more sub- Top Gear telly #propertenniscoverage", 
"gone but never forgotten #TopGear ", "The final episode of 'Top Gear' with Jeremy Clarkson is going to break records http://brbr.co/1JCeJYc\312"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L), .Names = "text")

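From this data, I want to extract each hashtag (the # and the word that follows it) and assign the hashtags to columns. The code from the link above does the first part.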

test <- regmatches(hi$text, gregexpr("#(\\d|\\w)+", hi$text))

which gives me:

[[1]]
[1] "#Wimbledon2Day"        "#propertenniscoverage"

[[2]]
[1] "#TopGear"

[[3]]
character(0)

But when I try to inspect or export it, I get:

Error in data.frame(c("#Wimbledon2Day", "#propertenniscoverage"), "#TopGear",  : 
  arguments imply differing number of rows: 2, 1, 0

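If you have a large number of tweets and unique hashtags, you should consider using a sparse matrix. The arules package provides one such sparse matrix object, itemMatrix. You can coerce the list directly into this sparse matrix without writing out the unique and sapply steps from the other answer (which is a nice base R solution; I gave it +1).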

foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))  # extract hashtags from tweet (from other post)

library(arules)
im <- as(ms, "itemMatrix")

#you can retrieve the rows like this
as(im,"matrix")
#   #london2012 #London2012 #MullingarShuffle #NBC #Olympics #tech
# 1           0           1                 1    0         0     0
# 2           1           0                 0    0         0     0
# 3           1           0                 0    1         1     1
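
For readers who would rather not install arules, the same sparse representation can be sketched with the Matrix package, which ships with standard R installations. This is a swapped-in technique, not the itemMatrix approach above: the shortened tweets and the names tweets, tags, and sm are illustrative assumptions, and a hashtag repeated within one tweet would be counted twice rather than flagged once.

```r
library(Matrix)  # recommended package bundled with standard R installations

# Shortened, illustrative tweets (assumption: not the exact strings used above)
tweets <- c("proud to be at #London2012 and #MullingarShuffle",
            "Atos completes delivery of key IT systems #london2012",
            "The #Olympics sets a ratings record for #NBC #london2012 #tech",
            "a tweet without any hashtag")

# Same extraction step as above: one character vector of hashtags per tweet
ms <- regmatches(tweets, gregexpr("#(\\d|\\w)+", tweets))

# Column labels: every distinct hashtag
tags <- unique(unlist(ms))

# One row per tweet, one column per hashtag; only non-zero cells are stored
sm <- sparseMatrix(
  i = rep(seq_along(ms), lengths(ms)),  # row index: which tweet
  j = match(unlist(ms), tags),          # column index: which hashtag
  x = 1,                                # recycled for every (i, j) pair
  dims = c(length(ms), length(tags)),
  dimnames = list(NULL, tags)
)

sm
```

Tweets with no hashtag simply become all-zero rows, so the differing-lengths error from data.frame() never arises.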

Using the example from the linked post,

foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")

ms <- regmatches(foo, gregexpr("#(\\d|\\w)+", foo))  # extract hashtags from tweet (from other post)
cols <- unique(unlist(ms))                           # get unique hashtags

setNames(data.frame(t(sapply(ms, function(i) cols %in% i))), cols)

#   #London2012 #MullingarShuffle #london2012 #Olympics  #NBC #tech
# 1        TRUE              TRUE       FALSE     FALSE FALSE FALSE
# 2       FALSE             FALSE        TRUE     FALSE FALSE FALSE
# 3       FALSE             FALSE        TRUE      TRUE  TRUE  TRUE

The rows correspond to the tweets.
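
The question also asked for each hashtag of a tweet to land in its own column. A minimal base R sketch of that layout pads every element of the regmatches list with NA up to the longest length before binding the rows. The shortened tweet texts, the column names hashtag_1, hashtag_2, ... and the output file name are illustrative assumptions.

```r
# Shortened versions of the tweets from the question (assumption: abbreviated text)
hi <- data.frame(
  text = c("Hooray ! #Wimbledon2Day has plugged its brain back in #propertenniscoverage",
           "gone but never forgotten #TopGear ",
           "The final episode of 'Top Gear' is going to break records"),
  stringsAsFactors = FALSE
)

# List with one character vector of hashtags per tweet (possibly empty)
ms <- regmatches(hi$text, gregexpr("#(\\d|\\w)+", hi$text))

# Pad each vector with NA so that all of them have the same length
n <- max(lengths(ms))
padded <- lapply(ms, function(tags) c(tags, rep(NA_character_, n - length(tags))))

# Bind the padded vectors as rows: one row per tweet, one column per hashtag slot
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("hashtag_", seq_len(n))

wide

# Export once the shape looks right ("hashtags.csv" is a hypothetical file name)
# write.csv(wide, "hashtags.csv", row.names = FALSE)
```

Because every row now has the same number of entries, data.frame construction no longer fails with "arguments imply differing number of rows".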
