Creating a table in R with several columns with frequency of string matches over time
I have a data frame that looks like this:
Day          text              place  gender
Feb 20 2016  #geom and #stats  SP     M
Feb 20 2016  #geom and #stats  SP     F
Feb 20 2016  #bio and #stats   SP     M
I want to extract the hashtags from the "text" column and then use that information to build a summary table:
Day          Hashtag  Daily_Freq  %men  %women  Freq_UK  Freq_SP
Feb 20 2016  #stats   2           0.5   0.5     1        1
Feb 20 2016  #maths   1           1     0       1        0
Feb 20 2016  #geom    1           0     1       0        1
I have no idea how to do this! Can anyone help me?
options(stringsAsFactors = FALSE)
df = read.table(text = " Day text place gender
'Feb 20 2016' '#stats and #maths' UK M
'Feb 20 2016' '#geom and #stats' SP F",
                header = TRUE)
# extract tags: split on whitespace/commas, keep tokens that start with "#"
tags = lapply(strsplit(df$text, split = "[\\s,\\t]+", perl = TRUE),
              function(item) item[substr(item, 1, 1) == "#"])
# create list of data.frames
long_list1 = lapply(seq_len(NROW(df)), function(i) {
    data.frame(
        Day = df[["Day"]][i],
        Hashtag = tags[[i]],
        place = df[["place"]][i],
        gender = df[["gender"]][i]
    )
})
# long form - each hashtag on its own row
long = do.call(rbind, long_list1)
# compute list of data.frames with statistics, one per Day/Hashtag pair
long_list2 =
    lapply(
        # drop = TRUE skips Day/Hashtag combinations that never occur,
        # which would otherwise yield NA rows once there are several days
        split(long, list(long$Day, long$Hashtag), drop = TRUE),
        function(item){
            with(item, data.frame(
                Day = Day[1],
                Hashtag = Hashtag[1],
                Daily_Freq = NROW(item),
                '%men' = mean(gender == "M"),
                '%women' = mean(gender == "F"),
                Freq_UK = sum(place == "UK"),
                Freq_SP = sum(place == "SP"),
                check.names = FALSE
            ))
        })
# combine result
res = do.call(rbind, c(long_list2, make.row.names = FALSE))
res
#           Day Hashtag Daily_Freq %men %women Freq_UK Freq_SP
# 1 Feb 20 2016   #geom          1  0.0    1.0       0       1
# 2 Feb 20 2016  #maths          1  1.0    0.0       1       0
# 3 Feb 20 2016  #stats          2  0.5    0.5       1       1
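
As a possible variation on the extraction and long-form steps above, here is a minimal base R sketch, not part of the original answer. It assumes a hashtag is "#" followed by word characters (so punctuation next to a tag is ignored), and it counts places with table() instead of hard-coding "UK" and "SP". The names n_tags, key and place_counts are introduced here only for illustration.

```r
# Same sample data as in the answer above
df <- data.frame(
  Day    = c("Feb 20 2016", "Feb 20 2016"),
  text   = c("#stats and #maths", "#geom and #stats"),
  place  = c("UK", "SP"),
  gender = c("M", "F"),
  stringsAsFactors = FALSE
)

# Assumption: a hashtag is "#" followed by one or more word characters
tags <- regmatches(df$text, gregexpr("#\\w+", df$text))

# Long form: repeat each input row once per hashtag it contains
n_tags <- lengths(tags)
long <- data.frame(
  Day     = rep(df$Day, n_tags),
  Hashtag = unlist(tags),
  place   = rep(df$place, n_tags),
  gender  = rep(df$gender, n_tags),
  stringsAsFactors = FALSE
)

# Counts per Day/Hashtag pair for every place that appears in the data
key <- paste(long$Day, long$Hashtag)
place_counts <- table(key, long$place)
place_counts
```

Indexing by name, for example place_counts["Feb 20 2016 #stats", "UK"], gives the count for one hashtag and place; new places in the data become new columns without changes to the code.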