
Creating a table in R with several columns with frequency of string matches over time

I have a data frame that looks something like this:

Day           text               place   gender
Feb 20 2016   #geom and #stats   SP      M
Feb 20 2016   #geom and #stats   SP      F
Feb 20 2016   #bio and #stats    SP      M

I want to extract the hashtags from "text" and then build a summary table with this information:

Day           Hashtag   Daily_Freq   %men   %women   Freq_UK   Freq_SP
Feb 20 2016   #stats    2            0.5    0.5      1         1
Feb 20 2016   #maths    1            1      0        1         0
Feb 20 2016   #geom     1            0      1        0         1

I don't have a clue how to do this! Can anyone help me?

options(stringsAsFactors = FALSE)
df = read.table(text = " Day           text            place           gender
                        'Feb 20 2016'   '#stats and #maths'      UK          M
                        'Feb 20 2016'   '#geom and #stats'       SP          F", 
                 header = TRUE)

# extract tags
tags =  lapply(strsplit(df$text, split = "[\\s,\\t]+", perl = TRUE), 
           function(item) item[substr(item, 1, 1)=="#"])

# create list of data.frames 
long_list1 = lapply(seq_len(NROW(df)), function(i) {
    data.frame(
        Day = df[["Day"]][i],
        Hashtag = tags[[i]],
        place = df[["place"]][i],
        gender = df[["gender"]][i]
    )
})

# long form - each hashtag on its own row
long = do.call(rbind, long_list1)

# compute list of data.frames with statistics 
long_list2 = 
        lapply(
            # drop = TRUE skips Day/Hashtag combinations that never occur together
            split(long, list(long$Day, long$Hashtag), drop = TRUE), 
            function(item){
                with(item, data.frame(
                    Day = Day[1],
                    Hashtag = Hashtag[1],
                    Daily_Freq  = NROW(item), 
                    '%men' = mean(gender == "M"),   
                    '%women' = mean(gender == "F"),   
                    Freq_UK  = sum(place == "UK"), 
                    Freq_SP = sum(place == "SP"),
                    check.names = FALSE

                ))
            })

# combine result
res = do.call(rbind, c(long_list2, make.row.names = FALSE))
res
# 
#         Day Hashtag Daily_Freq %men %women Freq_UK Freq_SP
# 1 Feb 20 2016   #geom          1  0.0    1.0       0       1
# 2 Feb 20 2016  #maths          1  1.0    0.0       1       0
# 3 Feb 20 2016  #stats          2  0.5    0.5       1       1
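
As a possible alternative for the tag-extraction and long-form steps above, here is a minimal sketch that pulls hashtags out with a regular expression instead of splitting on whitespace. It assumes a hashtag is "#" followed by letters, digits, or underscores, which may not match every real-world tag. The data frame is rebuilt by hand so the block runs on its own, and the name `counts` is just illustrative.

```r
# Same toy data as above, built directly so the example is self-contained.
df <- data.frame(
  Day    = c("Feb 20 2016", "Feb 20 2016"),
  text   = c("#stats and #maths", "#geom and #stats"),
  place  = c("UK", "SP"),
  gender = c("M", "F"),
  stringsAsFactors = FALSE
)

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
# gregexpr() finds every match per string; regmatches() returns them as a list.
tags <- regmatches(df$text, gregexpr("#[[:alnum:]_]+", df$text))

# Long form: repeat each original row once per hashtag it contains.
long <- df[rep(seq_len(nrow(df)), lengths(tags)), c("Day", "place", "gender")]
long$Hashtag <- unlist(tags)
rownames(long) <- NULL

# Cross-tabulate hashtag occurrences by place (rows: hashtags, columns: places).
counts <- table(long$Hashtag, long$place)
counts
```

The `long` data frame built this way has the same columns as the one produced by `do.call(rbind, long_list1)`, so it should be usable with the per-group statistics step unchanged.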
