Top words used in a data frame column in R, plus a binary indicator column
I have a data frame like this:
df= data.frame(
text= c("test and run", "rest and sleep", "test", "test of course"),
id = c('a','b','c','d'))
#             text id
#1   test and run  a
#2 rest and sleep  b
#3           test  c
#4 test of course  d
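I want to:
Get the two most frequent words in the text column in a compact way (no loops): "test" (3 occurrences) and "and" (2 occurrences).
Create/add a binary column that flags which of those two words each row contains.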
topTextBinary
1, 1
0, 1
1, 0
1, 0
where the two flags stand for "test" and "and", in that order:
            text id topTextBinary
1   test and run  a          1, 1
2 rest and sleep  b          0, 1
3           test  c          1, 0
4 test of course  d          1, 0
Thanks.
R version (running in RStudio):
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.3
year 2017
month 11
day 30
svn rev 73796
language R
version.string R version 3.4.3 (2017-11-30)
nickname Kite-Eating Tree
We can do the following:
# Word frequency table
tbl <- table(unlist(strsplit(as.character(df$text), " ")))
# Top 2 words
top <- tbl[order(tbl, decreasing = TRUE)][1:2]
# Flag top-2 words per row (note: grepl matches substrings, not whole words)
library(tidyverse)
map(names(top), ~ df %>%
        mutate(!!.x := as.numeric(grepl(.x, text)))) %>%
    reduce(left_join)
#Joining, by = c("text", "id")
#             text id test and
#1   test and run  a    1   1
#2 rest and sleep  b    0   1
#3           test  c    1   0
#4 test of course  d    1   0
Or use unite to merge the entries from the two binary columns into a single column:
map(names(top), ~ df %>%
        mutate(!!.x := as.numeric(grepl(.x, text)))) %>%
    reduce(left_join) %>%
    unite(topTextBinary, -(1:2), sep = ", ")
#             text id topTextBinary
#1   test and run  a          1, 1
#2 rest and sleep  b          0, 1
#3           test  c          1, 0
#4 test of course  d          1, 0
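One caveat with the approach above: grepl(.x, text) matches substrings, so a top word such as "and" would also flag a row that contains "sandwich". The following is a minimal sketch of a whole-word variant, not part of the original answer. The helper name flag_word and the sample vector x are hypothetical, and the sketch assumes the word contains no regex metacharacters.

```r
# Hypothetical helper: returns 1 where `word` occurs as a whole word, else 0.
# Assumes `word` contains no regex metacharacters.
flag_word <- function(word, text) {
  pattern <- paste0("\\b", word, "\\b")  # \\b marks a word boundary
  as.integer(grepl(pattern, as.character(text), perl = TRUE))
}

x <- c("test and run", "sandwich", "rest and sleep")
as.integer(grepl("and", x))  # substring match: 1 1 1
flag_word("and", x)          # whole-word match: 1 0 1
```

In the pipeline above, flag_word(.x, text) could stand in for as.numeric(grepl(.x, text)) if whole-word matching is what you need.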
Using base R:
top2 <- names(sort(table(unlist(strsplit(as.character(df$text), "\\s"))), decreasing = TRUE))[1:2]
transform(df, m = paste(grepl(top2[1], text) + 0, grepl(top2[2], text) + 0, sep = ","))
            text id   m
1   test and run  a 1,1
2 rest and sleep  b 0,1
3           test  c 1,0
4 test of course  d 1,0
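If the goal is to use the top 3, 4, or even 10 words (with top2 extended to hold that many words), you might consider something like the following: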
transform(df, m = do.call(paste, c(sep = ",", data.frame(t(outer(top2, df$text, Vectorize(grepl)) + 0L)))))
            text id   m
1   test and run  a 1,1
2 rest and sleep  b 0,1
3           test  c 1,0
4 test of course  d 1,0
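The same idea can be wrapped in a small function that takes the number of top words as a parameter. This is a sketch under stated assumptions rather than the answer's own method: the name top_word_flags is hypothetical, words are assumed to be separated by whitespace, and rows are flagged by exact token membership instead of grepl. Ties at the cut-off are resolved by whatever order sort returns.

```r
# Hypothetical helper: find the n most frequent words in `text` and build,
# for each element, a comma-separated string of 0/1 flags (one per top word).
top_word_flags <- function(text, n = 2) {
  tokens <- strsplit(as.character(text), "\\s+")          # one word vector per row
  freq   <- sort(table(unlist(tokens)), decreasing = TRUE)
  top    <- names(freq)[seq_len(min(n, length(freq)))]
  flags  <- vapply(
    tokens,
    function(tk) paste(as.integer(top %in% tk), collapse = ", "),
    character(1)
  )
  list(top = top, flags = flags)
}

df <- data.frame(
  text = c("test and run", "rest and sleep", "test", "test of course"),
  id   = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
res <- top_word_flags(df$text, n = 2)
res$top                        # "test" "and"
df$topTextBinary <- res$flags  # "1, 1" "0, 1" "1, 0" "1, 0"
```

Changing n to 3, 4 or 10 needs no other edits, because the flag string is built from however many words end up in top.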