How to find search words from a table, in another table, and then create new columns of the results?
I'm trying to find specific words, listed in a tibble arbeit, in the Text column of another tibble, rawEng$Text. If one or more words are found, I want to create (or mutate) a new data frame iDataArbeit with two new columns: wArbeit for the word(s) found, and iArbeit for the sum of their tf-idf scores from arbeit$tfidf.
My data:
arbeit:
X1 feature tfidf
<dbl> <chr> <dbl>
1 0 sick 0.338
2 2 contract 0.188
3 3 pay 0.175
4 4 job 0.170
5 5 boss 0.169
6 6 sozialversicherungsnummer 0.169
rawEng:
Gender Gruppe Datum Text
<chr> <chr> <dttm> <chr>
1 F Berlin Expats 2017-07-07 00:00:00 Anyone out there who's had to apply for Führung~
2 F FAB 2018-01-18 00:00:00 Dear FAB, I am in need of a Führungszeugnis no ~
3 M Free Advice ~ 2017-01-30 00:00:00 Dear Friends, i would like to ask you how can I~
4 M FAB 2018-04-12 00:00:00 "Does anyone know why the \"Standesamt Pankow (~
5 F Berlin Expats 2018-11-12 00:00:00 having trouble finding consistent information a~
6 F Toytown Berl~ 2017-06-08 00:00:00 "Hello\r\n\r\nI have a question regarding Airbn~
I've tried with dplyr::mutate, using this code:
idataEnArbeit <- mutate(rawEng, wArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$feature, NA),
iArbeit = ifelse((str_count(rawEng$Text, arbeit$feature))>=1,
arbeit$tfidf, NA))
But all I get is a single word and its tf-idf score in the new columns iDataArbeit$wArbeit and iDataArbeit$iArbeit:
Gender Gruppe Datum Text wArbeit iArbeit
<chr> <chr> <dttm> <chr> <chr> <dbl>
1 F Berlin | Girl ~ 2018-09-11 13:22:05 "11 septembre, 13:21 GGI ~ sick 0.338
2 F ExpatBabies Be~ 2017-10-19 16:24:23 "16:24 Babysitter needed! B~ sick 0.338
3 F Berlin | Girl ~ 2018-06-22 18:24:19 "gepostet. Leonor Valen~ sick 0.338
4 F 'Neu in Berlin' 2018-09-18 23:19:51 "Hello guys, I am working wit~ sick 0.338
5 M Free Advice Be~ 2018-04-27 08:49:24 "In need of legal advice: Wha~ sick 0.338
6 F Free Advice Be~ 2018-07-04 18:33:03 "Is there somebody I can pay ~ sick 0.338
In summary: I want all words from arbeit$feature that are found in rawEng$Text to be added to iDataArbeit$wArbeit, and the sum of their tf-idf scores to be added to iDataArbeit$iArbeit.
Since I don't have your data, I'll load the gutenbergr library and play with Treasure Island.
library(dplyr)      # provides %>%, pull, left_join, filter, group_by, summarize, mutate
library(tidytext)
library(gutenbergr)
## Now get the dataset
Treasure_Island <- gutenberg_works(title == "Treasure Island") %>% pull(gutenberg_id) %>%
gutenberg_download(.)
## and construct a toy arbeit:
arbeit <- data.frame(feature = c("island", "treasure", "to"),
                     tfidf = c(0.3, 0.5, 0.6),
                     stringsAsFactors = FALSE)  # keep feature as character for the join (matters in R < 4.0)
## Break each line of text into its words, one row per word, keeping the original text column
## (head() just keeps the example short; omit it on real data)
tidy_treasure <- unnest_tokens(Treasure_Island, feature, text, drop = FALSE) %>%
head(500)
## now bring the tfidf into tidy_treasure
df <- left_join(tidy_treasure, arbeit, by = "feature")
## and now you can sum the scores per line of text.
## To list the words, first throw out the words that don't contribute to the tfidf.
## Two options:
## 1. One row per line of text, with the summed score and the matched words:
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>%
  summarize(SumTFIDF = sum(tfidf),
            Words = paste(feature, collapse = ";"))
## 2. If you want to keep a row for each found word, use mutate instead of summarize;
##    every row then carries the total for its line of text.
df %>% filter(!is.na(tfidf)) %>% group_by(text) %>% mutate(SumTFIDF = sum(tfidf))
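Applied to tables shaped like the ones in the question, the same result can also be sketched without tokenizing, by testing each text against every feature. This is only a minimal sketch under stated assumptions: the two tibbles below are made-up stand-ins for arbeit and rawEng, match_features is a hypothetical helper (not part of any package), and it assumes the features contain no regex metacharacters and should match as whole, lower-case words.

```r
library(dplyr)
library(stringr)

# Made-up stand-ins for the asker's tables (illustrative values only)
arbeit <- tibble(feature = c("sick", "contract", "pay"),
                 tfidf   = c(0.338, 0.188, 0.175))
rawEng <- tibble(Text = c("I was sick and my contract ended",
                          "Can somebody pay me?",
                          "Nothing relevant here"))

# Hypothetical helper: indices of the features that occur in one text as whole words.
# str_detect() recycles the single text against the vector of patterns.
match_features <- function(text, features) {
  which(str_detect(str_to_lower(text), paste0("\\b", features, "\\b")))
}

# One integer vector of matching feature indices per row of rawEng
hits <- lapply(rawEng$Text, match_features, features = arbeit$feature)

iDataArbeit <- rawEng %>%
  mutate(
    # all matched words, separated by ";" (NA when nothing matched)
    wArbeit = vapply(hits, function(i) {
      if (length(i) > 0) paste(arbeit$feature[i], collapse = ";") else NA_character_
    }, character(1)),
    # sum of the tf-idf scores of the matched words (NA when nothing matched)
    iArbeit = vapply(hits, function(i) {
      if (length(i) > 0) sum(arbeit$tfidf[i]) else NA_real_
    }, numeric(1))
  )
```

Unlike the tokenizing approach, this counts each feature at most once per text, so a word that appears several times in the same text contributes its score only once. If repeated occurrences should count, the left_join on tokens above is the better fit.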