Remove non-English observations from a data frame

Question

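I have a data frame containing Twitter data. I have cleaned the tweet text and added it as the vector clean_text, but many of the observations are in languages other than English, which skews my text analysis. How can I remove all non-English observations from the data frame?

Here is a reproducible example of my data, BrexitTweets.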

structure(list(`Tweet ID` = c(746280472381107968, 746280472355929984, 
746280472154603008, 746280472129342976, 746280472083332992, 746280472037170944, 
746280471831645952, 746280471814888960, 746280471777185024, 746280471756180992, 
746280471743565056, 746280471705844992, 746280471680658944, 746280471676488960, 
746280471676455936, 746280471617757056, 746280471613570944, 746280471600992000, 
746280471525469952, 746280471403847040), Time = c("24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04"), `Tweet Type` = c("Tweet", "Retweet", 
"Retweet", "Retweet", "Retweet", "Retweet", "Tweet", "Retweet", 
"Tweet", "Retweet", "Tweet", "Tweet", "Retweet", "Tweet", "Retweet", 
"Retweet", "Retweet", "Tweet", "Retweet", "Retweet"), `Retweeted By` = c(NA, 
"misyed_", "Skuys", "priyadarshibbc", "Amaranta_2012", "ECCA_Nordic", 
NA, "Dat_Sync", NA, "SirDeGuz", NA, NA, "RoGreca_", NA, "30SecondsToMoon", 
"StuartGray", "DataDebate", NA, "alek_dev", "addi_GrBj"), `Number of Retweets` = c(0, 
251, 4, 14, 2, 39, 0, 6462, 0, 1391, 0, 0, 31595, 0, 27, 15, 
35, 0, 6462, 20521), `Number of Followers` = c(6079, 434717, 
16036, 345319, 4566, 3223810, 109145, 560, 78, 1957, 766, 1299, 
2155087, 235, 1925, 735, 8045, 159, 560, 128027), `Number Following` = c(2314, 
1994, 12403, 344855, 1012, 765, 333, 236, 132, 1407, 294, 1381, 
1, 338, 725, 1601, 831, 969, 236, 1606), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"gana el brexit reino unido decide abandonar la unión europea httpstco66cwudtsxu vía elmundoes", 
"uk prime minister set to resign brexit httpstco0bxbdmiswm", 
"oye junckereu que dice la ciudadanía de uk que tus tratados se los pasan por sus urnas brexit httpstcoedqfkl", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"brexit httpstconwutx2owcs", "brexit primer anàlisi de les conseqüencies en món de lesport httpstcon3bdrqz5cf via iusport unioesports", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"es ist nicht immer klug das volk entscheiden zu lassen brexit", 
"gli studenti europei verranno considerati extraeuropei e rimarranno senza assistenza sanitaria assurdo brexit", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"brexit einfach erklärt httpstcou7jhlhrpim", "brexit httpstcoiive3hsj26", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw"
)), .Names = c("Tweet ID", "Time", "Tweet Type", "Retweeted By", 
"Number of Retweets", "Number of Followers", "Number Following", 
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl", 
"data.frame"))

Answer 1

Check out the textcat package:

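# install.packages("textcat")  # run once if the package is not installed
library(textcat)
library(dplyr)

# Guess the language of each cleaned tweet
BrexitTweets$Languages <- textcat(BrexitTweets$clean_text)
# Keep only the rows classified as English
BrexitTweets <- BrexitTweets %>% filter(Languages == "english")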

Answer 2

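You can try using grepl to identify the rows whose text contains non-English characters; negate the condition with ! to drop those rows instead: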

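# Rows whose text has a character outside ASCII letters, digits, punctuation and whitespace
nonmatch <- BrexitTweets[grepl("[^A-Za-z0-9[:punct:][:space:]]", BrexitTweets$clean_text), ]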
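
To actually remove the rows rather than just list them, the same pattern can be negated. The following is a minimal base R sketch on a toy data frame; the names tweets, has_non_english_chars and english_like are made up for illustration, and the three texts are shortened from the question's sample data. It also shows a limitation of the character-based approach: a German tweet written entirely with ASCII letters is not caught.

```r
# Toy data frame; the clean_text column mirrors the question's data.
tweets <- data.frame(
  clean_text = c(
    "uk prime minister set to resign brexit",
    "brexit einfach erkl\u00e4rt",
    "es ist nicht immer klug das volk entscheiden zu lassen brexit"
  ),
  stringsAsFactors = FALSE
)

# TRUE where the text contains a character outside ASCII letters, digits,
# punctuation and whitespace (for example the a-umlaut in the second row).
has_non_english_chars <- grepl("[^A-Za-z0-9[:punct:][:space:]]", tweets$clean_text)

# Keep only the rows without such characters.
english_like <- tweets[!has_non_english_chars, , drop = FALSE]

print(english_like)
```

The second row is dropped, but the third (German, ASCII only) survives, so this filter is best treated as a cheap first pass and not as real language detection.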

Answer 1: score 2, accepted, 2018-12-10 09:20:14
Answer 2: score 0, 2018-12-10 09:15:14
