Why doesn't dplyr's filter work with white space in a simple filter?

Question

I'm new to learning R, and I'm trying to explore a movie dataset provided by the R for Data Science online community: https://github.com/rfordatascience/tidytuesday/blob/master/data/2018/2018-10-23/movie_profit.csv

As I learn more about dplyr's filter function, I noticed that I get no results when I filter the "distributor" column for a value that contains white space, like "Walt Disney" in the example below. Filtering for values without white space, like "Universal", works just fine.

I've also tried other columns in the dataset, like "movie", where I filter for a specific title that contains white space. That works without any issues, so I'm a bit puzzled.

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(movie == "Dawn of the Planet of the Apes") %>%
  View()

In short, I'd love to know the why behind this so that I know how to handle it if it comes up again in any future datasets that I explore. The code that is giving me trouble is below. I want to plot the genre distribution of movies from the distributor "Walt Disney".


movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()

Answer 1

There seems to be a mismatch in the string:

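# the second string is the value as stored in the data, written with an explicit no-break space (U+00A0)
all.equal("Walt Disney", "Walt\u00a0Disney")
#[1] "1 string mismatch"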

If we check the values:

unique(movies$distributor)[3]
#[1] "Walt Disney"

charToRaw(unique(movies$distributor)[3])
#[1] 57 61 6c 74 c2 a0 44 69 73 6e 65 79
charToRaw("Walt Disney")
#[1] 57 61 6c 74 20 44 69 73 6e 65 79

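The value in the data contains the bytes c2 a0 (the UTF-8 encoding of a no-break space, U+00A0) where a typed string has a regular space (20), and that difference triggers the mismatch.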
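A minimal sketch of how such a hidden character could be spotted in a column, using base R only. The `distributor` vector here is a made-up stand-in for `movies$distributor`, and the names `has_nbsp` and `code_points` are illustrative.

```r
# Hypothetical sample standing in for movies$distributor.
distributor <- c("Walt\u00a0Disney", "Walt Disney", "Universal")

# TRUE where a value contains a no-break space (U+00A0).
has_nbsp <- grepl("\u00a0", distributor, fixed = TRUE)

# Unicode code points of the first value: 160 is U+00A0,
# whereas a regular space would show up as 32.
code_points <- utf8ToInt(distributor[1])

print(has_nbsp)
print(code_points)
```

Checking code points with `utf8ToInt` is often easier to read than raw bytes, because a multi-byte character shows up as a single number.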

It is better to copy the value from 'distributor', or to match the whitespace with a regular expression:

library(dplyr)
library(ggplot2)
library(stringr)

movies %>%
  filter(str_detect(distributor, "Walt\\s+Disney")) %>%
  count(genre) %>%
  ggplot(aes(x = genre, y = n)) +
  geom_col()

Output: a bar chart of the number of Walt Disney movies per genre.

Answer 2

The following uses agrepl for an approximate string match, and it gives the same graph as in akrun's answer.

movies %>% 
  filter(agrepl("Walt Disney", distributor)) %>%
  ggplot(aes(x = genre)) +
  geom_bar()
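One thing to keep in mind is that approximate matching is looser than an exact comparison, so it can pick up values that were not intended. The sketch below uses a made-up vector (including the hypothetical misspelling "Walt Dizney") to compare `==` with `agrepl` at its default settings.

```r
# Hypothetical values; only the first two mirror the situation in the question.
distributor <- c("Walt\u00a0Disney", "Walt Disney", "Walt Dizney", "Universal")

# Exact comparison: only the value with a regular space matches.
exact <- distributor == "Walt Disney"

# Approximate comparison: the no-break-space variant matches,
# and so does the one-letter misspelling.
approx <- agrepl("Walt Disney", distributor)

print(exact)
print(approx)
```

If that looseness is a concern, the `max.distance` argument of `agrepl` can be lowered, or the data can be cleaned first and compared exactly.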

Answer 3

As mentioned in other posts, you have special characters instead of normal spaces in your tables. You can replace them with regular spaces and your code should work as normal, without having to manually copy the strings. I have added trimws to remove any leading or trailing whitespace. Note that this also replaces other special characters, such as punctuation, with spaces.

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

# this line replaces every non-alphanumeric character with a space, then trims leading and trailing whitespace
movies$distributor <- trimws(gsub("[^[:alnum:]]", " ", movies$distributor))
movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()
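If replacing every non-alphanumeric character is too aggressive (it would also turn hyphens and other punctuation into spaces), a narrower variant is to replace only the no-break space. This is a sketch on a made-up vector, not on the real column; the values and the name `cleaned` are assumptions for illustration.

```r
# Hypothetical sample values; the last one has a hyphenated name
# and a trailing regular space.
distributor <- c("Walt\u00a0Disney",
                 "20th\u00a0Century\u00a0Fox",
                 "Metro-Goldwyn-Mayer ")

# Replace only U+00A0 with a regular space, then trim the ends.
cleaned <- trimws(gsub("\u00a0", " ", distributor, fixed = TRUE))

print(cleaned)
print(cleaned == "Walt Disney")
```

With this variant the hyphens survive, while the exact comparison against "Walt Disney" now succeeds for the first value.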

Answer summary

Answer 1: score 2, 2020-06-06 19:00:27
Answer 2: score 2, 2020-06-06 19:14:31
Answer 3: score 1, accepted, 2020-06-06 19:21:00
Answer 4: score 0, 2020-06-06 19:08:48
