Why is dplyr's filter not working with white spaces in simple filter?

I'm new to learning R, and I'm trying to explore a movies dataset provided by the R for Data Science online community: https://github.com/rfordatascience/tidytuesday/blob/master/data/2018/2018-10-23/movie_profit.csv

As I learn more about dplyr's filter function, I noticed that I get no results when I search the "distributor" column for a value with a white space in it, like "Walt Disney" in the example below. Searching for values that do not have a white space, like "Universal", works just fine.

I've also tried other columns in the dataset, like "movie", where I filter for a specific movie title that has white spaces in it; when I do that I run into no issues, so I'm a bit puzzled.

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(movie == "Dawn of the Planet of the Apes") %>%
  View()

In short, I'd love to know the why behind this so that I know how to handle it if it comes up again in any future datasets that I explore. The code that is giving me trouble is below. I want to plot the genre distribution of movies from the distributor "Walt Disney".


movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

test <- "20th Century Fox"

movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()

There seems to be a mismatch in the string:

# the first string uses the non-breaking space found in the data, written as \u00a0
all.equal("Walt\u00a0Disney", "Walt Disney")
#[1] "1 string mismatch"

If we check the values:

unique(movies$distributor)[3]
#[1] "Walt Disney"

charToRaw(unique(movies$distributor)[3])
#[1] 57 61 6c 74 c2 a0 44 69 73 6e 65 79
charToRaw("Walt Disney")
#[1] 57 61 6c 74 20 44 69 73 6e 65 79

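The value in the data contains a non-breaking space (bytes c2 a0 in UTF-8, i.e. U+00A0) where the typed string has an ordinary space (byte 20), and that difference triggers the mismatch.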
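To check a whole column for this kind of character, something along these lines should work. It is only a sketch: `distributor` here is a small made-up vector standing in for `movies$distributor`, and the non-breaking space is written as the escape `\u00a0` so it stays visible in the code.

```r
# Hypothetical sample standing in for movies$distributor;
# "\u00a0" is a non-breaking space
distributor <- c("Walt\u00a0Disney", "Universal", "20th Century Fox")

# TRUE for every value that contains a non-breaking space
has_nbsp <- grepl("\u00a0", distributor, fixed = TRUE)
has_nbsp

# Unicode code points of the first value: 160 (U+00A0) shows up
# in the position where an ordinary space would be 32
code_points <- utf8ToInt(distributor[1])
code_points
```

On the real data, `unique(movies$distributor[grepl("\u00a0", movies$distributor, fixed = TRUE)])` would list the affected distributor names.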

It is better to copy the value from 'distributor' itself, or to match the whitespace with a regular expression:

library(dplyr)
library(ggplot2)
library(stringr)

movies %>%
  filter(str_detect(distributor, "Walt\\s+Disney")) %>%
  count(genre) %>%
  ggplot(aes(x = genre, y = n)) +
  geom_col()

The following uses agrepl for an approximate string match, and it gives the same graph as in akrun's answer.

movies %>% 
  filter(agrepl("Walt Disney", distributor)) %>%
  ggplot(aes(x = genre)) +
  geom_bar()
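One thing to keep in mind with the approximate match: agrepl looks for the pattern anywhere inside each value and tolerates a few edits, so it can also pick up longer or similar names. A small sketch with made-up values (these are not taken from the dataset):

```r
# Hypothetical values; only the first one is the distributor we want
distributor <- c("Walt\u00a0Disney", "Walt Disney Studios", "Universal")

# Approximate match with the default max.distance
matches <- agrepl("Walt Disney", distributor)
matches
```

Here the second value matches as well, because "Walt Disney" occurs inside it. If that matters for your data, it is worth checking `unique(distributor[matches])` before plotting.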

As mentioned in other posts, you have special characters instead of normal spaces in your table. You can replace them with regular spaces and your code should work as normal, without having to manually copy the strings. I have added trimws to remove any leading or trailing whitespace. Note that this also replaces other special characters, such as punctuation.

library(tidyverse)

movies <- read_csv(url("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")) 

# this line replaces non-alphanumeric characters with a space and removes any trailing whitespace at the end. 
movies$distributor <- trimws(gsub("[^[:alnum:]]", " ", movies$distributor))
movies %>%
  filter(distributor == "Walt Disney") %>%
  ggplot(aes(x = genre)) +
  geom_bar()
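If you would rather leave punctuation and other characters alone, a narrower variant is to replace only the non-breaking space. This is a sketch that assumes U+00A0 is the only odd character in the column; the vector below is made up for illustration.

```r
# Hypothetical sample standing in for movies$distributor
distributor <- c("Walt\u00a0Disney", "Warner Bros.", "Universal")

# Replace only the non-breaking space (U+00A0) with an ordinary space
cleaned <- gsub("\u00a0", " ", distributor, fixed = TRUE)

# The plain comparison now works, and "Warner Bros." keeps its period
is_disney <- cleaned == "Walt Disney"
cleaned
is_disney
```

Applied to the real data this would be `movies$distributor <- gsub("\u00a0", " ", movies$distributor, fixed = TRUE)`, after which the original `filter(distributor == "Walt Disney")` should behave as expected.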
