Assigned value to list of a converted Date object differs from that of a returned value in function

I have a function which returns numerous attributes of an athlete, one of them being their birth date, through web scraping off the official IAAF athletics page. I've modified it slightly for the purposes of this question:

upscope_list <- list()

library(xml2)
library(tidyverse)
library(stringi)
library(rvest)

scrape_function_mod <- function(athlete_name) {
  
  starting_name <- stri_trans_general(athlete_name, "latin-ascii")
  
  initial_url <-
    paste0("https://www.iaaf.org/athletes/search?query=", starting_name)
  initial_search_page <- read_html(initial_url)
  
  rawnodes_text <-
    initial_search_page %>% html_nodes("table td") %>% html_text(trim = T) %>% stri_trans_general("latin-ascii")
  
  name_split <- as_vector(strsplit(starting_name, " ", fixed = T))
  number <- which(sapply(rawnodes_text, function(x)
    grepl(name_split[1], x, ignore.case = T) &
      grepl(name_split[length(name_split)], x, ignore.case = T)))
  
  upscope_list[[athlete_name]][["birth_date"]] <<-
    rawnodes_text[(number + 4)] %>% as.Date("%d %B %Y")
  
  return(rawnodes_text[(number + 4)] %>% as.Date("%d %B %Y"))
  
}

Most of the function code isn't that important except for the last two lines. If I run:

> scrape_function_mod("Ashton Eaton")
[1] "1988-01-21"

This returns a proper Date object with the athlete's birth date. However, the value I insert into the list created at the start is different: it is a four-digit number that I can't make sense of.

> upscope_list[["Ashton Eaton"]][["birth_date"]]
[1] 6594

What I assign to the list and what I return should be identical, but they are not. Any tips to get it to convert properly inside the function?

As mentioned in the comments, your date is converted to numeric:

> as.numeric(as.Date("1988-01-21"))
[1] 6594
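To make sense of the number: it is the count of days since 1970-01-01, so it can be turned back into a date. A minimal sketch using only base R (the variable name `recovered` is just for illustration):

```r
# 6594 is the number of days between 1970-01-01 and the birth date
stored_value <- as.numeric(as.Date("1988-01-21"))

# Convert the day count back into a Date by supplying the origin
recovered <- as.Date(stored_value, origin = "1970-01-01")
print(recovered)
# [1] "1988-01-21"
```

This only recovers the value after the fact; it does not stop the class from being dropped on assignment.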

This is a known issue. It is best demonstrated with Hadley's example from https://github.com/tidyverse/purrr/issues/358#issuecomment-363091446:

> x <- list(as.Date("1988-01-21"))
> x[[1]]
[1] "1988-01-21"
> x[[c(1, 1)]]
[1] 6594

As you can see in the thread, the issue was fixed for pmap in purrr. You could switch to that package, or maybe you are fine assigning the variable in a different way?
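One "different way" might be to build the athlete's sub-list in a single step instead of indexing two levels deep. This is only a sketch: the date is hard-coded here rather than scraped, and it assumes the athlete has no other attributes stored yet, since the whole entry is replaced.

```r
birth_date <- as.Date("1988-01-21")  # stand-in for the scraped value

upscope_list <- list()

# Assign a complete list for the athlete; list() keeps the Date class intact
upscope_list[["Ashton Eaton"]] <- list(birth_date = birth_date)

print(class(upscope_list[["Ashton Eaton"]][["birth_date"]]))
# [1] "Date"
```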

To understand the exact issue, let's take a simple example.

Consider a list abc and assign a Date object to it under the name var1.

Case 1:

abc <- list()
abc[["var1"]] <- Sys.Date()
abc
#$var1
#[1] "2019-11-28"

This works as expected. Even the class is "Date":

class(abc$var1)
#[1] "Date"

Now let's go one level deeper.

Case 2:

abc <- list()
abc[["var1"]][["var2"]] <- Sys.Date()
abc
#$var1
# var2 
#18228 

Here, the date is converted into a number. We all know that dates are stored internally as numbers, but why did it work in the first case and not in the second? Let's take another example.

Case 3:

abc <- list()
abc[["var1"]] <- list()
abc[["var1"]][["var2"]] <- Sys.Date()
abc
#$var1
#$var1$var2
#[1] "2019-11-28"

Ahh, now the pattern is starting to show. From the cases above, it looks like the parent element must already be defined in the list for the element to keep its Date class; otherwise the date is coerced to a number. But now let me make it more confusing/interesting/complicated.

Case 4:

abc <- list()
abc[["var1"]][["var2"]] <- c(Sys.Date(), Sys.Date())
abc
#$var1
#$var1$var2
#[1] "2019-11-28" "2019-11-28"

And now this works too. Here we didn't define var1 earlier, but it still kept the Date class of abc$var1$var2. How? Why?

The answer to everything above is documented in ?Extract:

When $<- is applied to a NULL x, it first coerces x to list(). This is what also happens with [[<- if the replacement value value is of length greater than one: if value has length 1 or 0, x is first coerced to a zero-length vector of the type of value.

To summarise: when abc$var1 is NULL and you assign a value of length greater than 1, x is first coerced to a list and then the value is assigned. With a value of length 1 or 0, x instead becomes an atomic vector of the value's type, which is why the Date class is lost. If you check class(abc$var1), it is "numeric" in Case 2, whereas in Case 3 and Case 4 it is "list". In Case 3 it is a list because abc$var1 is not NULL; in Case 4 abc$var1 is NULL, but the assigned value has length greater than 1.
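The class checks mentioned above can be run side by side. This is a sketch with a fixed date instead of Sys.Date(); the names case2, case3 and case4 are only for illustration. The "numeric" result for Case 2 assumes an R version whose ?Extract contains the passage quoted above, so no output is shown for it.

```r
d <- as.Date("2019-11-28")

# Case 2: parent is NULL and the replacement value has length 1
case2 <- list()
case2[["var1"]][["var2"]] <- d

# Case 3: parent is created as a list before the nested assignment
case3 <- list()
case3[["var1"]] <- list()
case3[["var1"]][["var2"]] <- d

# Case 4: parent is NULL but the replacement value has length 2
case4 <- list()
case4[["var1"]][["var2"]] <- c(d, d)

print(class(case2$var1))  # "numeric" under the rule quoted from ?Extract
print(class(case3$var1))  # "list"
print(class(case4$var1))  # "list"
```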

Although the answer is lengthy, I hope this explanation was helpful and easy to understand. Special thanks to @Roland, who pointed me to the relevant help page.


As an aside to the OP, regarding:

Any tips to get it to convert properly inside the function?

Using <<- is usually not good practice, and there are a lot of discussions about that available on the internet. But to solve the current problem, you can add one line to your function:

library(xml2)
library(tidyverse)
library(stringi)
library(rvest)

upscope_list <- list()
scrap_function_mod <- function(athlete_name) {

  starting_name <- stri_trans_general(athlete_name, "latin-ascii")

  initial_url <- paste0("https://www.iaaf.org/athletes/search?query=", starting_name)
  initial_search_page <- read_html(initial_url)

  rawnodes_text <-
    initial_search_page %>% html_nodes("table td") %>% html_text(trim = T) %>% stri_trans_general("latin-ascii")

  name_split <- as_vector(strsplit(starting_name, " ", fixed = T))
  number <- which(sapply(rawnodes_text, function(x)
      grepl(name_split[1], x, ignore.case = T) &
      grepl(name_split[length(name_split)], x, ignore.case = T)))
  upscope_list[[athlete_name]] <<- list() # Added line: create the parent element as a list first
  upscope_list[[athlete_name]][["birth_date"]] <<-
      rawnodes_text[(number + 4)] %>% as.Date("%d %B %Y")

  return(rawnodes_text[(number + 4)] %>% as.Date("%d %B %Y"))

} 

Now when you call the function, you get:

scrap_function_mod("Ashton Eaton")
#[1] "1988-01-21"

upscope_list[["Ashton Eaton"]][["birth_date"]]
#[1] "1988-01-21"
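Since <<- is discouraged above, here is a sketch of the same idea without it: the list is passed in and the updated list is returned. The helper name add_birth_date and its arguments are hypothetical, and the date is hard-coded rather than scraped. Unlike the added line above, it creates the parent element only when it is missing, so attributes already stored for that athlete would be kept.

```r
# Hypothetical helper: returns an updated copy of `store` instead of using <<-
add_birth_date <- function(store, athlete_name, birth_date) {
  # Create the parent element as a list only if it does not exist yet
  if (is.null(store[[athlete_name]])) {
    store[[athlete_name]] <- list()
  }
  store[[athlete_name]][["birth_date"]] <- birth_date
  store
}

athletes <- list()
athletes <- add_birth_date(athletes, "Ashton Eaton", as.Date("1988-01-21"))

print(athletes[["Ashton Eaton"]][["birth_date"]])
# [1] "1988-01-21"
```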
