
Extracting date information from a Wikipedia table in R

I have scraped the tables on the following Wikipedia page using the XML package:

http://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads

On the webpage, the dob (date of birth) column looks like this: 4 January 1985 (aged 29)

In my R data frame, the same value reads in as: (1985-01-04)4 January 1985 (aged 29)

R treats this column in the scraped data as a factor, not a date.

I am trying to create a variable that holds just the dob in YYYY-MM-DD format, but I am having trouble reformatting the 'dob' variable.

I've tried the following without success (my dataframe is called alpha):

alpha$newvar <- as.Date(alpha$dob, "%Y%m%d")
alpha$newvar <- strptime(alpha$dob,format="%Y%m%d")

Here are sample data for the South Korean squad:

structure(list(no = structure(c(1L, 12L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L, 
14L, 15L, 16L), .Label = c("1", "10", "11", "12", "13", "14", 
"15", "16", "17", "18", "19", "2", "20", "21", "22", "23", "3", 
"4", "5", "6", "7", "8", "9"), class = "factor"), pos = structure(c(1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 2L, 1L, 2L, 1L), .Label = c("1GK", "2DF", "3MF", "4FW"
), class = "factor"), player = structure(c(6L, 9L, 23L, 14L, 
12L, 4L, 8L, 1L, 22L, 19L, 17L, 18L, 13L, 2L, 20L, 7L, 16L, 11L, 
5L, 3L, 10L, 21L, 15L), .Label = c("Ha Dae-sung", "Han Kook-young", 
"Hong Jeong-ho", "Hwang Seok-ho", "Ji Dong-won", "Jung Sung-ryong", 
"Ki Sung-yueng", "Kim Bo-kyung", "Kim Chang-soo", "Kim Seung-gyu", 
"Kim Shin-wook", "Kim Young-gwon", "Koo Ja-cheol (c)", "Kwak Tae-hwi", 
"Lee Bum-young", "Lee Chung-yong", "Lee Keun-ho", "Lee Yong", 
"Park Chu-young", "Park Jong-woo", "Park Joo-ho[67]", "Son Heung-min", 
"Yun Suk-young"), class = "factor"), dob = structure(c(2L, 6L, 
18L, 1L, 19L, 15L, 17L, 3L, 23L, 5L, 4L, 7L, 12L, 20L, 13L, 11L, 
10L, 9L, 22L, 16L, 21L, 8L, 14L), .Label = c("(1981-07-08)8 July 1981 (aged 32)", 
"(1985-01-04)4 January 1985 (aged 29)", "(1985-03-02)2 March 1985 (aged 29)", 
"(1985-04-11)11 April 1985 (aged 29)", "(1985-07-10)10 July 1985 (aged 28)", 
"(1985-09-12)12 September 1985 (aged 28)", "(1986-12-24)24 December 1986 (aged 27)", 
"(1987-01-16)16 January 1987 (aged 27)", "(1988-04-14)14 April 1988 (aged 26)", 
"(1988-07-02)2 July 1988 (aged 25)", "(1989-01-24)24 January 1989 (aged 25)", 
"(1989-02-27)27 February 1989 (aged 25)", "(1989-03-10)10 March 1989 (aged 25)", 
"(1989-04-02)2 April 1989 (aged 25)", "(1989-06-27)27 June 1989 (aged 24)", 
"(1989-08-12)12 August 1989 (aged 24)", "(1989-10-06)6 October 1989 (aged 24)", 
"(1990-02-13)13 February 1990 (aged 24)", "(1990-02-27)27 February 1990 (aged 24)", 
"(1990-04-19)19 April 1990 (aged 24)", "(1990-09-30)30 September 1990 (aged 23)", 
"(1991-05-28)28 May 1991 (aged 23)", "(1992-07-08)8 July 1992 (aged 21)"
 ), class = "factor"), caps = structure(c(17L, 20L, 13L, 11L, 
6L, 10L, 9L, 4L, 7L, 19L, 18L, 3L, 12L, 2L, 2L, 16L, 15L, 8L, 
9L, 7L, 14L, 5L, 1L), .Label = c("0", "10", "12", "13", "14", 
"21", "25", "27", "28", "3", "35", "37", "4", "5", "55", "58", 
"61", "63", "64", "9"), class = "factor"), club = structure(c(16L, 
10L, 12L, 1L, 8L, 13L, 6L, 3L, 2L, 18L, 14L, 17L, 11L, 10L, 9L, 
15L, 4L, 17L, 7L, 7L, 17L, 11L, 5L), .Label = c("Al-Hilal", "Bayer Leverkusen", 
"Beijing Guoan", "Bolton Wanderers", "Busan IPark", "Cardiff City", 
"FC Augsburg", "Guangzhou Evergrande", "Guangzhou R&F", "Kashiwa Reysol", 
"Mainz 05", "Queens Park Rangers", "Sanfrecce Hiroshima", "Sangju Sangmu", 
"Sunderland", "Suwon Bluewings", "Ulsan Hyundai", "Watford"), class = "factor")), .Names = c("no", 
"pos", "player", "dob", "caps", "club"), row.names = c(NA, -23L
), class = "data.frame")

I can answer my own question. The format string has to account for the parentheses around the date: each value starts with "(" and the date parts are separated by hyphens, so a format like "%Y%m%d" never matches and R returns NA.

So:

as.character(strptime(alpha$dob, format = "(%Y-%m-%d)"))

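With "(%Y-%m-%d)" as the format, the literal parentheses and hyphens match the start of each string. strptime ignores any trailing characters once the format has been matched, so the "4 January 1985 (aged 29)" part is dropped, and as.character() converts the parsed result to a character string.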
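
If a proper Date column is wanted rather than a character string, as.Date() accepts the same format. The following is only a minimal sketch: the vector dob and the names failed, fixed, and fixed2 are made up for illustration, with two values copied from the sample data above. It also shows one possible alternative that cuts out the ISO date by position before parsing.

```r
# Two values copied from the dput() output, stored as a factor like alpha$dob
dob <- factor(c("(1985-01-04)4 January 1985 (aged 29)",
                "(1992-07-08)8 July 1992 (aged 21)"))

# The original attempt: "%Y%m%d" cannot match a string that starts with "(",
# so every element comes back as NA
failed <- as.Date(dob, "%Y%m%d")

# Literal parentheses and hyphens in the format line up with the data;
# everything after the closing ")" is ignored
fixed <- as.Date(dob, format = "(%Y-%m-%d)")

# Alternative (assumes the ISO date always sits in characters 2 to 11)
fixed2 <- as.Date(substr(as.character(dob), 2, 11))

print(fixed)
```

On the real data this would be something like alpha$newvar <- as.Date(alpha$dob, format = "(%Y-%m-%d)"), which keeps newvar as class Date so it can be sorted or used in date arithmetic.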


 