Parsing unstructured files with R

I'm trying to parse this unstructured file using R

ftp://ftp.fu-berlin.de/pub/misc/movies/database/genres.list.gz

Deadpoint (2012)     Action
Deadpoint (2012)     Drama
Deadpoint (2012)     Short
Deadpoint (2016)     Action
Deadpoint (2016)     Adventure
Deadpoint (2016)     Drama
Deadpoint (2016)     Horror
Deadpoint (2016)     Short
Deadpool (2013) (VG)     Action
Deadpool (2013) (VG)     Comedy
Deadpool (2013) (VG)     Fantasy
Deadpool (2016)      Action
Deadpool (2016)      Adventure
Deadpool (2016)      Comedy
Deadpool (2016)      Romance
Deadpool (2016)      Sci-Fi
Deadpool 2 (2018)     Action
Deadpool 2 (2018)     Adventure
Deadpool 2 (2018)     Comedy
Deadpool 2 (2018)     Fantasy

I posted the sample as code because I cannot reproduce the exact format here. Each line holds a movie title (including the year), a variable number of tabs, and a one-word genre.

I want to capture the movie title in one column and the genre in a second column. With a regex I would do it like this:

^(.*?)\t+(\S+)$

I tried reading the lines from the gzip file with read_lines and collapsing the tabs with gsub("\\t+","\\t",lines), but read.table would not read the cleaned variable:

read.table(lines, header = FALSE, sep = "\t", quote = "\"", fill = TRUE, comment.char = "", skip=380)

With the code above I get the movie title in the first column and the genre in one of six further columns, depending on how many tabs the line has. Any ideas for alternative ways to get this done?

Assuming you have read the file into a character vector called line, try the code below. It implements your regex, adjusted for R's quirks. Your regex doesn't account for films that have extra text between the year and the genre (e.g. "Angel City (2011) {{SUSPENDED}} Drama"), but that doesn't occur often.

line <- gsub('\"', '', line)     # delete quotes
line <- gsub('\\t',' ', line)    # tabs into spaces
line <- gsub(' {2,}', ' ', line) # delete extra spaces
line <- regmatches(line,regexpr('^(.*?)\\s(\\S+)$',line))

It takes a while to run through the 2.3 million lines, but it works.
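Note that regmatches() combined with regexpr() returns the whole matched string, not the individual capture groups. To end up with the title and the genre in separate columns, one option is regexec(), which keeps the groups. The following is only a sketch on a few made-up sample lines in the same shape as the file; the names pattern, parts and movies are mine, and it uses the tab-based pattern from the question rather than the space-normalised one above.

```r
# Sample lines in the same shape as genres.list:
# title, one or more tabs, one-word genre.
line <- c("Deadpoint (2012)\t\t\tAction",
          "Deadpool (2013) (VG)\tComedy",
          "Deadpool 2 (2018)\t\tFantasy")

# Pattern from the question: lazy title, a run of tabs, one-word genre.
pattern <- "^(.*?)\\t+(\\S+)$"

# regexec() keeps the capture groups; each list element is
# c(full match, group 1, group 2).
parts <- regmatches(line, regexec(pattern, line, perl = TRUE))

# Bind the rows together and drop the full-match column.
m <- do.call(rbind, parts)
movies <- data.frame(title = m[, 2], genre = m[, 3],
                     stringsAsFactors = FALSE)
print(movies)
```

Lines that do not match the pattern yield an empty element in parts and are silently dropped by rbind(), so it is worth comparing nrow(movies) with length(line) on the real file.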

The following code splits the data into two columns, using tibble and tidyr.

library(readr)
library(tidyr)
library(tibble)

data <- read_lines("genres.list.gz", skip = 380)

data <- gsub('\"', '', data)     # delete quotes
data <- gsub('\\t+','~', data)    # replace tabs with a ~ 


movies <- data %>% 
  # turn into a tibble, which keeps strings as character (no stringsAsFactors needed)
  # name the column movie
  tibble(movie = .) %>%  
  # split the movie column on "~"
  separate(movie, c("movie", "genre"), sep = "~", extra = "merge")

# clean up workspace
rm(data)

head(movies)
# A tibble: 6 x 2
                     movie       genre
                     <chr>       <chr>
1            !Next? (1994) Documentary
2         #1 Single (2006)  Reality-TV
3    #15SecondScare (2015)      Horror
4    #15SecondScare (2015)       Short
5    #15SecondScare (2015)    Thriller
6 #1MinuteNightmare (2014)      Horror
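One caveat with a placeholder character such as "~": if a title itself contains it, the split lands inside the title. Since the sep argument of separate() is treated as a regular expression, the run of tabs can probably be used as the separator directly. This is a small sketch on made-up sample lines, not something tested against the full file:

```r
library(tidyr)
library(tibble)

# Sample lines in the same shape as genres.list (hypothetical data).
data <- c("!Next? (1994)\t\t\tDocumentary",
          "#1 Single (2006)\t\tReality-TV",
          "Deadpool (2013) (VG)\tAction")

# Split on one or more tabs instead of substituting a placeholder first.
movies <- separate(tibble(movie = data), movie,
                   into = c("movie", "genre"),
                   sep = "\t+", extra = "merge")
print(movies)
```

This assumes that tabs never occur inside a title, which matches the description in the question.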

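I have come up with the following solution:

# `starts` is not defined in this snippet; it must be set before this call
genres <- read.table("genres.list.gz", header = FALSE, sep = "\t", quote = "\"", fill = TRUE, comment.char = "", skip = starts + 2)
# concatenate the six genre columns into one string per row
x <- paste(genres[,2], genres[,3], genres[,4], genres[,5], genres[,6], genres[,7])
# strip all whitespace from the concatenated strings
x <- gsub("\\s+", "", x)
# drop the six genre columns and attach x as the new second column
genres[,2:7] <- NULL
genres[,2] <- x
names(genres) <- c("Title", "Genre")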

First I create a vector x with all the genre columns concatenated. Second, I remove all whitespace from every entry.

Then I remove the genre columns from the genres data.frame and set x as the second column.
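As for the attempt in the question: read.table() treats its first argument as a file name, so a character vector of already-read lines has to be passed through the text argument instead. Once each run of tabs is collapsed to a single tab, every line has exactly two fields and no extra columns appear. A minimal sketch on made-up sample lines (the variable names cleaned and genres are just for illustration):

```r
# Sample lines in the same shape as genres.list (hypothetical data).
lines <- c("#1 Single (2006)\t\tReality-TV",
           "Deadpoint (2012)\t\t\tAction",
           "Deadpool (2013) (VG)\tComedy")

# Collapse each run of tabs into one so every line has two fields.
cleaned <- gsub("\t+", "\t", lines)

# text = makes read.table parse the strings themselves.
# comment.char = "" keeps titles that start with "#";
# quote = "" stops quotes inside titles from being interpreted.
genres <- read.table(text = cleaned, header = FALSE, sep = "\t",
                     quote = "", comment.char = "",
                     col.names = c("Title", "Genre"),
                     stringsAsFactors = FALSE)
print(genres)
```

Here quote = "" is my choice rather than the quote = "\"" used in the question; it avoids having to delete the quotes beforehand, but it does leave them in the titles.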
