I am working on a project and need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the delay itself, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis. The main problem is the time format of the first two columns when the table is read in from the CSV file. I have linked the data file below because I wasn't sure how to attach it; I'm new to this coding thing, hehe. Any help will be appreciated. xx
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
EDIT:
Firstly, thank you for all the help.
Okay, I'm going to try to ask more precise questions on this topic.
So, after importing the file using:
1)
Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",")
2) The main issue I am having is getting the columns schedtime and deptime into a format where I can do arithmetic.
3) I tried the following:
Delays[,1] - Delays[,2]
where the obvious issue arises: for example, 800 (8:00 am) - 756 (7:56 am) = 44, not 4 minutes.
4) Using the help from @kerry Jackson (thank you, you're amazing x), I tried:
DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime = DepartureTime - ScheduleTime
The values are given in seconds, but I want the difference in minutes. How would I go about doing this?
5) I then did the following:
DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])
(Image: what I get after creating DelayData.)
As you can see in the image, the DelayTime column is in seconds, which I don't want (as stated in 4), and the ScheduleTime and DepartureTime columns include the date. Could I get some suggestions on how to correct this?
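One possible way to address points 4) and 5) is sketched below in base R. It is only a sketch under assumptions: the toy vectors stand in for Delays$schedtime and Delays$deptime, the helper name to_time is made up for illustration, and the time zone is fixed to UTC so that daylight-saving changes cannot affect the differences. difftime() with units = "mins" gives minutes directly, and format() drops the date by turning the times into plain "HH:MM" strings.

```r
# Toy HHMM integers standing in for Delays$schedtime and Delays$deptime
schedtime <- c(756L, 1455L, 930L)
deptime   <- c(800L, 1450L, 1015L)

# Hypothetical helper: pad to four digits ("0756"), then parse as hours and minutes
to_time <- function(x) {
  padded <- formatC(x, width = 4, format = "d", flag = "0")
  as.POSIXct(strptime(padded, format = "%H%M", tz = "UTC"))
}

ScheduleTime  <- to_time(schedtime)
DepartureTime <- to_time(deptime)

# difftime() lets you choose the unit; as.numeric() drops the "mins" label
DelayTime <- as.numeric(difftime(DepartureTime, ScheduleTime, units = "mins"))

# format() keeps only the clock time, as character strings without the date
DelayData <- data.frame(
  ScheduleTime  = format(ScheduleTime, "%H:%M"),
  DepartureTime = format(DepartureTime, "%H:%M"),
  DelayTime     = DelayTime
)
print(DelayData)
```

Note that the formatted columns are character strings, so they are for display only; the numeric DelayTime column is what would go into a regression.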
Create a new column called flight_delay:
install.packages('tidyverse')
library(tidyverse)
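# deptime and schedtime are HHMM integers (e.g. 756 = 7:56 am), so convert
# each to minutes since midnight before subtracting
your_data <- your_data %>%
  mutate(flight_delay = (deptime %/% 100 * 60 + deptime %% 100) -
                        (schedtime %/% 100 * 60 + schedtime %% 100))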
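Because deptime and schedtime are stored as HHMM integers, it may help to check the arithmetic on a few toy values first. The sketch below uses a hypothetical helper, hhmm_to_minutes, and made-up example times. The last step rests on an assumption that is not confirmed by the data: that any gap of more than 12 hours really means the flight crossed midnight.

```r
# Hypothetical helper: convert an HHMM number (756 = 7:56 am) to minutes since midnight
hhmm_to_minutes <- function(x) {
  (x %/% 100) * 60 + (x %% 100)
}

# Made-up example times, not taken from the real data set
schedtime <- c(756, 1455, 2350)
deptime   <- c(800, 1450, 10)

# Direct subtraction treats HHMM as an ordinary number, which is wrong
naive <- deptime - schedtime
print(naive)         # 44 -5 -2340

# Converting to minutes first gives the real difference within one day
flight_delay <- hhmm_to_minutes(deptime) - hhmm_to_minutes(schedtime)
print(flight_delay)  # 4 -5 -1420

# ASSUMPTION: a gap beyond +/- 12 hours is a departure on the other side of
# midnight, so wrap the difference into the range [-720, 720)
wrapped <- ((flight_delay + 720) %% 1440) - 720
print(wrapped)       # 4 -5 20
```

Whether the wrap-around step is appropriate depends on the data; if no flights in the file are scheduled near midnight, it can be left out.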
Now, create a linear regression model predicting flight_delay by every other variable:
mod <- lm(flight_delay ~ ., data=your_data)
To optimize your model, use the step function:
mod <- step(mod)
Analyze results:
summary(mod)