
Regression Analysis

I am currently working on a project and need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the length of the delays, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis. The main problem is the time format of the first two columns when you read in the table from the csv file. I have linked the data file below because I wasn't too sure how to attach it, I'm new to this coding thing hehe. Any help will be appreciated. xx

https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing

EDIT:

Firstly, thank you for all the help.

Okay I'm going to try and ask more precise questions on this topic:

So after importing the file using:

1)

    Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",") 

2) The main issue I am having is getting the columns schedtime and deptime into a format where I can do arithmetic calculations.

3) I tried the below:

    Delays[,1] - Delays[,2] 

where the obvious issue arises: for example, 800 (8am) - 756 (7:56am) = 44, not 4 minutes.
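The arithmetic problem above comes from treating HHMM values as ordinary numbers. One way to sidestep it, sketched here with made-up values rather than the real file, is to convert each HHMM integer to minutes since midnight before subtracting. The helper name `hhmm_to_minutes` is hypothetical, and the sketch assumes no flight crosses midnight.

```r
# Convert HHMM integers (e.g. 756 for 7:56am) to minutes since midnight.
# Integer division gives the hours, the remainder gives the minutes.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Made-up example values, not taken from FlightDelaysSM.csv
schedtime <- c(756, 1455, 930)
deptime   <- c(800, 1450, 1015)

# Delay in minutes; a negative value means an early departure
delay_minutes <- hhmm_to_minutes(deptime) - hhmm_to_minutes(schedtime)
delay_minutes
```

With these values, the first flight comes out as a 4 minute delay rather than 44.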

4) Using the help from @kerry Jackson (thank you, you're amazing x) I tried:

    DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")

    ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")

    DelayTime = DepartureTime - ScheduleTime

The values are given in seconds, but I want the difference in minutes. How would I go about doing this?
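For the seconds-versus-minutes question, one possible approach is base R's `difftime()`, which accepts an explicit `units` argument, whereas plain subtraction of date-times picks the units automatically. The sketch below uses two made-up HHMM values and assumes a fixed `tz = "UTC"` so that daylight saving changes cannot affect the result.

```r
# Made-up HHMM values standing in for Delays$schedtime and Delays$deptime
schedtime <- c(756L, 1455L)
deptime   <- c(800L, 1450L)

# Pad to four digits ("0756") and parse as hours and minutes
ScheduleTime  <- strptime(formatC(schedtime, width = 4, format = "d", flag = "0"),
                          "%H%M", tz = "UTC")
DepartureTime <- strptime(formatC(deptime, width = 4, format = "d", flag = "0"),
                          "%H%M", tz = "UTC")

# Ask for minutes explicitly instead of letting R choose the units
DelayTime <- difftime(DepartureTime, ScheduleTime, units = "mins")

# Drop the difftime class to get a plain numeric vector of minutes
DelayMinutes <- as.numeric(DelayTime)
DelayMinutes
```

A plain numeric vector is usually easier to use as a response in `lm()` than a `difftime` object.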

5) I then did the following:

    DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])

[Image: what I obtain after making DelayData]

As you can see in the image, the DelayTime column carries the seconds units, which I don't want (as stated in 4), and the ScheduleTime and DepartureTime columns include the date. Could I possibly get some suggestions on how to correct this?
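One way to get rid of both the date and the units label, sketched here on made-up values, is to store the clock times as "HH:MM" text with `format()` and the delay as a plain number. The helper `to_time` is hypothetical, and the other columns of `Delays` are left out to keep the sketch self-contained.

```r
# Made-up HHMM values, not taken from the real file
schedtime <- c(756L, 1455L)
deptime   <- c(800L, 1450L)

# Hypothetical helper: HHMM integer -> POSIXct date-time (the date part is arbitrary)
to_time <- function(hhmm) {
  as.POSIXct(formatC(hhmm, width = 4, format = "d", flag = "0"),
             format = "%H%M", tz = "UTC")
}

ScheduleTime  <- to_time(schedtime)
DepartureTime <- to_time(deptime)

DelayData <- data.frame(
  # Keep only the clock time as text, so no date is shown
  ScheduleTime  = format(ScheduleTime, "%H:%M", tz = "UTC"),
  DepartureTime = format(DepartureTime, "%H:%M", tz = "UTC"),
  # Plain numeric minutes, so no "secs" label is attached
  DelayTime     = as.numeric(difftime(DepartureTime, ScheduleTime, units = "mins"))
)
DelayData
```

The formatted columns are character strings, so they are for display only; any further arithmetic should be done on the original date-times or on the numeric delay.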

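Create a new column called flight_delay. Because deptime and schedtime are stored as HHMM integers, convert each to minutes since midnight before subtracting:

    install.packages('tidyverse')  # only needed once
    library(tidyverse)

    your_data <- your_data %>%
      mutate(flight_delay = (deptime %/% 100 * 60 + deptime %% 100) -
                            (schedtime %/% 100 * 60 + schedtime %% 100))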

Now, create a linear regression model predicting flight_delay by every other variable:

    mod <- lm(flight_delay ~ ., data=your_data)

To optimize your model, use the step function:

    mod <- step(mod)

Analyze results:

    summary(mod)
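One thing to watch with `flight_delay ~ .` is that the dot includes deptime and schedtime themselves, and together those two columns determine the delay. A possible variation, sketched below on simulated toy data, removes them from the formula. The columns `weather` and `distance` and all values are hypothetical and only there to make the example run.

```r
set.seed(42)  # toy data only; every value below is simulated
n <- 60
sched_min <- sample(seq(360, 1200, by = 15), n, replace = TRUE)  # minutes since midnight
weather   <- rbinom(n, 1, 0.2)           # hypothetical 0/1 bad-weather flag
distance  <- round(runif(n, 150, 250))   # hypothetical route distance
delay     <- round(2 + 25 * weather + rnorm(n, sd = 6))

# Minutes since midnight -> HHMM integer, mimicking the file's time format
to_hhmm <- function(m) (m %/% 60) * 100 + m %% 60

toy <- data.frame(
  schedtime    = to_hhmm(sched_min),
  deptime      = to_hhmm(sched_min + delay),
  weather      = weather,
  distance     = distance,
  flight_delay = delay
)

# Use every column except the two that define the response
mod <- lm(flight_delay ~ . - deptime - schedtime, data = toy)
mod <- step(mod, trace = 0)  # trace = 0 suppresses the step-by-step log
summary(mod)
```

Whether schedtime should be kept as a predictor (for instance as hour of day) is a modelling choice; it is dropped here only to keep the sketch simple.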

