Predicting Flight Delays
Prior to the 2020 coronavirus pandemic, the United States airline industry was growing steadily, earning $248 billion in revenue in 2019 alone. Millions more Americans were boarding planes each year; in 2019, 811 million people boarded a US plane. In 2020, the airline industry suffered along with many others. Now, in 2021, as air travel begins to pick up again, each airline needs to regain the trust of travelers by proving itself reliable. Now more than ever, it is important that United States airline companies operate effectively by getting travelers from point A to point B with minimal delays.
When airlines are able to predict delays in advance, they have more time to come up with a solution for their travelers, such as buffering the next flight’s departure time or having an extra team and aircraft available at airports with an increased chance of delays. The model below is a tool designed for airline companies to use when building flight schedules in order to predict and plan for delays.
Flight Delays Predictor
- A model designed to predict if a commercial plane will be delayed.
- Based on origin and destination airport information, the aircraft, flight times and holiday season.
- Currently predicting with 75% accuracy.
Docker Pull: docker pull samanthaglasson/flight_delay_image
Model Construction:
Data Preprocessing
Exploratory data analysis and data preprocessing performed in PostgreSQL:
- Imputed values for some nulls in ARRIVAL_DELAY using SCHEDULED_ARRIVAL - ARRIVAL_TIME.
- Dropped 92,513 rows (or 1.6% of the dataset) containing nulls in all 6 columns.
- Replaced the remaining 12,558 nulls by imputing AIR_TIME and ELAPSED_TIME based on data in other columns.
- Inserted values into LATITUDE and LONGITUDE columns for the airports with nulls.
- Joined data tables (flight data, airport data and holiday season calendar).
- Dropped data where the flight route had fewer than 700 flights.
- Used SQLAlchemy to import the joined data from PostgreSQL into Jupyter Notebook.
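The route filter in the list above can be sketched in Python. This is a minimal illustration, not the project's actual query: it uses the standard-library sqlite3 module in place of PostgreSQL, and the table name flights, the columns origin_airport and destination_airport, and the helper drop_sparse_routes are all assumed names.

```python
import sqlite3


def drop_sparse_routes(conn, min_flights):
    """Delete every flight whose origin-destination route has fewer than min_flights rows."""
    conn.execute(
        """
        DELETE FROM flights
        WHERE origin_airport || '>' || destination_airport IN (
            SELECT origin_airport || '>' || destination_airport
            FROM flights
            GROUP BY origin_airport, destination_airport
            HAVING COUNT(*) < ?
        )
        """,
        (min_flights,),
    )
    conn.commit()


# Tiny in-memory demo (hypothetical rows); the project used a threshold of 700.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin_airport TEXT, destination_airport TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("ATL", "LAX"), ("ATL", "LAX"), ("ATL", "LAX"), ("JFK", "BOS")],
)
drop_sparse_routes(conn, min_flights=2)
remaining = conn.execute(
    "SELECT origin_airport, destination_airport FROM flights"
).fetchall()
print(remaining)
```

Counting per route with GROUP BY and HAVING keeps the threshold in one place, so it can be tuned without touching the rest of the query.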
Exploratory data analysis and data preprocessing performed in Jupyter Notebook:
- Encoded all categorical data using LabelEncoder.
- Dropped redundant columns.
- Created the target variable column, result15.
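The notebook steps above can be sketched without any third-party packages. The helper label_encode mimics the idea behind scikit-learn's LabelEncoder (integer codes assigned to the sorted unique values). The helper make_result15 assumes that result15 flags flights arriving 15 or more minutes late; that threshold is inferred from the column name only, and both function names are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    code_of = {category: code for code, category in enumerate(classes)}
    return [code_of[v] for v in values], classes


def make_result15(arrival_delays, threshold=15):
    """Return 1 for a delay of at least `threshold` minutes, else 0 (assumed rule)."""
    return [1 if delay >= threshold else 0 for delay in arrival_delays]


# Hypothetical sample values for illustration.
airline_codes, airline_classes = label_encode(["DL", "AA", "DL", "UA"])
result15 = make_result15([-5, 14, 15, 40])
print(airline_codes, airline_classes, result15)
```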
Visualizing Data
Model Building
Decision Tree Classifier
The classification report displays the model’s precision, recall, F1 and support scores. Precision tells us what percent of the model’s positive (delay) predictions were correct, and recall tells us what percent of the actual positive samples (delays) were caught by the model.
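As a worked illustration of these definitions, the sketch below computes precision, recall and F1 for the delayed class from raw counts. The counts are made up for the example and are not results from this model; precision_recall_f1 is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 delays caught, 10 false alarms, 20 delays missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, f1)
```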
At first, the model reported a moderately high accuracy; however, the data was very imbalanced, so accuracy alone was misleading. The model also carried features with no significant importance, which were dropped to improve the model’s accuracy.
Steps to improve model
To address the class imbalance, a random sample of the majority class (on-time flights) was taken. Additionally, a pipeline was constructed that uses the Synthetic Minority Oversampling Technique (SMOTE) to synthesize new examples in the minority class (delayed flights) and randomly drops rows from the majority class.
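The two resampling ideas can be sketched in plain Python. This is a simplified, dependency-free illustration and not the project's pipeline: random_undersample and smote_like_oversample are hypothetical helpers, the points are invented, and the oversampler only captures the core SMOTE idea of interpolating between a minority point and one of its nearest minority neighbours. It needs at least two minority points.

```python
import random


def random_undersample(rows, n_keep, rng):
    """Randomly keep n_keep rows of the majority class, without replacement."""
    return rng.sample(rows, n_keep)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic points between minority points and their k nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


rng = random.Random(42)
on_time = [(float(i), float(i % 3)) for i in range(8)]  # majority class
delayed = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]          # minority class
kept_on_time = random_undersample(on_time, 6, rng)
new_delayed = smote_like_oversample(delayed, 3, k=2, rng=rng)
balanced_delayed = delayed + new_delayed
print(len(kept_on_time), len(balanced_delayed))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class proportions.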
As you can see, these steps were successful in balancing the data.
The confusion matrix summarizes the percentage of correct and incorrect predictions the model has made for each class. Over 22% of on-time flights are being classified as ‘delayed’, a Type I error (false positive). Roughly 27% of delayed flights are being classified as ‘on time’, a Type II error (false negative). Over 77% of the on-time flights are predicted correctly (true negatives) and almost 73% of the delayed flights are correctly predicted ‘delayed’ (true positives).
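To show how per-class percentages like these are derived, the sketch below builds a confusion matrix normalized by actual class, so each row sums to 1. The labels are invented for the example (0 for on time, 1 for delayed) and normalized_confusion is a hypothetical helper.

```python
def normalized_confusion(y_true, y_pred):
    """Return rates[actual][predicted], normalized so each actual class sums to 1."""
    counts = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}}
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1
    rates = {}
    for actual, row in counts.items():
        total = sum(row.values())
        rates[actual] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return rates


# Hypothetical labels: 0 = on time, 1 = delayed.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
rates = normalized_confusion(y_true, y_pred)
print("true negative rate:", rates[0][0], "false positive rate:", rates[0][1])
print("false negative rate:", rates[1][0], "true positive rate:", rates[1][1])
```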
The ROC curve shows the trade-off between sensitivity and specificity. It is constructed by plotting the true positive rate (TPR), the proportion of actual delays that are correctly predicted as delayed, against the false positive rate (FPR), the proportion of on-time flights that are incorrectly predicted as delayed.
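The construction can be sketched by sweeping a decision threshold over predicted scores and recording one (FPR, TPR) point per threshold. The scores and labels below are invented, and roc_points is a hypothetical helper that assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by thresholding scores from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted delayed
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = delayed) and predicted delay probabilities.
points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points)
```

A curve that hugs the top-left corner means the model reaches a high TPR while keeping the FPR low.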
Feature importances give us insight into which features are more or less relevant to predicting the target variable.
Next Steps:
Moving forward, I would like to incorporate average weather data for each airport by time of year. I would also like to review additional flight data from other years and build a model that takes into account other causes of delays. With this data we may be able to construct a model that not only predicts delays, but also estimates the duration of the delay.
Acknowledgements:
Oswald Vinueza & Connor Fryar for consistent guidance and encouragement as well as their willingness to run my code on their computers. Drew Jones for his patience and constructive feedback.