
Kaggle: Spaceship Titanic (LightGBM, CatBoost, Vote Classifier)

April 16, 2023

Introduction

This is a Kaggle competition to predict which passengers were transported to an alternate dimension. The data comes from https://www.kaggle.com/competitions/spaceship-titanic

I blended predictions from two algorithms – LightGBM and CatBoost – with a Voting Classifier to maximize prediction accuracy.

Achievement

  • Accuracy on the test dataset after submission was 0.80897
  • Rank: 133/2504 (Top 5% as of April 17, 2023)

Datasets

  • 12 variables for each passenger and a column indicating whether the passenger was successfully transported (the y variable). 8,693 rows in the training set and 4,277 rows in the test set.

  • After data cleaning, I kept all rows and imputed missing values.
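
A minimal sketch of loading the data and checking shapes and missing values (file names follow the standard Kaggle download; the dataframe names train and test are mine):

    import pandas as pd

    # Standard Kaggle competition files
    train = pd.read_csv("train.csv")   # 8,693 rows: 12 feature columns + the Transported target
    test = pd.read_csv("test.csv")     # 4,277 rows: no target column

    print(train.shape, test.shape)
    print(train.isna().sum())          # count of missing values per column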

Language and libraries

Language : Python

Libraries :

  • Data Manipulation: pandas, numpy
  • Data Visualization: matplotlib, seaborn, msno
  • Machine Learning: sklearn, xgboost, lightgbm, catboost

Data Preprocessing 1

  • Insights :
    • Missing values appear in no particular order or pattern

Check the percentage and distribution of missing values (i.e. data sparsity):

  • The missingno visualization shows no specific trend or pattern in the distribution of missing values.

  • There are no features with missing values over 5%.
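
A quick sketch of this sparsity check (assuming the training dataframe is named train as above):

    import missingno as msno
    import matplotlib.pyplot as plt

    # Percentage of missing values per column - all stay below 5%
    print((train.isna().mean() * 100).round(2).sort_values(ascending=False))

    # missingno matrix: no obvious pattern in where the values are missing
    msno.matrix(train)
    plt.show()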

Separate “PassengerId”

  • Extract the number after “_” in the PassengerId column as the family number
  • e.g. original format “9280_02” -> “9280” as the group id (PassengerId) & “02” as the family number
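
A short sketch of this split (the dataframe name train follows the earlier sketch):

    # "9280_02" -> group id "9280" and family/position number "02"
    parts = train["PassengerId"].str.split("_", expand=True)
    train["PassengerId"] = parts[0]
    train["Fam_num"] = parts[1].astype(int)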

Separate “Cabin”

  • Split the Cabin column into Cabin_deck, Cabin_num, and Cabin_side
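
A sketch of the split (Cabin has the form deck/num/side, e.g. B/0/P):

    # Split Cabin into its three components; missing cabins stay missing in all three
    train[["Cabin_deck", "Cabin_num", "Cabin_side"]] = train["Cabin"].str.split("/", expand=True)
    train["Cabin_num"] = pd.to_numeric(train["Cabin_num"])   # keep the cabin number as numeric
    train = train.drop(columns=["Cabin"])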

Convert to correct data type

Categorical

  • HomePlanet: Earth, Europa, Mars
  • Cabin: 3 components - cabin deck / number / side
  • Destination: 3 categories in total
  • Cabin_deck: categorized by deck letter
  • Cabin_side: P or S

Numerical

  • Age: not normally distributed
  • RoomService: needs transformation (highly right-skewed)
  • FoodCourt: needs transformation (highly right-skewed)
  • ShoppingMall: needs transformation (highly right-skewed)
  • Spa: needs transformation (highly right-skewed)
  • VRDeck: needs transformation (highly right-skewed)
  • Cabin_num

Boolean

  • CryoSleep
  • VIP: imbalanced (mostly False, but should be fine)

The resulting data information (column types after conversion) is sketched below.
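
A hedged sketch of the conversions (the exact dtypes, and representing the boolean features as 0/1 floats so missing values stay as NaN until imputation, are my choices; tree models are used later, so the log transform of the skewed spending features is left as a note only):

    # Categorical features
    for col in ["HomePlanet", "Destination", "Cabin_deck", "Cabin_side"]:
        train[col] = train[col].astype("category")

    # Boolean features as 0/1 floats; missing values remain NaN for now
    for col in ["CryoSleep", "VIP"]:
        train[col] = train[col].astype(float)

    # Spending features are highly right-skewed; a log1p transform would only be
    # needed for a linear model, so it is skipped here
    spending = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    print(train[spending].skew())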

EDA (Exploratory Data Analysis) and Feature Selection

  • Visualize correlation between all features

There is no strong correlation between the independent variables and the dependent variable, so a linear regression model alone would likely not perform well.

  • Check the pairplot

Some transformations might be needed if linear regression models were used.
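
A minimal sketch of these two checks with seaborn (numeric columns only for the correlation matrix; recent pandas assumed for the numeric_only keyword):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation heatmap across numeric features and the target
    corr = train.assign(Transported=train["Transported"].astype(int)).corr(numeric_only=True)
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()

    # Pairplot to eyeball pairwise relationships and skew (can be slow with many columns)
    sns.pairplot(train, hue="Transported", corner=True)
    plt.show()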

Feature Engineering

  • Lux_exp: sum of the unnecessary (luxury) spending such as RoomService, Spa, and VRDeck. This may indirectly represent wealth.
  • Total_exp: sum of Lux_exp, FoodCourt, and ShoppingMall. This is another feature representing wealth.
  • costly: split Total_exp into 2 levels (‘costly_False’, ‘costly_True’) by quantile, used for imputing missing values of CryoSleep
  • CryoSleep: indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. It might be an important feature because suspended animation may significantly affect the tele-transportation process. Impute missing values as 0 when passengers are not rich (i.e. costly_False) and as 1 when they are rich. A sketch of these derived features follows this list.
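
A hedged sketch of these derived features (the exact quantile used for costly is my assumption; the median is used here):

    # Luxury spending as a rough wealth proxy
    train["Lux_exp"] = train[["RoomService", "Spa", "VRDeck"]].sum(axis=1)

    # Total spending adds FoodCourt and ShoppingMall
    train["Total_exp"] = train["Lux_exp"] + train["FoodCourt"] + train["ShoppingMall"]

    # Two-level flag on Total_exp, split at the median (assumed quantile)
    train["costly"] = (train["Total_exp"] > train["Total_exp"].median()).map(
        {True: "costly_True", False: "costly_False"}
    )

    # CryoSleep imputation rule from above: not rich -> 0, rich -> 1
    missing = train["CryoSleep"].isna()
    train.loc[missing & (train["costly"] == "costly_False"), "CryoSleep"] = 0
    train.loc[missing & (train["costly"] == "costly_True"), "CryoSleep"] = 1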

Data Preprocessing 2 - Deal with missing values in pipeline

  • Impute with most frequent values
    • HomePlanet, VIP, Cabin_deck, Cabin_side, costly
  • Impute with the least frequent value:
    • Destination (‘PSO J318.5-22’), since both missing values and this destination tend not to be transported
  • Impute with median
    • Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck, Fam_num, Cabin_num, Lux_exp, Total_exp
  • Impute with feature engineering
    • CryoSleep
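
These groups map directly onto column lists that the pipeline sketch below reuses:

    # Column groups matching the imputation strategies above
    most_freq_cols = ["HomePlanet", "VIP", "Cabin_deck", "Cabin_side", "costly"]
    least_freq_cols = ["Destination"]   # filled with the constant 'PSO J318.5-22'
    median_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa",
                   "VRDeck", "Fam_num", "Cabin_num", "Lux_exp", "Total_exp"]
    binary_cols = ["CryoSleep"]         # already imputed by the feature-engineering rule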

Model Build-up

  • Split data: 85% training set, 15% hold-out set (used as a validation set)

  • Pipeline
    • Feature pipeline: use a pipeline to connect the preprocessing steps for the different feature groups
    • Drop unused features
    • Numerical features: simple impute with the median
    • Categorical features (most frequent): simple impute with the most frequent value, then one-hot encode
    • Categorical features (least frequent): impute with a constant (the least frequent value), then one-hot encode
    • Binary features: impute with 0, then one-hot encode

  • Model pipeline: after building the feature pipelines, I connected them to the model, so the same preprocessing is applied automatically to the test set. A sketch of the full pipeline is below.
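
A hedged sketch of how this could be wired with scikit-learn (the dropped columns, the random seed, and LightGBM as the example estimator are my assumptions; the column lists come from the sketch above):

    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from lightgbm import LGBMClassifier

    # Target and features; the dropped columns are my assumption
    X = train.drop(columns=["Transported", "Name", "PassengerId"])
    y = train["Transported"].astype(int)

    # 85% / 15% split; the 15% acts as a validation set
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.15, random_state=42, stratify=y
    )

    preprocess = ColumnTransformer(transformers=[
        ("num", SimpleImputer(strategy="median"), median_cols),
        ("cat_freq", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), most_freq_cols),
        ("cat_rare", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="PSO J318.5-22")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), least_freq_cols),
        ("binary", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value=0)),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), binary_cols),
    ])

    # Model pipeline: the same preprocessing is applied automatically to any new data
    lgbm_model = Pipeline([("prep", preprocess), ("clf", LGBMClassifier())])
    lgbm_model.fit(X_train, y_train)
    print("validation accuracy:", lgbm_model.score(X_valid, y_valid))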

Model Selection

After trying a number of algorithms, LightGBM and CatBoost performed best.

I then decided to combine them to see if the accuracy could be pushed further.

The voting classifier's validation score was better than either of the other two. After submission, however, the individual CatBoost and LightGBM results were better than the voting classifier's. This might be due to data leakage when training those algorithms.
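
A minimal sketch of the blend, reusing preprocess and the train/validation split from the pipeline sketch above (soft voting and default hyperparameters are my assumptions):

    from sklearn.ensemble import VotingClassifier
    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier

    # Soft voting averages the predicted probabilities of the two models
    voter = VotingClassifier(
        estimators=[
            ("lgbm", LGBMClassifier()),
            ("cat", CatBoostClassifier(verbose=0)),
        ],
        voting="soft",
    )

    vote_model = Pipeline([("prep", preprocess), ("clf", voter)])
    vote_model.fit(X_train, y_train)
    print("validation accuracy:", vote_model.score(X_valid, y_valid))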

Evaluation

  • Accuracy: all scores are evaluated on classification accuracy, the competition's metric.
  • Feature importances from CatBoost:

CryoSleep is not the most important feature, but it still ranks around 7th.
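
A sketch of pulling the importances out of a fitted CatBoost pipeline (reusing preprocess from above; the names are the post-encoding feature names, and get_feature_names_out needs a reasonably recent scikit-learn):

    import pandas as pd
    from catboost import CatBoostClassifier

    # Fit CatBoost alone on the same preprocessed features to read its importances
    cat_model = Pipeline([("prep", preprocess), ("clf", CatBoostClassifier(verbose=0))])
    cat_model.fit(X_train, y_train)

    importances = pd.Series(
        cat_model.named_steps["clf"].get_feature_importance(),
        index=cat_model.named_steps["prep"].get_feature_names_out(),
    ).sort_values(ascending=False)
    print(importances.head(10))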

Potential future improvements

  • Semi-supervised learning: using semi-supervised learning to impute missing values may be helpful. It can avoid losing information while largely maintaining accuracy.
  • Write a separate data preprocessor for CatBoost: a preprocessor without one-hot encoding might be helpful because CatBoost has built-in handling for categorical features. Leaving this step to a well-designed algorithm might work better.
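
For the second point, a rough sketch of a CatBoost-native setup (column names and the train/validation split as assumed earlier; categorical columns are passed directly via cat_features instead of being one-hot encoded):

    from catboost import CatBoostClassifier

    # Categorical columns handed to CatBoost directly, no one-hot encoding
    cat_cols = ["HomePlanet", "Destination", "Cabin_deck", "Cabin_side", "costly"]

    def to_catboost_frame(df):
        out = df.copy()
        # CatBoost needs non-null categorical values; missing entries become their own "nan" string
        out[cat_cols] = out[cat_cols].astype(str)
        # Boolean-style columns as floats; CatBoost handles missing numeric values natively
        out[["CryoSleep", "VIP"]] = out[["CryoSleep", "VIP"]].astype(float)
        return out

    native_cat = CatBoostClassifier(cat_features=cat_cols, verbose=0)
    native_cat.fit(to_catboost_frame(X_train), y_train)
    print("validation accuracy:", native_cat.score(to_catboost_frame(X_valid), y_valid))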

Takeaways

  • First try at blending results from different models
  • First try at using an advanced gradient boosting algorithm other than XGBoost
  • A Voting Classifier is not always helpful: when combining multiple algorithms with a voting classifier, the result is not always better, because some algorithms outperform others or because the benefits of the algorithms can cancel each other out. In this case, the single algorithms, CatBoost and LightGBM, performed better than the voting classifier.

Code

For the code details, you can check the notebook on Kaggle or the repository on Github.