Solving the Space Titanic Kaggle Competition
First steps
As a first step, I read several articles about data science and how to process data, and learned some new concepts about models and how to train them. I also studied various data preprocessing techniques, explored feature engineering strategies, and reviewed the different types of machine learning algorithms.
Understanding the Dataset
The dataset, comprising personal records of passengers, is split into two parts: the training data (found in 'train.csv') consisting of approximately two-thirds (around 8700) of the passengers and the test data (in 'test.csv') with the remaining one-third (approximately 4300) of the passengers. Each entry in the dataset is associated with a unique PassengerId, indicating the group the passenger is traveling with and their position within the group. Key attributes, such as HomePlanet, CryoSleep, Cabin, Destination, Age, VIP status, and various onboard amenities' billing details, provide essential insights into each passenger's profile.
Dataset Attributes
PassengerId: Unique identifier for each passenger, denoting the passenger's group and position within the group.
HomePlanet: Denotes the planet of permanent residence for each passenger, reflecting their point of departure.
CryoSleep: Indicates whether the passenger chose to undergo suspended animation during the voyage, confining them to their cabins.
Cabin: Specifies the cabin number where the passenger is lodged, delineated by deck/number/side, where the side can be 'P' for Port or 'S' for Starboard.
Destination: Specifies the planet where the passenger is scheduled to disembark.
Age: Reflects the age of each passenger.
VIP: Indicates whether the passenger paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Reflect the respective billing amounts for each passenger at various luxury amenities on the Spaceship Titanic.
Name: Records the first and last names of each passenger.
Transported: The target attribute, denoting whether the passenger was transported to another dimension during the collision with the spacetime anomaly.
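Since PassengerId encodes both the travelling group and the passenger's position within it, those two parts can be split into separate features. A minimal sketch with pandas, assuming the usual underscore-separated format (the sample ids below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical ids mirroring the group_position format described above.
df = pd.DataFrame({"PassengerId": ["0001_01", "0001_02", "0002_01"]})

# Split the id into the travelling group and the position within the group.
df[["Group", "GroupPos"]] = (
    df["PassengerId"].str.split("_", expand=True).astype(int)
)

print(df)
```

Passengers sharing a `Group` value travelled together, which can itself be a useful engineered feature.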
Data Preprocessing and Feature Engineering
The program begins by reading the data from the 'train.csv' file using the Pandas library. It then proceeds to analyze the data types, identifying categorical, numerical, and boolean attributes. Furthermore, it assesses the presence of null values in the dataset, and strategically handles them by imputing the missing values using medians. Additionally, the program converts categorical attributes to numerical ones. For instance, the 'Cabin' column is utilized to create the 'Deck' and 'Port' features. The 'Deck' and 'Port' attributes are mapped to numeric values based on specific mappings, and the 'Cabin' column is removed from the dataset. Similar transformations are applied to other categorical attributes like 'HomePlanet', 'Destination', 'VIP', and 'CryoSleep' in order for them to be used in the models selected.
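The steps above can be sketched as follows. The toy rows stand in for 'train.csv', and the numeric mappings are illustrative choices, not necessarily the exact ones used in the original program:

```python
import pandas as pd

# Toy rows standing in for 'train.csv'.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None],
    "Age": [39.0, None, 24.0],
    "Cabin": ["B/0/P", "F/3/S", None],
})

# Impute missing numeric values with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Derive 'Deck' and 'Port' (the side letter) from 'Cabin',
# then drop the original column.
cabin = df["Cabin"].str.split("/", expand=True)
df["Deck"] = cabin[0].map(
    {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6, "T": 7}
)
df["Port"] = cabin[2].map({"P": 0, "S": 1})
df = df.drop(columns=["Cabin"])

# Map a categorical attribute to numeric codes.
df["HomePlanet"] = df["HomePlanet"].map({"Earth": 0, "Europa": 1, "Mars": 2})

print(df)
```

The same mapping pattern extends to 'Destination', 'VIP', and 'CryoSleep'.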
Model Training and Evaluation
The processed data is then split into training and testing sets using the 'train_test_split' function from the 'sklearn.model_selection' module. Subsequently, I implemented and evaluated various classification models: Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, Support Vector Machine (SVM), Gaussian Naive Bayes, K-Nearest Neighbors, and a Multi-layer Perceptron Neural Network. The accuracy of each model is computed using the 'score' method. That was as far as I got in these three weeks; afterwards, I used GridSearchCV to tune the hyperparameters of the Random Forest Classifier. In the end, I reached a prediction score of 78.78%, though there is still room for improvement.
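The evaluation loop and the GridSearchCV tuning step can be sketched like this. Synthetic data stands in for the preprocessed features, only a few of the listed classifiers are shown, and the parameter grid is an illustrative choice rather than the one actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed Space Titanic features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit each candidate model and report its test accuracy via 'score'.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))

# Tune the Random Forest with GridSearchCV over a small parameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=3,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("tuned accuracy:", round(grid.score(X_test, y_test), 3))
```

GridSearchCV refits the best parameter combination on the full training split, so the tuned estimator can be scored or used for prediction directly.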
Future Work
In the future, I would like to try implementing a multi-layer neural network to see how it would perform.
Materials that helped me get here
Courses from this class (Machine Learning)