K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification or regression. It predicts the label of a new point from the k training points closest to it, as measured by distance, and it is used in fields such as economics. We can use KNN to predict Titanic survivors.

First, we’ll import pandas and load titanic.csv.
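A minimal sketch of the loading step. The column names follow the common Kaggle Titanic layout, which is an assumption, and three made-up rows stand in for titanic.csv so the snippet runs by itself.

```python
import io

import pandas as pd

# Made-up rows standing in for titanic.csv (illustrative only).
# With the real file in the working directory: raw = pd.read_csv("titanic.csv")
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,,0,0,T3,7.92,,S\n"
)
raw = pd.read_csv(sample_csv)

print(raw.head())
print(raw.shape)
```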

We’ll drop the unnecessary columns from our data and assign the result to a new variable, df.
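Which columns count as unnecessary is not spelled out here, so the list below (identifiers and free-text fields) is an assumption, as is the small stand-in DataFrame named raw.

```python
import pandas as pd

# Small stand-in for the loaded Titanic data (illustrative values).
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Ticket": ["T1", "T2", "T3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Assumed list of columns that carry little predictive signal.
cols_to_drop = ["PassengerId", "Name", "Ticket", "Cabin"]
df = raw.drop(columns=cols_to_drop)

print(df.columns.tolist())
```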

We want the ‘Sex’ column to be numeric, so we’ll map all our ‘male’ values to 0 and our ‘female’ values to 1.
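One way to do this mapping is with pandas' Series.map, sketched here on a tiny stand-in column.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Replace the two string categories with 0 and 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

print(df["Sex"].tolist())
```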

After doing this, we should check for missing values.
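A quick way to count missing values per column is isna().sum(), shown here on illustrative data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries in each column.
missing = df.isna().sum()
print(missing)
```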

‘Age’ has too many missing values for us to drop all of those rows, so let’s impute the missing values with the median age of all passengers. After doing that, we’ll check which missing values are left.
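Median imputation can be sketched with fillna, again on a made-up ‘Age’ column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 40.0]})

# median() skips NaN values, so it is computed from the known ages only.
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)

print(df["Age"].tolist())
print(df.isna().sum())
```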

Only 2 missing values remain, both in the ‘Embarked’ column, so let’s drop the rows that contain them.
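Dropping the remaining incomplete rows is a one-liner with dropna; the data below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", None, "C"],
})

# Remove every row that still contains a missing value.
df = df.dropna()

print(len(df))
```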

Then we can use OneHotEncoder or pandas get_dummies to make all our columns numeric. KNN computes distances between rows, so every feature must be a number before we can fit a classification model.
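A sketch of the pandas route. get_dummies leaves numeric columns alone and expands each string column into one indicator column per category; the name one_hot_df matches the variable used in the split step.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": [0, 1, 1],
    "Embarked": ["S", "C", "Q"],
})

# 'Embarked' becomes Embarked_C, Embarked_Q, Embarked_S; 'Sex' is unchanged.
one_hot_df = pd.get_dummies(df)

print(one_hot_df.columns.tolist())
```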

We are trying to predict survivors, so naturally our target feature is ‘Survived’. We will assign this to the variable ‘labels’. Now that we have our labels variable, we can drop the ‘Survived’ column from the rest of our data.
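Separating the target from the features might look like this, on a small stand-in for the encoded data.

```python
import pandas as pd

one_hot_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Keep the target separately, then remove it from the feature table.
labels = one_hot_df["Survived"]
one_hot_df = one_hot_df.drop(columns=["Survived"])

print(labels.tolist())
print(one_hot_df.columns.tolist())
```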

Now it’s time to split the data. We’ll use sklearn’s train_test_split for this. X is one_hot_df, y is labels, our test_size is 0.25 and our random_state is 42. Setting random_state makes the split, and therefore the results, reproducible.
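The split with these parameters can be sketched as follows; the eight rows are illustrative stand-ins for the real features and labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

one_hot_df = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 0, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 28.0, 54.0, 2.0, 27.0],
})
labels = pd.Series([0, 1, 1, 0, 0, 1, 0, 1], name="Survived")

# 25% of the rows are held out for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    one_hot_df, labels, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```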

Once our data is split, we need to scale it. Our features are not all on the same scale, and because KNN relies on distances, features with larger ranges would dominate the others and lead to an inaccurate model.
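The scaler is not named here, so this sketch assumes sklearn's StandardScaler. It is fit on the training data only and then applied to both sets, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative [Age, Fare] rows with very different ranges.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05], [54.0, 51.86]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```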

Next, import KNeighborsClassifier from sklearn, fit it on the training data, then predict on the test data.
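A minimal fit-and-predict sketch on two well-separated toy groups. n_neighbors=3 is chosen here only because the toy set is tiny; sklearn's default is 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two clearly separated groups of points (illustrative).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Each test point takes the majority label of its 3 nearest training points.
test_preds = clf.predict(X_test)
print(test_preds)
```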

Using sklearn’s metrics module, we can see the precision, accuracy, recall, and F1 score.
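The four scores come from sklearn.metrics. The labels and predictions below are made up to show the calls, not results from the Titanic model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_test = [0, 1, 1, 0, 1, 0]      # true labels (illustrative)
test_preds = [0, 1, 0, 0, 1, 1]  # model predictions (illustrative)

precision = precision_score(y_test, test_preds)
recall = recall_score(y_test, test_preds)
accuracy = accuracy_score(y_test, test_preds)
f1 = f1_score(y_test, test_preds)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"F1 score:  {f1:.3f}")
```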

Finally, we can tune our model for the best value of k. Using a function that iterates over the values of k between a minimum and a maximum, we can find out which value works best for the model.
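One possible shape for such a function is sketched below. The name find_best_k, its defaults, the choice to try only odd k (which avoids tied votes in binary classification), and the synthetic data from make_classification are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    """Return the odd k in [min_k, max_k] with the highest F1 score."""
    best_k, best_score = None, -1.0
    for k in range(min_k, max_k + 1, 2):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train, y_train)
        score = f1_score(y_test, clf.predict(X_test))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

best_k, best_score = find_best_k(X_train, y_train, X_test, y_test)
print(f"Best k: {best_k}, F1 score: {best_score:.3f}")
```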

So we see that a k-value of 15 leads to the highest f1 score and is the best value to use.
