Now you’re ready for sklearn

Like LeetCode but for ML - I should make that. There’d be minimum predictive performance requirements.


https://scikit-learn.org/stable/

What is scikit-learn?

Success scikit-learn

  • Abbreviated to sklearn
  • A Python library (package) containing lots of useful machine learning functionality.
  • sklearn gives you the ability to use any of the machine learning models covered in this section
  • You can literally train models and generate predictions in exactly 3 lines of code (sketched below).
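
As a minimal sketch of those 3 lines (assuming hypothetical X_train, y_train, and X_test already exist):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()    # 1. instantiate the model
model.fit(X_train, y_train)         # 2. train it on the training data
predictions = model.predict(X_test) # 3. generate predictions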


Can you explain the structure of the sklearn package?

Make sure you understand how Python libraries are structured by answering this question.

scikit-learn (sklearn)

├── base
├── cluster
│   ├── AffinityPropagation
│   ├── AgglomerativeClustering
│   ├── Birch
│   ├── DBSCAN
│   ├── KMeans
│   ├── MiniBatchKMeans
│   ├── OPTICS
│   ├── SpectralClustering
│   └── ...
├── datasets
├── decomposition
├── ensemble
│   ├── AdaBoostClassifier
│   ├── AdaBoostRegressor
│   ├── BaggingClassifier
│   ├── BaggingRegressor
│   ├── GradientBoostingClassifier
│   ├── GradientBoostingRegressor
│   ├── RandomForestClassifier
│   ├── RandomForestRegressor
│   └── ...
├── feature_extraction
├── feature_selection
├── linear_model
│   ├── LinearRegression
│   ├── LogisticRegression
│   ├── Ridge
│   ├── Lasso
│   └── ...
├── metrics
├── model_selection
│   ├── cross_val_score
│   ├── train_test_split
│   ├── GridSearchCV
│   ├── RandomizedSearchCV
│   └── ...
├── naive_bayes
├── neighbors
│   ├── KNeighborsClassifier
│   ├── KNeighborsRegressor
│   └── ...
├── neural_network
│   ├── MLPClassifier
│   ├── MLPRegressor
│   └── ...
├── pipeline
├── preprocessing
│   ├── StandardScaler
│   ├── MinMaxScaler
│   ├── LabelEncoder
│   ├── OneHotEncoder
│   └── ...
├── svm
│   ├── SVC
│   ├── SVR
│   ├── LinearSVC
│   ├── LinearSVR
│   └── ...
├── tree
│   ├── DecisionTreeClassifier
│   ├── DecisionTreeRegressor
│   └── ...
└── utils

↳ By referring to this hierarchy, can you please write an import statement for a Support Vector Classifier under the alias of AwesomeSVC, and then instantiate it?
from sklearn.svm import SVC as AwesomeSVC
 
my_classifier = AwesomeSVC() # then it can be instantiated like this

It’s a little bit obfuscated, but hopefully this clarifies the internal structure of sklearn.



In this question, we are going to test your sklearn competency. Your task is to train a random forest classifier to classify a Boba Tea as good or bad. First, attempt this Q/A thread: load the alex-boba.csv dataset using pandas.
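A minimal sketch of the loading step (assuming alex-boba.csv sits in your working directory):

import pandas as pd

boba = pd.read_csv('alex-boba.csv') # read the CSV into a DataFrame
boba.head()                         # peek at the first 5 rows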

After completing this question, you have a dataset that looks like this in a boba DataFrame object.

| Price ($) | Volume (ml) | Sugar Level (%) | Organic | Rating |
|-----------|-------------|-----------------|---------|--------|
| 4.74      | 399         | 1.9             | No      | Good   |
| 5.58      | 477         | 61.8            | Yes     | Bad    |
| 5.01      | 543         | 61.2            | No      | Bad    |
| 4.72      | 585         | 61.7            | Yes     | Bad    |
| 4.12      | 447         | 94.4            | Yes     | Bad    |

↳ Notice how in the rating column there are ‘good’ and ‘bad’ values. Instead of good/bad, can you replace it with a binary encoding of 0/1 using LabelEncoder()?
from sklearn.preprocessing import LabelEncoder
my_encoder = LabelEncoder() # Instantiate
boba['rating'] = my_encoder.fit_transform(boba['rating']) # Fit+Transform to the data

Success .fit()

  • The .fit() method is used for training or fitting the model to the given data. It computes the necessary parameters to apply a specific algorithm (like regression, classification, or clustering) to the data.
  • For instance, in machine learning models, .fit() will involve learning the weights on training data.
  • It does not transform or alter the data, it only learns from it.

Success .transform()

  • The .transform() method is used to apply a transformation or an action learned from the .fit() method to the data. It’s commonly used in preprocessing steps like scaling or normalizing data.
  • For example, after computing the mean and standard deviation of the features in the training set using .fit(), .transform() applies the scaling (subtracting the mean and dividing by the standard deviation) to any dataset to standardize features.

Success .fit_transform()

  • The .fit_transform() method combines the .fit() and .transform() methods into one step, because they are so commonly used together. It first fits the model with the data (learns parameters) and then transforms the data accordingly.
  • This is more efficient and often used in the preprocessing phase, like scaling or normalizing data, where it first learns the parameters (like the mean and standard deviation for scaling) and then applies the transformation to the dataset in one go.
  • Remember, something even as simple as a binary label encoder is still technically considered a ‘model’, thus it makes sense to use `.fit_transform()`.
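
To make the fit/transform distinction concrete, here is a small sketch using LabelEncoder on made-up labels:

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(['Good', 'Bad', 'Bad'])           # learns the classes (sorted): ['Bad', 'Good']
enc.transform(['Good', 'Bad'])            # applies that mapping -> array([1, 0])
enc.fit_transform(['Good', 'Bad', 'Bad']) # both steps in one go -> array([1, 0, 0])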


↳ Your next task is to split the boba dataset into two: the features and the response/target variable, represented by X and y respectively.

Success X and y

  • In machine learning, it is convention to represent the data as follows:
    • X: features
    • y: response variable
X = boba.drop('rating', axis=1) # returns a copy; boba itself is unchanged
y = boba['rating']              # so the 'rating' column is still available here

↳ Why do you need to specify axis=1?
  • The method .drop() will, by default, drop a row instead of a column.
  • In pandas, axis=1 refers to columns, whereas axis=0 (the default) refers to rows.
  • By setting axis=1, you’re instructing pandas to drop a column, not a row. If you do not specify axis=1, pandas will look for a row with the label ‘rating’ to drop, which is not the intended action in this case.


(This question follows on from the previous section)

Create a test/train split of the boba dataset with 20% test, and then explain every parameter of train_test_split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2904)
  1. X: The feature set of the dataset.
  2. y: The target variable.
  3. test_size=0.2: This parameter specifies the proportion of the dataset to include in the test split. 0.2 means 20% of the data will be used as the test set, and the remaining 80% will be the training set.
  4. random_state=2904: The split occurs randomly. This is a seed for the random number generator. Each time you run the code with this seed, you will get the same split of data.

↳ Why is it important that the test/train split is random, instead of just taking the last 20% of rows as the test set?
  • It’s crucial that the test/train split is random to ensure a representative and unbiased distribution of data between the training and test sets.
  • For example, consider a scenario where Alex initially visited less expensive boba shops and later, due to sponsorship, began visiting more boujee ones.
  • If the dataset were split sequentially with the last 20% as the test set, it would disproportionately represent boba from high-end shops.
  • This lack of randomness could lead to a biased model that does not generalize well across different types of boba shops.
  • Random splitting helps in creating more balanced and diverse subsets for training and testing, enhancing the model’s ability to perform consistently across varied data (see the sketch below).
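
As a sketch of the difference, using the X and y from earlier (shuffle=True is the default):

from sklearn.model_selection import train_test_split

# the default behaviour: rows are assigned to train/test at random
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2904)

# shuffle=False would instead take the last 20% of rows as the test set,
# which is exactly the biased sequential split described above
X_train_seq, X_test_seq, y_train_seq, y_test_seq = train_test_split(X, y, test_size=0.2, shuffle=False)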


Most models require standardized inputs. Why?

Standardized inputs

  • Many machine learning algorithms assume that all features are standardized, meaning they have mean = 0 and variance = 1.
  • If a feature’s variance is orders of magnitude more than the variance of other features, it might dominate the model’s learning, leading to less accurate predictions.

Standardized inputs also improve:

  • The rate of convergence for algorithms that use gradient descent as an optimization technique.
  • Balanced regularization: weight penalties are only applied fairly when features share a scale
  • Distance-based algorithms such as k-nearest neighbours
  • Principal component analysis
  • Interpretability

Overall, features that are all on the same scale tend to be healthier inputs across all aspects of a machine learning model.

This is an extremely important preprocessing step.

It ensures that each feature contributes equally to the model’s learning process, preventing features with larger scales from dominating how the model learns.

Otherwise, the model may bias towards features that have larger values (due to their units) but not necessarily more importance. For example, volume (ml) is always going to have higher values than cost ($), but cost might be a more powerful feature.


↳ How do we standardize data in sklearn?

Success StandardScaler()

  • From the module sklearn.preprocessing

↳ What is wrong with this code, which is used to standardize the X_train and X_test feature columns?
sc = StandardScaler() # from sklearn.preprocessing
sc.fit_transform(X_train)
sc.fit_transform(X_test)

You should have spotted the following things:

  1. .fit_transform is a method of the StandardScaler instance, which does not transform X_train in place, but rather returns the transformed data. Thus, when this method is called, the return value does not get assigned to anything.
  2. The line sc.fit_transform(X_test) is wrong. You should have identified this as data leakage. The test data must be standardized using the mean and variance values fit from the training data. This ensures that the model has no prior knowledge of the test data during training, and that when we evaluate the model using this test data, it gives a fair indication of how it will perform on new, unseen data.

The corrected code:

sc = StandardScaler() # from sklearn.preprocessing
X_train = sc.fit_transform(X_train) # fits the mean/variance from the training data
X_test = sc.transform(X_test) # scales using the previous mean/variance


Okay, I’m happy with how you’ve preprocessed the data. Your next step is to train a model. Please explain how to do this in sklearn.

Training a model in sklearn is essentially always the same 3 steps (please remember this):

  1. Model Instantiation:

    • Write MyModel = ModelClass(), where ModelClass is your chosen sklearn model, like LinearRegression or RandomForestClassifier. This step creates an instance of the model class, sort of like stamping out a working copy from its template.
    • Set hyper-parameters at instantiation, e.g., RandomForestClassifier(n_estimators=100).
  2. Fitting the Model:

    • Apply the .fit() method on the model instance with X_train and y_train as arguments. Syntax: model.fit(X_train, y_train).
    • This method adjusts internal parameters based on data, effectively training the model.
  3. Predicting with the Model:

    • Use .predict() for generating predictions, passing X_test as the argument: model.predict(X_test).
    • This method uses the learned parameters to make predictions on new data.


Okay, using those previous steps, train 3 different types of models, and explain how well suited each one is to the task.

Random Forest Classifier

rfc = RandomForestClassifier(n_estimators=200) # from sklearn.ensemble
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
  • RFCs have fewer moving parts to fine-tune
  • Usually start with a high n_estimators (number of trees), then bring it down and see how performance is affected.
  • Good for medium sized datasets

Support Vector Machine

svm = SVC() # from sklearn.svm
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
  • Good for small sized datasets
  • Works well out of the box once the data is cleaned and standardized

Multilayered Perceptron (Neural Network)

mlpc = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12), max_iter=500) # from sklearn.neural_network
mlpc.fit(X_train, y_train)
predictions = mlpc.predict(X_test)
  • Probably overkill for this task
  • Not enough data


We need to determine which model performed best. Can you explain and demonstrate some methods to evaluate model performance in scikit-learn?
  1. Accuracy Measurement:
  • Sometimes, for a shareholder presentation, you just need something simple.
  • For classification tasks, use accuracy_score. Syntax: accuracy_score(y_true, y_pred).
  • This function compares the actual labels (y_true) with the predicted labels (y_pred) and returns the proportion of correct predictions.
  2. Confusion Matrix:
  • Implement confusion_matrix. Syntax: confusion_matrix(y_true, y_pred).
  • Useful for classification tasks, it provides a detailed breakdown of correct and incorrect predictions for each class.
  3. Precision, Recall, and F1-Score:
  • Use precision_score, recall_score, and f1_score. Syntax: precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred).
  • These metrics provide insight into the balance between true positives, false positives, and false negatives. Particularly useful when dealing with imbalanced datasets.
  4. ROC Curve and AUC:
  • For binary classification, use roc_auc_score and RocCurveDisplay.from_estimator (the older plot_roc_curve was removed from recent versions of sklearn). Syntax: roc_auc_score(y_true, y_score), RocCurveDisplay.from_estimator(model, X_test, y_test).
  • These methods assess model performance based on the trade-off between true positive rate and false positive rate.
  5. Mean Squared Error (MSE) and Mean Absolute Error (MAE):
  • For regression tasks, use mean_squared_error and mean_absolute_error. Syntax: mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred).
  • MSE and MAE provide measures of model accuracy in predicting quantitative data.
  6. Cross-Validation:
  • Implement cross_val_score. Syntax: cross_val_score(model, X, y, cv=number_of_folds).
  • This function splits the data into a specified number of folds, training and evaluating the model on each fold. It provides a more robust assessment of model performance.
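
Putting a few of these together, here is a sketch that evaluates the random forest (rfc) from earlier:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

predictions = rfc.predict(X_test)                 # predictions from the trained model
print(accuracy_score(y_test, predictions))        # a single number, e.g. 0.93
print(confusion_matrix(y_test, predictions))      # rows = actual class, columns = predicted class
print(classification_report(y_test, predictions)) # per-class precision/recall/f1 plus support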

↳ Can you interpret, step by step, the following output of classification_report?
              precision    recall  f1-score   support
 
           0       0.93      0.99      0.96       279
           1       0.91      0.51      0.66        41
 
    accuracy                           0.93       320
   macro avg       0.92      0.75      0.81       320
weighted avg       0.93      0.93      0.92       320

Support is simply the number of occurrences of each class in the test dataset

The top section (by class)

  • The report shows metrics for two classes (labelled 0 and 1).
  • Class 0:
    • Precision: 0.93 - When the model predicts class 0, it is correct 93% of the time.
    • Recall: 0.99 - The model correctly identifies 99% of all actual class 0 instances.
    • F1-Score: 0.96 - A harmonic mean of precision and recall, indicating a high balance between the two for class 0.
    • Support: 279 - The number of actual occurrences of class 0 in the dataset.
  • Class 1:
    • Precision: 0.91 - When the model predicts class 1, it is correct 91% of the time.
    • Recall: 0.51 - The model correctly identifies 51% of all actual class 1 instances.
    • F1-Score: 0.66 - The F1-score for class 1 is lower, suggesting a lower balance between precision and recall for this class.
    • Support: 41 - The number of actual occurrences of class 1 in the dataset.

The bottom section (overall model performance)

  • Accuracy: 0.93 - Overall, the model correctly predicts the class 93% of the time across both classes.
  • Macro Average:
    • Calculated by taking the average of the precision, recall, and F1-score for each class without considering support.
    • Precision: 0.92, Recall: 0.75, F1-Score: 0.81.
  • Weighted Average:
    • Similar to the macro average, but it also accounts for the support of each class. This gives a better representation of performance on imbalanced datasets.
    • Precision: 0.93, Recall: 0.93, F1-Score: 0.92.
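
As a sanity check, you can recompute these averages from the per-class rows:

  • Macro precision = (0.93 + 0.91) / 2 = 0.92
  • Macro recall = (0.99 + 0.51) / 2 = 0.75
  • Weighted precision = (279 × 0.93 + 41 × 0.91) / 320 ≈ 0.93
  • Weighted recall = (279 × 0.99 + 41 × 0.51) / 320 ≈ 0.93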


How can you figure out which hyperparameters to use?

Hyperparameter Tuning: use GridSearchCV or RandomizedSearchCV for hyperparameter optimization. Both search over candidate hyperparameter values using cross-validation, as sketched below.
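
A minimal sketch, tuning the random forest from earlier (the grid values below are arbitrary examples, not recommendations):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200, 400], 'max_depth': [None, 5, 10]} # arbitrary candidates
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5) # tries every combination with 5-fold CV
grid.fit(X_train, y_train)
print(grid.best_params_) # the best-scoring combination
print(grid.best_score_)  # its mean cross-validated score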

How can you do more robust model validation? (I.e be more confident that you chose the correct model)
  • Cross-Validation: Implement cross-validation with cross_val_score or cross_validate for more robust model validation, as shown below.
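
For example, a sketch using the preprocessed X and y from earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(n_estimators=200), X, y, cv=5) # 5-fold cross-validation
print(scores.mean(), scores.std()) # average score across folds, and its spread
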
Alex has returned! He just visited a brand new boba store, Citadel Sips, and has given you all of his data (Price ($) | Volume (ml) | Sugar Level (%) | Organic ), but doesn’t want to tell you the rating yet. Using the model you have trained in sklearn, how can you predict what he rated the new boba?

Simply enter the data as one row, don’t forget to transform it, and then call model.predict()

new_data = [[8.99, 640, 33.3, 1]] # one row: Price, Volume, Sugar Level, Organic (numerically encoded)
new_data = sc.transform(new_data) # remember, the sc instance is still fitted on the training data
prediction = mlpc.predict(new_data)

P.S: Alex rated it a 9/10!!!