Now you’re ready for sklearn
Like LeetCode but for ML - I should make that. There would be minimum predictive performance requirements for each problem.
https://scikit-learn.org/stable/
What is scikit-learn?
scikit-learn
- Abbreviated to `sklearn`
- A Python library (package) containing lots of useful machine learning functionality
- `sklearn` gives you the ability to use any of the machine learning models covered in this section
- You can literally train models and generate predictions in exactly 3 lines of code
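For instance, a minimal sketch of that three-line pattern, assuming you already have training data `X_train`, `y_train` and new inputs `X_test` (the model choice here is illustrative):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()            # 1. instantiate
model.fit(X_train, y_train)           # 2. fit to training data
predictions = model.predict(X_test)   # 3. predict on new data
```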
More fun facts:
- Built on numpy, scipy, and matplotlib
- Open source, commercially usable under the BSD License
Can you explain the structure of the sklearn package?
Make sure you understand how Python libraries are structured by answering this question.
scikit-learn (sklearn)
│
├── base
├── cluster
│ ├── AffinityPropagation
│ ├── AgglomerativeClustering
│ ├── Birch
│ ├── DBSCAN
│ ├── KMeans
│ ├── MiniBatchKMeans
│ ├── OPTICS
│ ├── SpectralClustering
│ └── ...
├── datasets
├── decomposition
├── ensemble
│ ├── AdaBoostClassifier
│ ├── AdaBoostRegressor
│ ├── BaggingClassifier
│ ├── BaggingRegressor
│ ├── GradientBoostingClassifier
│ ├── GradientBoostingRegressor
│ ├── RandomForestClassifier
│ ├── RandomForestRegressor
│ └── ...
├── feature_extraction
├── feature_selection
├── linear_model
│ ├── LinearRegression
│ ├── LogisticRegression
│ ├── Ridge
│ ├── Lasso
│ └── ...
├── metrics
├── model_selection
│ ├── cross_val_score
│ ├── train_test_split
│ ├── GridSearchCV
│ ├── RandomizedSearchCV
│ └── ...
├── naive_bayes
├── neighbors
│ ├── KNeighborsClassifier
│ ├── KNeighborsRegressor
│ └── ...
├── neural_network
│ ├── MLPClassifier
│ ├── MLPRegressor
│ └── ...
├── pipeline
├── preprocessing
│ ├── StandardScaler
│ ├── MinMaxScaler
│ ├── LabelEncoder
│ ├── OneHotEncoder
│ └── ...
├── svm
│ ├── SVC
│ ├── SVR
│ ├── LinearSVC
│ ├── LinearSVR
│ └── ...
├── tree
│ ├── DecisionTreeClassifier
│ ├── DecisionTreeRegressor
│ └── ...
└── utils

↳ By referring to this hierarchy, can you please write an import statement for importing and instantiating a Support Vector Classifier under the alias of AwesomeSVC?
```python
from sklearn.svm import SVC as AwesomeSVC

my_classifier = AwesomeSVC()  # then it can be instantiated like this
```

It's a little bit obfuscated, but hopefully this clarifies the internal structure of sklearn.
In this question, we are going to test your sklearn competency. Your task is to train a random forest classifier to classify a Boba Tea as good or bad. First, attempt this Q/A thread to load in the alex-boba.csv dataset using pandas.
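A minimal sketch of that loading step, assuming `alex-boba.csv` sits in the working directory (the filename comes from the question; the columns match the table below):

```python
import pandas as pd

# Load Alex's boba ratings into a DataFrame
boba = pd.read_csv('alex-boba.csv')
boba.head()  # inspect the first few rows
```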
After completing this question, you have a dataset that looks like this in a boba DataFrame object.
| Price ($) | Volume (ml) | Sugar Level (%) | Organic | Rating |
|---|---|---|---|---|
| 4.74 | 399 | 1.9 | No | Good |
| 5.58 | 477 | 61.8 | Yes | Bad |
| 5.01 | 543 | 61.2 | No | Bad |
| 4.72 | 585 | 61.7 | Yes | Bad |
| 4.12 | 447 | 94.4 | Yes | Bad |
↳ Notice how in the rating column there are ‘good’ and ‘bad’ values. Instead of good/bad, can you replace it with a binary encoding of 0/1 using LabelEncoder()?
```python
from sklearn.preprocessing import LabelEncoder

my_encoder = LabelEncoder()  # Instantiate
boba['rating'] = my_encoder.fit_transform(boba['rating'])  # Fit + transform the data
```
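One detail worth checking: `LabelEncoder` assigns integer codes in sorted label order, which you can confirm via its `classes_` attribute (assuming the labels are 'Bad'/'Good' as in the table above):

```python
print(my_encoder.classes_)  # ['Bad' 'Good'] -> 'Bad' encoded as 0, 'Good' as 1
```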
`.fit()`
- The `.fit()` method is used for training or fitting the model to the given data. It computes the parameters needed to apply a specific algorithm (like regression, classification, or clustering) to the data.
- For instance, in machine learning models, `.fit()` will involve learning the weights on training data.
- It does not transform or alter the data; it only learns from it.
`.transform()`
- The `.transform()` method applies a transformation learned from the `.fit()` method to the data. It's commonly used in preprocessing steps like scaling or normalizing data.
- For example, after computing the mean and standard deviation of the training-set features using `.fit()`, `.transform()` applies the scaling (subtracting the mean and dividing by the standard deviation) to any dataset to standardize its features.
`.fit_transform()`
- The `.fit_transform()` method combines `.fit()` and `.transform()` into one step, because they are so commonly used together. It first fits the model to the data (learns the parameters) and then transforms the data accordingly. This is more efficient and often used in the preprocessing phase, like scaling or normalizing data: it learns the parameters (like the mean and standard deviation for scaling) and then applies the transformation to the dataset in one go.
- Remember, something even as simple as a binary label encoder is still technically considered a 'model', so it makes sense to use `.fit_transform()`.
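A quick sketch of the relationship between the three methods, using `StandardScaler` as the example (the data here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0]])

# Two-step version: learn the mean/std, then apply them
scaler_a = StandardScaler()
scaler_a.fit(data)                  # learns mean=2.0, std~0.816
scaled_a = scaler_a.transform(data)

# One-step version: fit and transform in a single call
scaler_b = StandardScaler()
scaled_b = scaler_b.fit_transform(data)

assert np.allclose(scaled_a, scaled_b)  # both produce the same result
```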
↳ Your next task is to split the boba dataset into two: the features and the response/target variable, represented by X and y respectively.
`X` and `y`
- In machine learning, it is convention to represent the data as follows:
  - `X`: features
  - `y`: response variable
```python
X = boba.drop('rating', axis=1)  # not in place
y = boba['rating']               # otherwise this wouldn't work
```

↳ Why do you need to specify axis=1?
- The method `.drop()` will, by default, drop a row instead of a column.
- In pandas, `axis=1` refers to columns, whereas `axis=0` (the default) refers to rows.
- By setting `axis=1`, you're instructing pandas to drop a column, not a row. If you do not specify `axis=1`, pandas will look for a *row* with the label 'rating' to drop, which is not the intended action here. A short demonstration follows below.
(This question follows on from the previous section)
Create a test/train split of the boba dataset with 20% test, and then explain every parameter of train_test_split.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2904)
```

Note that `train_test_split` returns the splits in the order `X_train, X_test, y_train, y_test`.

- `X`: The feature set of the dataset.
- `y`: The target variable.
- `test_size=0.2`: This parameter specifies the proportion of the dataset to include in the test split. `0.2` means 20% of the data will be used as the test set, and the remaining 80% will be the training set.
- `random_state=2904`: The split occurs randomly; this is a seed for the random number generator. Each time you run the code with this seed, you will get the same split of the data.
↳ Why is it important that the test/train split is random, instead of just taking the last 20% of rows as the test set?
- It’s crucial that the test/train split is random to ensure a representative and unbiased distribution of data between the training and test sets.
- For example, consider a scenario where Alex initially visited less expensive boba shops and later, due to sponsorship, began visiting more boujee ones.
- If the dataset were split sequentially with the last 20% as the test set, it would disproportionately represent boba from high-end shops.
- This lack of randomness could lead to a biased model that does not generalize well across different types of boba shops.
- Random splitting helps in creating more balanced and diverse subsets for training and testing, enhancing the model’s ability to perform consistently across varied data.
Most models require standardized inputs. Why?
Standardized inputs
- Many machine learning algorithms assume that all features are standardized, i.e. each feature has mean = 0 and variance = 1 (each value x is replaced by z = (x − μ) / σ).
- If a feature’s variance is orders of magnitude more than the variance of other features, it might dominate the model’s learning, leading to less accurate predictions.
Standardized inputs also improve:
- The rate of convergence for algorithms that use gradient descent as an optimization technique.
- Balanced regularization, since regularization penalizes large weights (features on larger scales would otherwise be penalized unevenly)
- Distance-based algorithms such as k-nearest neighbours
- Principal component analysis
- Interpretability
Overall, features that are all on the same scale tend to be healthier input when it comes to all aspects of a machine learning model.
This is an extremely important preprocessing step.
It ensures that each feature contributes equally to the model’s learning process, preventing features with larger scales from dominating how the model learns.
Otherwise, the model may bias towards features that have larger values (due to their units) but not necessarily more importance. For example, volume (ml) is always going to have higher values than cost ($), but cost might be a more powerful feature.
↳ How do we standardize data in sklearn?
`StandardScaler()`
- From the module `sklearn.preprocessing`
↳ What is wrong with this code, which is used to standardize the X_train and X_test feature columns?
```python
sc = StandardScaler()  # from sklearn.preprocessing
sc.fit_transform(X_train)
sc.fit_transform(X_test)
```

You should have spotted the following things:

- `.fit_transform` is a method of the `StandardScaler` instance which does not transform `X_train` in place, but rather returns the transformed data. Thus, when this method is called, the return value does not get assigned to anything.
- The line `sc.fit_transform(X_test)` is wrong. You should have identified this as data leakage. The test data must be standardized using the mean and variance values *fit* from the training data. This ensures that the model has no prior knowledge of the test data during training, and that when we evaluate the model using the test data, it gives a fair indication of how it will perform on new, unseen data.
The corrected code:
```python
sc = StandardScaler()                # from sklearn.preprocessing
X_train = sc.fit_transform(X_train)  # fits the mean/variance from the training data
X_test = sc.transform(X_test)        # scales using the previously fit mean/variance
```

Okay, I'm happy with how you've preprocessed the data. Your next step is to train a model. Please explain how to do this in sklearn.
Training a model in sklearn is essentially always the same 3 steps (please remember this):

1. Model Instantiation:
   - Write `my_model = ModelClass()`, where `ModelClass` is your chosen `sklearn` model, like `LinearRegression` or `RandomForestClassifier`. This step creates an instance of the model class, a bit like copying it from its template form.
   - Set hyperparameters at instantiation, e.g., `RandomForestClassifier(n_estimators=100)`.
2. Fitting the Model:
   - Apply the `.fit()` method on the model instance with `X_train` and `y_train` as arguments. Syntax: `model.fit(X_train, y_train)`.
   - This method adjusts internal parameters based on the data, effectively training the model.
3. Predicting with the Model:
   - Use `.predict()` to generate predictions, passing `X_test` as the argument: `model.predict(X_test)`.
   - This method uses the learned parameters to make predictions on new data.
Okay, using those previous steps, train 3 different types of models, and explain how well suited each one is to the task.
Random Forest Classifier

```python
rfc = RandomForestClassifier(n_estimators=200)  # from sklearn.ensemble
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
```

- RFCs have relatively few moving parts to fine-tune
- Usually start with a high `n_estimators` (trees), then bring it down and see how performance is affected
- Good for medium-sized datasets
Support Vector Machine

```python
svm = SVC()  # from sklearn.svm
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)
```

- Good for small-sized datasets
- Works best when the data is already cleaned and scaled (SVMs are sensitive to feature scale)
Multilayer Perceptron (Neural Network)

```python
mlpc = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12), max_iter=500)  # from sklearn.neural_network
mlpc.fit(X_train, y_train)
predictions = mlpc.predict(X_test)
```

- Probably overkill for this task
- Not enough data
We need to determine which model performed best. Can you explain and demonstrate some methods to evaluate model performance in scikit-learn?
- Accuracy Measurement:
  - Sometimes, for stakeholder presentations, you just need something simple (see the sketch after this list).
  - For classification tasks, use `accuracy_score`. Syntax: `accuracy_score(y_true, y_pred)`.
  - This function compares the actual labels (`y_true`) with the predicted labels (`y_pred`) and returns the proportion of correct predictions.
- Confusion Matrix:
  - Implement `confusion_matrix`. Syntax: `confusion_matrix(y_true, y_pred)`.
  - Useful for classification tasks; it provides a detailed breakdown of correct and incorrect predictions for each class.
- Precision, Recall, and F1-Score:
  - Use `precision_score`, `recall_score`, and `f1_score`. Syntax: `precision_score(y_true, y_pred)`, `recall_score(y_true, y_pred)`, `f1_score(y_true, y_pred)`.
  - These metrics provide insight into the balance between true positives, false positives, and false negatives. Particularly useful when dealing with imbalanced datasets.
- ROC Curve and AUC:
  - For binary classification, use `roc_auc_score` and `RocCurveDisplay` (the older `plot_roc_curve` helper was removed in scikit-learn 1.2). Syntax: `roc_auc_score(y_true, y_score)`, `RocCurveDisplay.from_estimator(model, X_test, y_test)`.
  - These methods assess model performance based on the trade-off between the true positive rate and the false positive rate.
- Mean Squared Error (MSE) and Mean Absolute Error (MAE):
  - For regression tasks, use `mean_squared_error` and `mean_absolute_error`. Syntax: `mean_squared_error(y_true, y_pred)`, `mean_absolute_error(y_true, y_pred)`.
  - MSE and MAE provide measures of model accuracy in predicting quantitative data.
- Cross-Validation:
  - Implement `cross_val_score`. Syntax: `cross_val_score(model, X, y, cv=number_of_folds)`.
  - This function splits the data into a specified number of folds, training and evaluating the model on each fold. It provides a more robust assessment of model performance.
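A minimal sketch tying a few of these together, assuming the `rfc` model and the train/test split from earlier:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

predictions = rfc.predict(X_test)

print(accuracy_score(y_test, predictions))          # proportion of correct predictions
print(confusion_matrix(y_test, predictions))        # rows = actual class, columns = predicted class
print(classification_report(y_test, predictions))   # per-class precision/recall/F1, plus averages
```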
↳ Can you, step by step, interpret the following output of `classification_report`?
```
              precision    recall  f1-score   support

           0       0.93      0.99      0.96       279
           1       0.91      0.51      0.66        41

    accuracy                           0.93       320
   macro avg       0.92      0.75      0.81       320
weighted avg       0.93      0.93      0.92       320
```

Support is simply the number of occurrences of each class in the test dataset.
The top section (by class)
- The report shows metrics for two classes (labelled 0 and 1).
- Class 0:
- Precision: 0.93 - When the model predicts class 0, it is correct 93% of the time.
- Recall: 0.99 - The model correctly identifies 99% of all actual class 0 instances.
- F1-Score: 0.96 - A harmonic mean of precision and recall, indicating a high balance between the two for class 0.
- Support: 279 - The number of actual occurrences of class 0 in the dataset.
- Class 1:
- Precision: 0.91 - When the model predicts class 1, it is correct 91% of the time.
- Recall: 0.51 - The model correctly identifies 51% of all actual class 1 instances.
- F1-Score: 0.66 - The F1-score for class 1 is lower, suggesting a lower balance between precision and recall for this class.
- Support: 41 - The number of actual occurrences of class 1 in the dataset.
The bottom section (overall model performance)
- Accuracy: 0.93 - Overall, the model correctly predicts the class 93% of the time across both classes.
- Macro Average:
- Calculated by taking the average of the precision, recall, and F1-score for each class without considering support.
- Precision: 0.92, Recall: 0.75, F1-Score: 0.81.
- Weighted Average:
- Similar to the macro average, but it also accounts for the support of each class. This gives a better representation of performance on imbalanced datasets.
- Precision: 0.93, Recall: 0.93, F1-Score: 0.92.
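As a quick sanity check of the two averaging schemes, using the precision column of the report above:

- macro avg precision = (0.93 + 0.91) / 2 = 0.92
- weighted avg precision = (279 × 0.93 + 41 × 0.91) / 320 ≈ 0.93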
How can you figure out which hyperparameters to use?

- Hyperparameter Tuning: Use `GridSearchCV` or `RandomizedSearchCV` for hyperparameter optimization, as sketched below.
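A minimal sketch of a grid search over the random forest from earlier; the parameter grid here is illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],  # candidate values to try
    'max_depth': [None, 5, 10],
}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)  # 5-fold CV per combination
grid.fit(X_train, y_train)

print(grid.best_params_)  # the best hyperparameter combination found
print(grid.best_score_)   # its mean cross-validated score
```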
How can you do more robust model validation? (i.e. be more confident that you chose the correct model)

- Cross-Validation: Implement cross-validation with `cross_val_score` or `cross_validate` for more robust model validation, as sketched below.
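A minimal sketch, assuming the `rfc` model from earlier and that the features in `X` are fully numeric (e.g. `Organic` already encoded):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fit and score the model on 5 different splits
scores = cross_val_score(rfc, X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the folds
```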
Alex has returned! He just visited a brand new boba store, Citadel Sips, and has given you all of his data (Price ($) | Volume (ml) | Sugar Level (%) | Organic ), but doesn’t want to tell you the rating yet. Using the model you have trained in sklearn, how can you predict what he rated the new boba?
Simply enter the data as one row, don't forget to transform it, and then call `model.predict()`:

```python
new_data = [[8.99, 640, 33.3, 1]]   # Organic as 1 (presumably 'Yes'), matching the training encoding
new_data = sc.transform(new_data)   # remember, the sc instance is still fit from the training data
prediction = mlpc.predict(new_data)
```

P.S. Alex rated it a 9/10!!!