These questions were curated from the application process of large software development companies such as Google, Atlassian, Canva, and Amazon.
For Data Science refer to 2.4 pandas
- How to maintain in deployment
- Cross validation models (from the 42 + stat question)
- Logistic
- SVM
- Linear Regression +
- Grid Search
- Regularisation
  - Lasso
  - Ridge

What is machine learning?
Machine learning is autocorrect. Think of anything that would be hard to explicitly program… sometimes you don’t realise it’s hard until you try.
These sections will take you through the terminology and fundamental building blocks that underpin the models which follow, and which you must understand before exploring any of them. This is the language that will allow you to become fluent in machine learning, no matter which of the following models we’re discussing. That said, these concepts are sometimes hard to understand without reference to a specific example model.
Training/Testing
- the real meaning of ‘training’
- ‘train-test split’
- More precisely, training = estimating the model’s parameters
- More precisely, testing = evaluating the fitted model on unseen (held-out) data
Cross validation
Confusion matrix
The data (just covered)
Bias/Variance tradeoff
Loss function
Inference - a general concept. You may hear the term ‘running inference on a model’ that has just been trained. It essentially just means, having trained the model, using it to generate classifications or predictions.
Explainability
Corpus
- ChatGPT ‘predicts’ the next best word.
ROC curve
AUC (area under the ROC curve)
Functions
- Log function
Prob and Stats basics (normal distribution, basic stats)
- A/B testing
- Accuracy
- Binary classification
- Data augmentation
- Implicit bias
- Multi-class classification
- TN, TP, TPR
- Validation set
- Variable importance
- Few-shot learning
Generalization: In the context of machine learning, the core objective is to build predictive models that perform well not just on the data that they were trained on, but also on new, unseen data. This is known as generalization.
(1.1) Easy Questions
(1.3) Normal Distribution
What is a parameter?
A parameter is a weight learnt by the model during training.
What is data pre-processing?
Preprocessing is all the steps that need to happen before the data can be ‘fed’ to the model.
What is data imputation?
Standardisation vs. Normalisation
↳ What are some different ways to impute missing data?
Q2) What is a function?
☞ Function: A function is a machine that takes in an input and maps it to an output.
If you have a reasonable understanding of machine learning, you can appreciate that all ‘machine learning models’ are really just abstracted ‘functions’ which take in data and spit out a result.
- The same way the function ( f(x) = x^2 ) takes in the number ( x ), and returns the number squared
- Or the function below takes in a list, and returns the list sorted:

def sort_list(my_list):
    return sorted(my_list)

- A machine learning model takes in data and returns an answer, but the process of going from input to output is often much more complicated than a simple one-line function.
- You should really think about most machine learning models as plain old functions
Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it’s worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function (actually, computing one of many functions, since there are often many acceptable translations of a given piece of text). Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation (ditto the remark about translation and there being many possible functions). Universality means that, in principle, neural networks can do all these things and many more.
A vector valued function (also called a vector function) is a function (not a vector) that outputs a vector, as opposed to a scalar or real value.
| Function | Input | → | Output |
|---|---|---|---|
| f | x | → | f(x) |
| ChatGPT | Prompt | → | Response |
- Plain and simple.
- The better the data represents the problem, the better the results.
- ‘You are what you eat.’
- The model is trained on the data. If the data is fundamentally wrong, the model may still do the ‘right’ thing - it may learn the data perfectly - but its output will also be wrong, because the data it learned from was wrong.
- Remember how the model uses data to estimate its parameters so that it can adjust itself to perform well on unseen data? Clearly, bad data will lead to poor estimates of these parameters. The data should be clean and representative of the unseen data that will be fed into the model; if you’re trying to train a model to recognise dog breeds, you wouldn’t feed it pictures of trains, would you?
Please show me, with your hands, what the graphs of ( e^x ) and ( \ln(x) ) look like.
Firstly, they are inverses of each other (reflected in the diagonal line ( y = x )).
These two functions are fundamental to data science, machine learning, and deep learning. They are used relentlessly in statistics and probability - you basically cannot do either without them. Therefore it’s important to have a solid grasp of how they behave.
They’re used in softmax, mutual information, and many other places.
Why are e^x and the natural log so common in probability and machine learning? What is so special about e?
The number ( e ) (approximately 2.71828) and its associated functions, ( e^x ) (the exponential function) and the natural logarithm ( \ln(x) ), are fundamental in probability and machine learning for several reasons:
1. Continuous Growth and Decay
- Exponential Growth: The exponential function ( e^x ) is the mathematical representation of continuous growth. When something grows at a rate proportional to its current size (like interest in banking or population growth), it’s described by ( e^x ).
- Continuous Compounding: In finance, ( e ) arises from the limit of compound interest as the number of compounding periods becomes infinitely large.
2. Mathematical Properties
- Differentiation and Integration: One of the most remarkable properties of ( e^x ) is that its derivative is itself. This makes calculations involving differential equations simpler, which is essential in many machine learning algorithms, especially those involving continuous optimization.
- Simplifies Calculations: The natural logarithm ( \ln(x) ) is the inverse of the exponential function. It transforms multiplicative relationships into additive ones, simplifying many calculations (a concrete sketch follows at the end of this section).
3. Probability and Statistics
- Normal Distribution: In statistics, the normal (Gaussian) distribution, a fundamental probability distribution, has ( e ) in its equation. This distribution is pivotal in many statistical methods and machine learning algorithms.
- Information Theory: In information theory, entropy (a measure of uncertainty or surprise) is often calculated using the natural logarithm. This choice is partly because the natural logarithm measures the number of e-folding times needed to reach a certain level of growth.
4. Machine Learning Algorithms
- Log-Likelihood: Many machine learning models (like logistic regression) use the natural logarithm in the log-likelihood function, which is maximized to fit the model to the data.
- Activation Functions: Exponential functions are used in certain activation functions in neural networks, such as the softmax function, which generalizes the logistic function to multiple classes.
5. Universality of e
- Ubiquitous in Nature: The number ( e ) appears in many natural phenomena (such as patterns of growth and decay, the distribution of prime numbers, etc.), making it a natural choice for models that attempt to mimic or understand natural processes.
Conclusion
The prevalence of ( e ) and its related functions in probability and machine learning is due to their unique mathematical properties, their natural appearance in growth/decay processes, their fundamental role in statistics and information theory, and their utility in simplifying complex mathematical expressions. These functions provide a powerful toolkit for modeling a wide range of phenomena in machine learning and beyond.
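The notes above say that the natural log turns products into sums (Mathematical Properties) and that models maximise a log-likelihood (Machine Learning Algorithms). Here is a minimal sketch, not from the original notes and using invented probabilities, of why this matters numerically:

```python
import numpy as np

# Multiplying many small probabilities underflows to 0.0, while summing their
# logs stays finite and numerically stable.
rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.1, size=1000)   # 1000 small, made-up probabilities

naive_likelihood = np.prod(probs)           # underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))      # a finite, large negative number

print(naive_likelihood, log_likelihood)
```

This is one concrete reason models like logistic regression maximise the log-likelihood rather than the raw likelihood.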
QX) What are the two types of data?
Structured
Unstructured
A good proxy for how ‘structured’ data is: how easy would it be to represent in a table? Very hard for images/videos, very easy for a company’s expenses.
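A hypothetical illustration of the ‘can it fit in a table?’ test (the expense figures and the image are invented, not from the notes):

```python
import numpy as np
import pandas as pd

# Structured: a company's expenses drop naturally into rows and columns.
expenses = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-07"],
    "category": ["travel", "software"],
    "amount_aud": [412.50, 99.00],
})

# Unstructured: a raw image is just a grid of pixel intensities, with no
# columns that mean anything on their own.
image = np.random.randint(0, 256, size=(1440, 1440, 3), dtype=np.uint8)

print(expenses)
print(image.shape)
```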
QX) What is metadata with an example?
☞ Metadata is data that describes other data.
Example
When you take a photo on your phone, the details that come up when you press (ℹ︎) are metadata describing the main data - the image.
Time_taken: 2:03:29am
Location: The White House
Size: 16MP
Dimensions: 1440x1440
Metadata is relevant to machine learning because it provides context, aids in data preprocessing, enhances data quality, and can improve model interpretability and performance.
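As a small sketch (reusing the hypothetical values from the example above), you can think of metadata as a dictionary that rides alongside the data itself:

```python
import numpy as np

photo = np.zeros((1440, 1440, 3), dtype=np.uint8)   # the data: the image pixels

photo_metadata = {                                   # the metadata: data about the data
    "time_taken": "2:03:29am",
    "location": "The White House",
    "dimensions": photo.shape[:2],
}
print(photo_metadata)
```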
Q1) Explain the concept of ‘garbage in, garbage out’ and motivate it with an example.
Q1) What is a feature?
Feature engineering is one of a Data Scientist’s jobs. What is it?
The features must contain some kind of signal that is valuable to the algorithm for its classification/prediction
So the raw data must be transformed into new features that better represent the underlying problem.
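A hedged sketch of feature engineering (the dataset, column names, and the implied sales-prediction task are all invented for illustration): raw timestamps carry little direct signal, but features derived from them often do.

```python
import pandas as pd

df = pd.DataFrame({"sale_time": pd.to_datetime([
    "2023-01-02 09:15", "2023-01-07 18:40", "2023-01-08 12:05",
])})

# Transform the raw timestamp into features that better represent the problem.
df["day_of_week"] = df["sale_time"].dt.dayofweek      # 0 = Monday
df["is_weekend"] = df["day_of_week"].isin([5, 6])     # weekend behaviour differs
df["hour"] = df["sale_time"].dt.hour                  # morning vs. evening traffic
print(df)
```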
What is the typical sk-learn pipeline for implementing a machine learning model?
Please read up on sklearn if you don’t already know it. Knowing the five steps will give you access to every machine learning model. It literally covers 50% of your machine learning career - implementation - the other 50% being how it all actually works.
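The notes don’t spell the five steps out here, so the following is only a common version of the workflow (split, instantiate, fit, predict, evaluate), sketched on sklearn’s built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate the model
model = LogisticRegression(max_iter=1000)

# 3. Fit (train) on the training set
model.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = model.predict(X_test)

# 5. Evaluate
print(accuracy_score(y_test, y_pred))
```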
In statistics and machine learning, what do we mean by random noise?
- Random noise in statistics and machine learning is the term used for the unpredictable and irregular fluctuations that are present in all datasets. It’s like the static you hear on the radio or the grain you see on a photograph, which doesn’t come from the actual music or the image itself, but from random disturbances that can’t be controlled. In data, noise could come from measurement errors, slight variations in the process being measured, or other unknown factors that cause the data to vary even under identical conditions.
- For instance, if you’re measuring the height of a plant every day to track its growth, factors like slight differences in how you place the ruler, small tremors in your hand, or even changes in room temperature that slightly expand or contract the ruler could introduce noise to your measurements. This noise is not related to the actual growth of the plant, but it still affects your data.
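A minimal sketch of the plant-height example above (the growth rate and noise level are invented): the ‘true’ growth is a clean trend, and the measurements add random noise on top of it.

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(30)
true_height_cm = 5.0 + 0.3 * days                                    # the signal: steady growth
measured_cm = true_height_cm + rng.normal(0, 0.2, size=days.size)    # signal + measurement noise

print(measured_cm[:5])   # wobbles around the true values without matching them exactly
```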
How should machine learning algorithms interpret noise?
In machine learning
- We want our algorithms to learn from the real patterns in the data (the signal) and ignore this noise. However, it’s often challenging because noise can mask the true patterns we’re interested in.
- A bad model may accidentally overfit, mistaking the random noise for the underlying pattern, and thus making wrong predictions later down the line.
- A good model will be able to distinguish between the signal and the noise, focusing on the true patterns that help make accurate predictions or decisions.
What is overfitting?
A model that is overfit has low bias (good) and high variance (bad). This happens when the model learns noise and fluctuations in the training data as if they were significant features. It thus performs poorly on real, unseen data.
In the context of machine learning models, overfitting is when the model ‘learns’ the training data too well, resulting in almost no bias, but high variance.
Let’s say Tao is studying (learning) for an upcoming exam.
- Memorizing past exam papers is akin to a model that overfits to its training data.
- New questions on the exam represent new, unseen data for a predictive model.
- Genuine understanding of the material is similar to a well-generalized model that can perform well on both training and unseen data.
Follow up: what would an overfit model look like in this analogy?
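A minimal sketch of overfitting (not from the notes; the data is synthetic and the degree-15 polynomial is just one stand-in for an overly flexible model): the flexible model chases noise in a small training set and typically does worse on held-out data than a simple straight line.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(0, 0.2, 20)   # true signal is linear, plus noise
X_test = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_test = 2 * X_test.ravel() + rng.normal(0, 0.2, 20)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_test, model.predict(X_test)), 3))
# Typically: degree 15 achieves the lower training error but the higher test error.
```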
What is underfitting?
MEGATHREAD: What evaluation metrics do we use to evaluate the effectiveness of a machine learning model?
☞ In statistics we are usually evaluating ‘estimators’ of population parameters; e.g. the population mean, or the true regression line. ☞ In machine learning we evaluate entire models, which estimate the whole input-to-output relationship rather than a single parameter.
Precision
The % of positive classifications that were correct.
Accuracy
The % of all classifications (positive and negative) that were correct.
Recall
The % of actual positives that were correctly identified.
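To make the relationships concrete, here is a quick sketch with made-up labels (not from the notes), using sklearn’s metric functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # fraction of all predictions that were correct
print(precision_score(y_true, y_pred))    # of the predicted positives, the fraction correct
print(recall_score(y_true, y_pred))       # of the actual positives, the fraction found
```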
What is the difference between online and offline metrics?
Online metrics are used to see the system’s performance through online evaluations on live data during an A/B test. Offline metrics are used in offline evaluations, which simulate the model’s performance in the production environment.
What is a loss function? Heuristically and holistically, what is the general idea behind a loss function?
What metrics do we use to evaluate the effectiveness of a statistical estimator?
- Bias
- Variance
- Efficiency
- Consistency
- Precision
Most statistical models are also classified as machine learning models (e.g Linear Regression) so you can see how quickly things get confusing!
Every machine learning model learns to get better by comparing its output to the target using an error/loss function.
- For classification problems (categorical targets)
- Accuracy (how many did you get right)
- For regression problems (numeric targets)
- Mean squared error
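A hedged sketch (with made-up numbers) of the two styles of evaluation mentioned above:

```python
import numpy as np

# Regression: mean squared error penalises how far predictions land from the truth.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 9.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # (0.25 + 0.25 + 4.0) / 3 = 1.5

# Classification: accuracy counts the fraction of labels predicted correctly.
labels = np.array([1, 0, 1, 1])
preds = np.array([1, 0, 0, 1])
print(np.mean(labels == preds))   # 0.75
```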
The Confusion Matrix
Cross Validation
Bias/Variance tradeoff
- Bias: measures the model’s tendency to miss the true pattern; a high-bias model is too simple and underfits the training data
- Variance: measures how much the model’s predictions change when it is trained on different samples of data; a high-variance model tends to overfit (graph)
How is this different to statistical bias/variance?
In statistics, an estimator (model) is unbiased if, in the long run, its average value is equal to the true underlying value.
An estimator’s variance measures how much its results vary from sample to sample.
What is inference?
☞ 𝐌𝐋 inference is basically just the act of actually using the trained model on new, unseen data. It is the post-training step where the model applies its learned knowledge to provide outputs for inputs it hasn’t encountered during training. Essentially, it’s where the model is put to work in practical applications, such as predicting outcomes, classifying data, or generating content.
Statistical inference is the process of drawing conclusions about a larger population or underlying data process based on a sample from that population. E.g. what is the mean weight of boys at your school, based on the mean weight of boys in your class?
QX) Explain a machine learning project you have worked on in the past. Why did you choose that specific model and how did you prepare your data?
- A: I worked on a sentiment analysis project to determine customer feedback polarity. I chose the Naive Bayes model due to its efficiency with textual data. For data preparation, I tokenized the reviews, removed stopwords, and used TF-IDF for feature extraction.
What is data leakage in cross validation?
Data: dirty data, high-variance data, garbage in garbage out, signal.
Train vs. Validate vs. Test
What is a class in ML?
Class
A class is another name for a category.
Classification
A datapoint’s classification is the category that it belongs to.
What are the different types of classification problems in machine learning?
There are three main types: binary classification (two classes), multi-class classification (more than two classes, one label per datapoint), and multi-label classification (each datapoint can belong to several classes at once).
What is a ‘dimension’ of data in ML?
A degree of freedom which specifies the value of the datapoint (think: how many different numbers would you need to fully specify the data) - you couldn’t specify the shape of a phone with just one number; you would need height and width.
A loose proxy: just think of the number of columns in the dataset.
↳ Can you give an example of N-dimensional data?
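One hedged example (the measurements are invented): each phone below is a 4-dimensional datapoint, because four numbers are needed to fully specify it.

```python
import pandas as pd

phones = pd.DataFrame({
    "height_mm": [146.7, 160.8],
    "width_mm": [71.5, 78.1],
    "depth_mm": [7.4, 7.65],
    "weight_g": [172, 240],
})
print(phones.shape)   # (2, 4): 2 datapoints, each described by 4 dimensions (columns)
```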
What is a hyperparameter?
Hyperparameter
- Hyperparameters are not learnt from the data, but chosen before the learning process begins.
- High level ‘settings’ of a model
- Best understood via examples:
- The number of layers in a neural network
- The ‘k’ in KNN.
- The kernel type in SVMs.
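A short sketch (not from the notes; the particular values 5, 'rbf', and 1.0 are arbitrary) of how these hyperparameters are set in sklearn, chosen up front rather than learnt from the data:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=5)   # the 'k' in KNN
svm = SVC(kernel="rbf", C=1.0)              # the kernel type (and C) in SVMs

# The learned weights / support vectors, by contrast, are parameters:
# they only exist after calling .fit(X, y) on training data.
```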
What is a validation set?
- Different to train and test set. It is derived from a division of the training set.
To be very clear:
- training set
- validation set (test against, then adjust model or hyperparameters)
- testing set (finally, used once at the end for an unbiased estimate of performance)
↳ Why couldn’t you just use the test set for this?
The use of a validation set is crucial for model selection because it can help to estimate the model’s performance and select the best model and hyperparameters without using the test set. The test set should only be used once to avoid biasing the model performance metrics.
By this, I mean that if you adjusted the hyperparameters based on the test set, you would simply be optimizing the setup to perform best on that particular test set, which may not be representative of the true data, and thus you have the same issue again.
The test set should only be used once because using it multiple times during the model selection and tuning process can lead to “leakage” where information about the test set leaks into the model. This can cause you to inadvertently tune your model to perform well on the test set, rather than on new, unseen data.
↳ What are the three main ways of creating the validation set?
- Holdout Method: Simply dividing your dataset into separate training, validation, and test sets.
- K-Fold Cross-Validation: Dividing your dataset into K parts, using K-1 parts for training and the remaining part for validation. This is repeated K times with a different part used as the validation set each time.
- Leave-One-Out Cross-Validation: A special case of cross-validation where the number of folds equals the number of instances in the dataset, essentially using all but one instance for training and the remaining instance for validation, repeated for each instance in the dataset.
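A hedged sketch of the first two approaches in sklearn (the iris dataset and logistic regression are just stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Holdout method: carve a validation set out of the training data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross-validation: 5 folds, each taking a turn as the validation set.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())
```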