These questions were curated from the application process of large software development companies such as Google, Atlassian, Canva, and Amazon.
For Data Science refer to 2.4 pandas
- How to maintain in deployment
- Cross validation models (from the 42 + stat question)
- Logistic
- SVM
- Linear Regression +
- Grid Search
- Regularisation
  - Lasso
  - Ridge

What is machine learning?
Machine learning is autocorrect. Think of anything that would be hard to explicitly program… sometimes you don’t realise it’s hard until you try.
These sections will take you through the terminology and fundamental building blocks that underpin the models which follow, and which you must understand before exploring any of them. This is the language that will allow you to become fluent in machine learning, no matter which of the following models we’re discussing. That said, these concepts are sometimes hard to understand without reference to a specific example model.
Training/Testing
- the real meaning of ‘training’
- ‘train-test split’
- More precisely, training = estimating the model’s parameters
- More precisely, testing = evaluating the fitted model on unseen (held-out) data
Cross validation
Confusion matrix
The data (just covered)
Bias/Variance tradeoff
Loss function
Inference - a general concept. You may hear the term ‘running inference on a model’ that has just been trained. It essentially just means, having trained the model, using it to generate classifications or predictions.
Explainability
Corpus
- ChatGPT ‘predicts’ the next best word.
ROC curve
AUC (area under the ROC curve)
Functions
- Log function
Prob and Stats basics (normal distribution, basic stats)
- A/B testing
- Accuracy
- Binary classification
- Data augmentation
- Implicit bias
- Multi-class classification
- TN, TP, TPR
- Validation set
- Variable importance
- Few-shot learning
Generalization: In the context of machine learning, the core objective is to build predictive models that perform well not just on the data that they were trained on, but also on new, unseen data. This is known as generalization.
(1.1) Easy Questions
(1.3) Normal Distribution
What is a parameter?
A parameter is a weight learnt by the model during training.
What is data pre-processing?
Preprocessing is all the steps that need to happen before the data can be ‘fed’ to the model.
What is data imputation?
Standardisation vs. Normalisation
↳ What are some different ways to impute missing data?
Q2) What is a function?
☞ Function: A function is a machine that takes in an input and maps it to an output.
If you have a reasonable understanding of machine learning, you can appreciate that all ‘machine learning models’ are really just abstracted ‘functions’ which take in data and spit out a result.
- The same way the function ( f(x) = x^2 ) takes in the number ( x ), and returns the number squared
- Or the function below takes in a list, and returns the list sorted:

def sort_list(my_list):
    return sorted(my_list)

- A machine learning model takes in data and returns an answer, but the process of going from input to output is often much more complicated than a simple one-line function.
- You should really think about most machine learning models as plain old functions
Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it’s worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function (actually, computing one of many functions, since there are often many acceptable translations of a given piece of text). Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation (ditto the remark about translation and there being many possible functions). Universality means that, in principle, neural networks can do all these things and many more.
A vector valued function (also called a vector function) is a function (not a vector) that outputs a vector, as opposed to a scalar or real value.
| Function | Input | → | Output |
|---|---|---|---|
| f | x | → | f(x) |
| ChatGPT | Prompt | → | Response |
- Plain and simple.
- The better the data represents the problem, the better the results.
- ‘You are what you eat.’
- The model is trained on the data. If the data is fundamentally wrong, the model may still do the ‘right’ thing - it may learn the data perfectly - but its output will also be wrong, because the data it learned from was wrong.
- Remember how the model uses data to estimate its parameters so that it can adjust itself to perform well on unseen data? Clearly, bad data will lead to poor estimates of these parameters. The data should be clean and representative of the unseen data that will be fed into the model; if you’re trying to train a model to recognise dog breeds, you wouldn’t feed it pictures of trains, would you?
Please show me, with your hands, what the graphs of ( e^x ) and ( \ln(x) ) look like.
Firstly, they are inverses of each other (reflected in the diagonal line ( y = x )).
These two functions are fundamental to data science, machine learning, and deep learning. They are used relentlessly in statistics and probability - you basically cannot do either without them. Therefore it’s important to have a solid grasp of how they behave.
They’re used in softmax, mutual information, and many other places.
Why are e^x and the natural log so common in probability and machine learning? What is so special about e?
The number ( e ) (approximately 2.71828) and its associated functions, ( e^x ) (the exponential function) and the natural logarithm ( \ln(x) ), are fundamental in probability and machine learning for several reasons:
1. Continuous Growth and Decay
- Exponential Growth: The exponential function ( e^x ) is the mathematical representation of continuous growth. When something grows at a rate proportional to its current size (like interest in banking or population growth), it’s described by ( e^x ).
- Continuous Compounding: In finance, ( e ) arises from the limit of compound interest as the number of compounding periods becomes infinitely large.
2. Mathematical Properties
- Differentiation and Integration: One of the most remarkable properties of ( e^x ) is that its derivative is itself. This makes calculations involving differential equations simpler, which is essential in many machine learning algorithms, especially those involving continuous optimization.
- Simplifies Calculations: The natural logarithm ( \ln(x) ) is the inverse of the exponential function. It transforms multiplicative relationships into additive ones, simplifying many calculations (a concrete sketch follows at the end of this section).
3. Probability and Statistics
- Normal Distribution: In statistics, the normal (Gaussian) distribution, a fundamental probability distribution, has ( e ) in its equation. This distribution is pivotal in many statistical methods and machine learning algorithms.
- Information Theory: In information theory, entropy (a measure of uncertainty or surprise) is often calculated using the natural logarithm. This choice is partly because the natural logarithm measures the number of e-folding times needed to reach a certain level of growth.
4. Machine Learning Algorithms
- Log-Likelihood: Many machine learning models (like logistic regression) use the natural logarithm in the log-likelihood function, which is maximized to fit the model to the data.
- Activation Functions: Exponential functions are used in certain activation functions in neural networks, such as the softmax function, which generalizes the logistic function to multiple classes.
5. Universality of e
- Ubiquitous in Nature: The number ( e ) appears in many natural phenomena (such as patterns of growth and decay, the distribution of prime numbers, etc.), making it a natural choice for models that attempt to mimic or understand natural processes.
Conclusion
The prevalence of ( e ) and its related functions in probability and machine learning is due to their unique mathematical properties, their natural appearance in growth/decay processes, their fundamental role in statistics and information theory, and their utility in simplifying complex mathematical expressions. These functions provide a powerful toolkit for modeling a wide range of phenomena in machine learning and beyond.
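The notes above say that the natural log turns products into sums (Mathematical Properties) and that models maximise a log-likelihood (Machine Learning Algorithms). Here is a minimal sketch, not from the original notes and using invented probabilities, of why this matters numerically:

```python
import numpy as np

# Multiplying many small probabilities underflows to 0.0, while summing their
# logs stays finite and numerically stable.
rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.1, size=1000)   # 1000 small, made-up probabilities

naive_likelihood = np.prod(probs)           # underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))      # a finite, large negative number

print(naive_likelihood, log_likelihood)
```

This is one concrete reason models like logistic regression maximise the log-likelihood rather than the raw likelihood.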
QX) What are the two types of data?
Structured
Unstructured
A good proxy for how ‘structured’ data is: how easy would it be to represent in a table? Very hard for images/videos, very easy for a company’s expenses.
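A hypothetical illustration of the ‘can it fit in a table?’ test (the expense figures and the image are invented, not from the notes):

```python
import numpy as np
import pandas as pd

# Structured: a company's expenses drop naturally into rows and columns.
expenses = pd.DataFrame({
    "date": ["2023-01-05", "2023-01-07"],
    "category": ["travel", "software"],
    "amount_aud": [412.50, 99.00],
})

# Unstructured: a raw image is just a grid of pixel intensities, with no
# columns that mean anything on their own.
image = np.random.randint(0, 256, size=(1440, 1440, 3), dtype=np.uint8)

print(expenses)
print(image.shape)
```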
QX) What is metadata with an example?
☞ Metadata is data that describes other data.
Example
When you take a photo on your phone, the details that come up when you press (ℹ︎) are metadata describing the main data - the image.
Time_taken: 2:03:29am
Location: The White House
Size: 16MP
Dimensions: 1440x1440
Metadata is relevant to machine learning because it provides context, aids in data preprocessing, enhances data quality, and can improve model interpretability and performance.
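As a small sketch (reusing the hypothetical values from the example above), you can think of metadata as a dictionary that rides alongside the data itself:

```python
import numpy as np

photo = np.zeros((1440, 1440, 3), dtype=np.uint8)   # the data: the image pixels

photo_metadata = {                                   # the metadata: data about the data
    "time_taken": "2:03:29am",
    "location": "The White House",
    "dimensions": photo.shape[:2],
}
print(photo_metadata)
```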
Q1) Explain the concept of ‘garbage in, garbage out’ and motivate it with an example.
Q1) What is a feature?
Feature engineering is one of a Data Scientist’s jobs. What is it?
The features must contain some kind of signal that is valuable to the algorithm for its classification/prediction
So the raw data must be transformed into new features that better represent the underlying problem.
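A hedged sketch of feature engineering (the dataset, column names, and the implied sales-prediction task are all invented for illustration): raw timestamps carry little direct signal, but features derived from them often do.

```python
import pandas as pd

df = pd.DataFrame({"sale_time": pd.to_datetime([
    "2023-01-02 09:15", "2023-01-07 18:40", "2023-01-08 12:05",
])})

# Transform the raw timestamp into features that better represent the problem.
df["day_of_week"] = df["sale_time"].dt.dayofweek      # 0 = Monday
df["is_weekend"] = df["day_of_week"].isin([5, 6])     # weekend behaviour differs
df["hour"] = df["sale_time"].dt.hour                  # morning vs. evening traffic
print(df)
```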
What is the typical sk-learn pipeline for implementing a machine learning model?
Please read up on sklearn if you don’t already know it. Knowing the five steps will give you access to every machine learning model. It literally covers 50% of your machine learning career - implementation - the other 50% being how it all actually works.
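The notes don’t spell the five steps out here, so the following is only a common version of the workflow (split, instantiate, fit, predict, evaluate), sketched on sklearn’s built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate the model
model = LogisticRegression(max_iter=1000)

# 3. Fit (train) on the training set
model.fit(X_train, y_train)

# 4. Predict on the test set
y_pred = model.predict(X_test)

# 5. Evaluate
print(accuracy_score(y_test, y_pred))
```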
In statistics and machine learning, what do we mean by random noise?
- Random noise in statistics and machine learning is the term used for the unpredictable and irregular fluctuations that are present in all datasets. It’s like the static you hear on the radio or the grain you see on a photograph, which doesn’t come from the actual music or the image itself, but from random disturbances that can’t be controlled. In data, noise could come from measurement errors, slight variations in the process being measured, or other unknown factors that cause the data to vary even under identical conditions.
- For instance, if you’re measuring the height of a plant every day to track its growth, factors like slight differences in how you place the ruler, small tremors in your hand, or even changes in room temperature that slightly expand or contract the ruler could introduce noise to your measurements. This noise is not related to the actual growth of the plant, but it still affects your data.
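A minimal sketch of the plant-height example above (the growth rate and noise level are invented): the ‘true’ growth is a clean trend, and the measurements add random noise on top of it.

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(30)
true_height_cm = 5.0 + 0.3 * days                                    # the signal: steady growth
measured_cm = true_height_cm + rng.normal(0, 0.2, size=days.size)    # signal + measurement noise

print(measured_cm[:5])   # wobbles around the true values without matching them exactly
```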
How should machine learning algorithms interpret noise?
In machine learning
- We want our algorithms to learn from the real patterns in the data (the signal) and ignore this noise. However, it’s often challenging because noise can mask the true patterns we’re interested in.
- A bad model may accidentally overfit, mistaking the random noise for the underlying pattern, and thus making wrong predictions later down the line.
- A good model will be able to distinguish between the signal and the noise, focusing on the true patterns that help make accurate predictions or decisions.
What is overfitting?
A model that is overfit has low bias (good) and high variance (bad). This happens when the model learns noise and fluctuations in the training data as if they were significant features. It thus performs poorly on real, unseen data.
In the context of machine learning models, overfitting is when the model ‘learns’ the training data too well, resulting in almost no bias, but high variance.
Let’s say Tao is studying (learning) for an upcoming exam.
- Memorizing past exam papers is akin to a model that overfits to its training data.
- New questions on the exam represent new, unseen data for a predictive model.
- Genuine understanding of the material is similar to a well-generalized model that can perform well on both training and unseen data.
Follow up: what would an overfit model look like in this analogy?
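A minimal sketch of overfitting (not from the notes; the data is synthetic and the degree-15 polynomial is just one stand-in for an overly flexible model): the flexible model chases noise in a small training set and typically does worse on held-out data than a simple straight line.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(0, 0.2, 20)   # true signal is linear, plus noise
X_test = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_test = 2 * X_test.ravel() + rng.normal(0, 0.2, 20)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),
          round(mean_squared_error(y_test, model.predict(X_test)), 3))
# Typically: degree 15 achieves the lower training error but the higher test error.
```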
What is underfitting?
MEGATHREAD: What evaluation metrics do we use to evaluate the effectiveness of a machine learning model?
☞ In statistics we are usually evaluating ‘estimators’ of population parameters; e.g. the population mean, or the true regression line. ☞ In machine learning we evaluate entire models, which estimate the whole input-to-output relationship rather than a single parameter.
Precision
The % of positive classifications that were correct.
Accuracy
The % of all classifications (positive and negative) that were correct.
Recall
The % of actual positives that were correctly identified.
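To make the relationships concrete, here is a quick sketch with made-up labels (not from the notes), using sklearn’s metric functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # fraction of all predictions that were correct
print(precision_score(y_true, y_pred))    # of the predicted positives, the fraction correct
print(recall_score(y_true, y_pred))       # of the actual positives, the fraction found
```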
What is the difference between online and offline metrics?
Online metrics are used to see the system’s performance through online evaluations on live data during an A/B test. Offline metrics are used in offline evaluations, which simulate the model’s performance in the production environment.
What is a loss function? Heuristically and holistically, what is the general idea behind a loss function?
What metrics do we use to evaluate the effectiveness of a statistical estimator?
- Bias
- Variance
- Efficiency
- Consistency
- Precision
Most statistical models are also classified as machine learning models (e.g Linear Regression) so you can see how quickly things get confusing!
Every machine learning model learns to get better by comparing its output to the target using an error/loss function.
- For classification problems (categorical targets)
- Accuracy (how many did you get right)
- For regression problems (numeric targets)
- Mean squared error
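A hedged sketch (with made-up numbers) of the two styles of evaluation mentioned above:

```python
import numpy as np

# Regression: mean squared error penalises how far predictions land from the truth.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 9.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # (0.25 + 0.25 + 4.0) / 3 = 1.5

# Classification: accuracy counts the fraction of labels predicted correctly.
labels = np.array([1, 0, 1, 1])
preds = np.array([1, 0, 0, 1])
print(np.mean(labels == preds))   # 0.75
```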
The Confusion Matrix
Cross Validation
Bias/Variance tradeoff
- Bias: measures the model’s tendency to miss the true pattern; a high-bias model is too simple and underfits the training data
- Variance: measures how much the model’s predictions change when it is trained on different samples of data; a high-variance model tends to overfit (graph)
How is this different to statistical bias/variance?
In statistics, an estimator (model) is unbiased if, in the long run, its average value is equal to the true underlying value.
An estimator’s variance measures how much its results vary from sample to sample.
What is inference?
☞ 𝐌𝐋 inference is basically just the act of actually using the trained model on new, unseen data. It is the post-training step where the model applies its learned knowledge to provide outputs for inputs it hasn’t encountered during training. Essentially, it’s where the model is put to work in practical applications, such as predicting outcomes, classifying data, or generating content.
Statistical inference is the process of drawing conclusions about a larger population or underlying data process based on a sample from that population. E.g. what is the mean weight of boys at your school, based on the mean weight of boys in your class?
QX) Explain a machine learning project you have worked on in the past. Why did you choose that specific model and how did you prepare your data?
- A: I worked on a sentiment analysis project to determine customer feedback polarity. I chose the Naive Bayes model due to its efficiency with textual data. For data preparation, I tokenized the reviews, removed stopwords, and used TF-IDF for feature extraction.
What is data leakage in cross validation?
Data: dirty data, high-variance data, garbage in garbage out, signal.
Train vs. Validate vs. Test
What is a class in ML?
Class
A class is another name for a category.
Classification
A datapoint’s classification is the category that it belongs to.
What are the different types of classification problems in machine learning?
There are three main types: binary classification (two classes), multi-class classification (more than two classes, one label per datapoint), and multi-label classification (each datapoint can belong to several classes at once).
What is a ‘dimension’ of data in ML?
A degree of freedom which specifies the value of the datapoint (think: how many different numbers would you need to fully specify the data) - you couldn’t specify the shape of a phone with just one number; you would need height and width.
A loose proxy: just think of the number of columns in the dataset.
↳ Can you give an example of N-dimensional data?
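One hedged example (the measurements are invented): each phone below is a 4-dimensional datapoint, because four numbers are needed to fully specify it.

```python
import pandas as pd

phones = pd.DataFrame({
    "height_mm": [146.7, 160.8],
    "width_mm": [71.5, 78.1],
    "depth_mm": [7.4, 7.65],
    "weight_g": [172, 240],
})
print(phones.shape)   # (2, 4): 2 datapoints, each described by 4 dimensions (columns)
```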
What is a hyperparameter?
Hyperparameter
- Hyperparameters are not learnt from the data, but chosen before the learning process begins.
- High level ‘settings’ of a model
- Best understood via examples:
- The number of layers in a neural network
- The ‘k’ in KNN.
- The kernel type in SVMs.
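A short sketch (not from the notes; the particular values 5, 'rbf', and 1.0 are arbitrary) of how these hyperparameters are set in sklearn, chosen up front rather than learnt from the data:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

knn = KNeighborsClassifier(n_neighbors=5)   # the 'k' in KNN
svm = SVC(kernel="rbf", C=1.0)              # the kernel type (and C) in SVMs

# The learned weights / support vectors, by contrast, are parameters:
# they only exist after calling .fit(X, y) on training data.
```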
What is a validation set?
- Different to train and test set. It is derived from a division of the training set.
To be very clear:
- training set
- validation set (test against, then adjust model or hyperparameters)
- testing set (finally, used once at the end for an unbiased estimate of performance)
↳ Why couldn’t you just use the test set for this?
The use of a validation set is crucial for model selection because it can help to estimate the model’s performance and select the best model and hyperparameters without using the test set. The test set should only be used once to avoid biasing the model performance metrics.
By this, I mean that if you adjusted the hyperparameters based on the test set, you would simply be optimizing the setup to perform best on that particular test set, which may not be representative of the true data, and thus you have the same issue again.
The test set should only be used once because using it multiple times during the model selection and tuning process can lead to “leakage” where information about the test set leaks into the model. This can cause you to inadvertently tune your model to perform well on the test set, rather than on new, unseen data.
↳ What are the three main ways of creating the validation set?
- Holdout Method: Simply dividing your dataset into separate training, validation, and test sets.
- K-Fold Cross-Validation: Dividing your dataset into K parts, using K-1 parts for training and the remaining part for validation. This is repeated K times with a different part used as the validation set each time.
- Leave-One-Out Cross-Validation: A special case of cross-validation where the number of folds equals the number of instances in the dataset, essentially using all but one instance for training and the remaining instance for validation, repeated for each instance in the dataset.
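A hedged sketch of the first two approaches in sklearn (the iris dataset and logistic regression are just stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Holdout method: carve a validation set out of the training data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross-validation: 5 folds, each taking a turn as the validation set.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())
```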