Why do we need to represent text as numbers?
Vectors
Machine learning models take vectors (arrays of numbers) as input.
[0.32, 0.49, 0.33, 0.69, 0.56]
Vectorisation
When working with text, the first thing you must do is come up with a strategy to convert strings (words) to numbers. This process is known as vectorisation; only then can the text be fed to the model.
Can you name three different strategies for accomplishing text vectorisation?
One-hot encodings
Create a zero vector with length equal to your vocabulary, and then place a one in the index corresponding to the word.
To encode a sentence, concatenate the one-hot vectors for each word.
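A minimal sketch of this idea (the vocabulary, sentence, and helper names below are made up for illustration):

```python
# Toy vocabulary, invented for illustration.
vocab = ["oh", "baby", "a", "triple"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Zero vector as long as the vocabulary, with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Encode a sentence by concatenating the one-hot vectors of its words.
sentence = "oh baby a triple".split()
encoded = [one_hot(word) for word in sentence]
print(encoded)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```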
Encode each word with a unique number
E.g. you could assign 1 to “oh”, 2 to “baby”, 3 to “a”, and 4 to “triple”.
Then the sentence “oh baby a triple” becomes simply
[1, 2, 3, 4]
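A minimal sketch of this strategy, using the same made-up vocabulary as above:

```python
# Each word in the (invented) vocabulary gets an arbitrary unique integer.
word_to_index = {"oh": 1, "baby": 2, "a": 3, "triple": 4}

def encode(sentence):
    # Replace each word with its assigned integer.
    return [word_to_index[word] for word in sentence.split()]

print(encode("oh baby a triple"))  # [1, 2, 3, 4]
```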
Word embeddings
Word embeddings are an efficient, dense representation, in which similar words have a similar encoding.
A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
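For example, with hypothetical 4-dimensional embeddings (the vectors below are invented, not learned), "similar encoding" can be measured with cosine similarity:

```python
import numpy as np

# Hypothetical embeddings, invented for illustration; real values are learned.
cat = np.array([0.9, 0.1, 0.4, -0.3])
kitten = np.array([0.85, 0.15, 0.35, -0.25])
car = np.array([-0.6, 0.8, -0.2, 0.5])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, kitten))  # close to 1: similar words, similar vectors
print(cosine_similarity(cat, car))     # much lower: dissimilar words
```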
↳ How efficient is the first method, one-hot encoding?
Very inefficient. Since each word is represented by a vector with only a single 1, every other index is 0.
↳ What percentage of the vectors are 0’s if you are one-hot encoding a vocabulary of 1000 words?
99.9%. Each one-hot vector contains a single 1 and 999 zeros, so 999/1000 = 99.9% of the elements are zero.
↳ How efficient is the second method?
It is very efficient. An entire sentence can be encoded with a single dense vector, i.e:
"That's what I'm saying" -> [4, 7, 3, 9]
↳ What are two downsides to this approach?
- Arbitrary labels: mapping the word “house” to the number 18 does not capture any semantic information.
- Hard for the model to interpret: as a result, integer encodings can be hard for models to interpret. A linear classifier, for example, learns a single weight for each feature, but there is no relationship between the similarity of two words and the similarity of their encodings.
↳ How efficient is the third method, word embeddings?
Very efficient. Words can be represented as dense vectors that also capture semantic information, such as "Ruby" = [1.2, -0.1, 4.3, 3.2].
This means that similar words also have similar embeddings.
These embeddings are too complex to specify by hand. They are trainable parameters (weights learned by the model during training).
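A minimal sketch using a Keras Embedding layer (the vocabulary size and embedding dimension here are arbitrary): the layer is just a trainable lookup table with one row of weights per word, and those weights are updated during training.

```python
import tensorflow as tf

# Arbitrary sizes, chosen for illustration.
vocab_size = 1000
embedding_dim = 8

# A trainable lookup table: one embedding_dim-sized row per word in the vocabulary.
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size,
                                            output_dim=embedding_dim)

# Looking up the integer-encoded sentence [1, 2, 3, 4] returns one dense vector
# per word; the values start random and are learned during training.
vectors = embedding_layer(tf.constant([1, 2, 3, 4]))
print(vectors.shape)  # (4, 8)
```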
↳ How many dimensions should the embedding be?
Embedding dimensions
A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
- 8-dimensional for small datasets
- Up to 1024-dimensional for large datasets
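A rough sketch of why larger embeddings need more data (the vocabulary size below is an arbitrary example): the embedding table has vocab_size × embedding_dim trainable weights, so the dimension directly scales the number of parameters to learn.

```python
vocab_size = 10_000  # example vocabulary size, chosen arbitrarily

for embedding_dim in (8, 256, 1024):
    params = vocab_size * embedding_dim
    print(f"{embedding_dim}-dimensional embedding: {params:,} trainable weights")
# 8-dimensional embedding: 80,000 trainable weights
# 256-dimensional embedding: 2,560,000 trainable weights
# 1024-dimensional embedding: 10,240,000 trainable weights
```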