**Softmax** is used when you have a classification problem with more than one category/class. Suppose you want to know if a certain image is an image of a cat, dog, alligator, or onion. Softmax turns your model's outputs into values that represent probabilities. It is similar to sigmoid (used for binary classification), but works for any number of outputs.

**Note**: Softmax is not so good if you want your algorithm to tell you that it doesn't recognize any of the classes in the picture, as it will still assign a probability to each class. In other words, Softmax won't tell you "I don't see any cats, dogs, alligators, or onions."

Steps:

- I am using *S* here to stand for the Softmax function, which takes *y* as its input (*y* is what we are imagining comes from the output of our machine learning model)
- the subscript *i* lets us know that we do these steps for all of the *y* inputs that we have
- the numerator in the formula tells us that we need to raise *e* to the power of each number *y* that goes through the Softmax function—this is called an exponential
- An **exponential** is the mathematical constant *e* (approximately 2.718) raised to a power.
- the denominator in the formula tells us to add together the *exponentials* of all the numbers from your model's output
- ∑ (the Greek capital letter Sigma) stands for "sum", which means "add all the following stuff together"

**What did we just find?** The probability is the numerator (which will be different depending on the number *y*) divided by the denominator (which will be the same for all the outputs).

Because we are exponentiating the numbers, they will always be positive and they will get very large very quickly.

**Why is this nifty?** If one number *y* among the inputs to Softmax is higher, the exponential will make that number *even higher.* Thus, Softmax will select the class that is the most likely to be in the image out of all the classes that it can choose from—and that output will be closer to 1. The closer the number is to 1, the more *likely* it is of that class.

- the outputs will always be in the range between 0 and 1
- the outputs will always add up to 1
- each value of the outputs represents a probability
- taken together the outputs form a probability distribution

- We have our **inputs** to the Softmax function: -1, 1, 2, 3
- We use each of these inputs as an exponent on *e*:

e^{-1}, e^{1}, e^{2}, e^{3}

- which gives us approximately these results: 0.36, 2.72, 7.39, 20.09
- we add together the exponentials to form our denominator (we will use the same denominator for each input):

e^{-1} + e^{1} + e^{2} + e^{3} = 30.56

- then we divide each numerator by the denominator that we just found:
- 0.36 / 30.56 = 0.01
- 2.72 / 30.56 = 0.09
- 7.39 / 30.56 = 0.24
- 20.09 / 30.56 = 0.66

- So Softmax turns these inputs into these outputs:
- -1 → 0.01
- 1 → 0.09
- 2 → 0.24
- 3 → 0.66

- and if we add up all the outputs, they equal 1 (try it, it actually works!)
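
The whole worked example can be condensed into a few lines of NumPy (a sketch, not production code—the function name `softmax` and the inputs are just the ones from this example):

```python
import numpy as np

def softmax(y):
    # exponentiate each input, then divide each one by the common denominator (the sum)
    exps = np.exp(y)
    return exps / exps.sum()

probs = softmax(np.array([-1.0, 1.0, 2.0, 3.0]))
print(np.round(probs, 2))  # approximately [0.01 0.09 0.24 0.66]
print(probs.sum())         # the outputs always add up to 1
```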

Thanks for reading this week's blog post. I hope you enjoyed it and have a clearer understanding of how the mathematics behind Softmax works.

A neural network can have any number of layers. Each layer has a linear function followed by a non-linear function, called an **activation function.** The activation function takes the output from the linear function and transforms it somehow. That activation becomes the input features for the next layer in the neural network.

Here's a 3 layer neural network to help us visualize the process:

- We have features (x1, x2, x3, x4, x5) that go into the first hidden layer
- this layer calculates the linear function (y = wx + b) on each input and then puts the result from that calculation through the activation function ReLU (max(0, x))
- the activations output from layer 1 becomes the inputs for layer 2 and the same calculations happen here
- the activations output from layer 2 become the input for layer 3 and then layer 3 does the same linear function-activation function combo
- the last set of activations from layer 3 go through a final activation layer, which will be different depending on what your model is trying to predict: if you have only two classes (a binary classifier), then you can use a **sigmoid** function, but if you have more than two classes you will want to use **softmax**
- those final activations are your predicted labels
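
Here is a rough sketch of that forward pass in NumPy. The layer sizes and random weights are made up purely for illustration—a real network would learn its weights during training:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5,))              # five input features (x1..x5)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(4)  # made-up layer sizes
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

a1 = relu(W1 @ x + b1)       # layer 1: linear function, then ReLU
a2 = relu(W2 @ a1 + b2)      # layer 2: same combo on layer 1's activations
a3 = sigmoid(W3 @ a2 + b3)   # final layer: sigmoid for a binary output
print(a3)                    # a probability between 0 and 1
```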

Linear models are great, but sometimes non-linear relationships exist and we want to know what they are. Consider the example below.

We want to predict the boundary line between the blue and pink dots (fascinating, I know!).

This is what a linear function, such as logistic regression, can uncover:

This is what a non-linear neural network can uncover, which gives us a better visualization of the boundaries between blue and pink dots:

Furthermore, the more hidden layers we add to our network, the more complex relationships we can potentially find in the data—each layer is learning about some feature in the data.

I find it helpful to think of activation functions in two categories (I don't know if this is an "official" distinction, it's just the way I think about them)—activations on hidden units and activations for the final output. The activations for the hidden units exist to make training easier for the neural network, and allow it to uncover non-linear relationships in the data. The activations for the final output layer are there to give us an answer to whatever question we are asking the neural network.

For example, let's imagine we're training a binary classifier that distinguishes between pictures of cats and mice. We might use the **ReLU activation function** on our *hidden* units, but for our *final* output layer we need to know the answer to our question: is this picture of a cat or a mouse? So we will want an activation function that outputs 0 or 1.

Let's take a look at what each different activation function is actually doing.

x here stands for the output from the linear function that is being fed into the activation function

**Sigmoid**converts outputs to be between 0 and 1

**Tanh:**converts numbers to range from -1 to 1—you can picture it as a shifted version of sigmoid. It has the effect of centering the data so that the mean is closer to 0, which improves learning for the following layer.

**ReLU (rectified linear unit): max(0, x)**—if the number is negative the function gives back 0, and if the number is positive it just gives back the number with no changes. ReLU tends to run faster than tanh in computations, so it is generally used as the default activation function for hidden units in deep learning.

**Leaky ReLU: max(0.01x, x)**—if the number x is negative, it gets multiplied by 0.01, but if the number x is positive, it stays the same.
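
If you want to see these activation functions in action, here's a quick sketch in NumPy (the test values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 2.] -- negatives become 0
print(leaky_relu(x))  # [-0.02  0.    2.  ] -- negatives shrink instead of dying
print(sigmoid(x))     # everything squeezed into (0, 1)
print(tanh(x))        # everything squeezed into (-1, 1), centered at 0
```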

**Sigmoid** is used to turn the activations into something interpretable for predicting the class in **binary classification.** Since we want to get an output of either 0 or 1, a further step is added:

- decide which class your predictions belong to according to a certain threshold (often if the number is less than 0.5 the output is 0, and if the number is 0.5 or higher the output is 1)

(Yes, sigmoid is on both lists—that's because it is more useful in deep learning for producing outputs, but it's helpful to know about for understanding Tanh and ReLU.)
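
That thresholding step is simple enough to sketch directly (the function name `predict_class` is just my own label for it):

```python
def predict_class(activation, threshold=0.5):
    # turn a sigmoid output into a hard 0/1 class label
    return 1 if activation >= threshold else 0

print(predict_class(0.92))  # 1 -- at or above the threshold
print(predict_class(0.31))  # 0 -- below the threshold
```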

**Softmax** is used when you need more than one final output, such as in a classifier for more than one category/class. Suppose you want to know if a certain image depicts a cat, dog, alligator, onion, or none of the above. The **motivation**: it provides outputs that are probabilities.

- If that formula looks gross to you, come back next week—I plan to break it down step by step until it seems painfully simple

At the end of my article on Decision Trees we looked at some drawbacks to decision trees. One of them was that they have a tendency to overfit on the training data. Overfitting means the tree learns what features classify the training data very well, but isn't so good at making generalizations that accurately predict the testing set.

I mentioned that one way we can try to solve the problem of overfitting is by using a Random Forest. Random Forests are an **ensemble** learning method, so called because they involve combining more than one model into an ensemble. Essentially, we are going to train a bunch of decision trees, and take the majority vote on class predictions for the test data.

**Training data** is the subset of your data that you use to train your model. The tree in the random forest will use this data to learn which features are likely to explain which class a given data sample belongs to.

**Test data** is the subset of your data that you use to make sure your model is predicting properly. It's supposed to simulate the kind of data your model would make predictions on in the real world (for example, if you are making a dog breed classifying app, the test data should mimic the kinds of images you might get from app users, uploading pictures of dogs).

Have you ever been in a classroom and the teacher for some reason asks the class as a whole a question. (Not necessarily a great teaching technique, but let's move on.) You think you know the answer, but you are afraid of being wrong, so you wait a little while until bolder classmates have given their answers before agreeing with what the majority is saying.

This can be a good strategy, if not for learning the material, at least for being right. You wait to see what the consensus is in the classroom before casting your vote.

You're not sure of your answer, but if enough other people agree with you, it seems more likely to you that your answer is the right one.

While this technique probably isn't the best predictor of whether a group of students is learning the material, it can be used to good effect in machine learning.

Building a random forest classifier can be broken into two steps.

- Training: building a forest of many different trees, each of which learns from a **bagged** random sample of the data
- Making predictions: taking the predictions of each tree in the forest and taking the majority vote on which class each sample in the test set belongs to

To train a Random Forest, we train each decision tree on *random* groups of the training data, using a technique called **bagging** or **bootstrap aggregating.**

Bagging is *not* dividing the training data into subsets and building a tree from each subset.

Instead, each individual tree randomly grabs samples from the training set. The training set has *n* samples. What the tree does is choose *n* samples *randomly* from a bag of all the training samples, but after considering each sample it puts it back into the bag before picking out another one. This is called sampling with replacement.

In this very simplified image the different shapes and colors in the bag just represent different samples in the training data (there is no significance intended to the shapes and colors). Each tree grabs a sample from the bag and then puts it back before grabbing another sample, so each tree ends up with a different set of data that it uses to build its tree.

Note that this does mean that any given tree in the random forest might end up with the same sample more than once (as you can see in my little picture). But because there are multiple trees in the forest, and each one chooses samples randomly, there will be enough variation in the trees that it won't really matter too much if samples are repeated for a given tree.
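
Sampling with replacement is a one-liner in NumPy. This sketch pretends the training set is just the numbers 0 through 9:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = np.arange(10)  # pretend these are our 10 training samples
# draw n samples *with replacement*: every pick goes back into the bag
bootstrap = rng.choice(samples, size=len(samples), replace=True)
print(bootstrap)  # some samples appear more than once, others not at all
```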

Once the trees are made they can make their predictions. We feed the test set to the trees in our random forest classifier, each tree makes its predictions on the test set, and then we compare the predictions and take the ones that the majority of trees agrees on.
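
The majority vote itself can be sketched with Python's `Counter` (the five votes here are hypothetical):

```python
from collections import Counter

# hypothetical predictions from five trees for one test sample
tree_votes = [1, 0, 1, 1, 0]
majority_class, n_votes = Counter(tree_votes).most_common(1)[0]
print(majority_class)  # 1 -- three of the five trees agree on class 1
```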

Like decision trees, random forests are straightforward and explainable. Since a random forest is made up of decision trees, we still have access to the Feature Importance (using Scikit-Learn, for example) for understanding the model. Our model can still tell us how important a given feature is in predicting the class of any sample.

Since we are getting the answer from more than one tree, we are able to get an answer that the majority agrees upon. This helps reduce the overfitting we see in a Decision Tree. Each tree in the random forest is searching for the best feature in a random subset of the data, rather than the best feature in *all* of the training set. This helps the model achieve more stable predictions. The Random Forest Classifier *as an ensemble* can't memorize the training data, because each tree in the forest doesn't have access to all the training data when it makes its tree.

**Hyperparameters** is a term used in machine learning to refer to the details of the model that you can tweak to improve its predictive power. These are different from **parameters**, which are the actual things that your model uses to compute functions (like the weights **w** and the bias **b** in a linear function).

Some hyperparameters you can tweak in Random Forests are:

- The number of trees—in sklearn (Scikit-Learn's machine learning library) this hyperparameter is called `n_estimators`
  - more trees generally improve the model's predictive power, but also slow down training, because there are more trees to build
- The `n_jobs` hyperparameter in sklearn tells your computer how many processors to use at once, so if you want the model to run faster, you can set this hyperparameter to `-1`, which tells the computer to use as many processors as it has

Other hyperparameters that you can change are the same ones found in Decision Trees—for example, `max_features` and `min_samples_leaf`—which I discussed in my post and demonstrated in this Kaggle notebook on Decision Trees.
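
Putting those hyperparameters together, a minimal sklearn sketch might look like this (the dataset here is synthetic, standing in for real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# toy data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))     # accuracy on the test set
print(forest.feature_importances_[:3])  # feature importance, just like a single tree
```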

Random Forests are a handy boost to your baseline Decision Tree model, either in classification or regression problems. You can usually reduce overfitting, while not giving up too much of the model explainability that you have access to with a Decision Tree algorithm.

Problems that work well for traditional machine learning methods are ones that involve structured data—data where the relationship between features and labels is already understood. For example, a table of data that matches some traits of a person (such as age, number of children, smoker or non-smoker, known health conditions, etc.) with the price of that person's health insurance.

With some problems, the relationship between features and targets is less clear, and a neural network will be your best bet to make predictions on the data. These will be problems that involve unstructured data—things like images, audio, and natural language text. For example, what arrangement of pixels in an image of a cat (features) makes it more likely that it is a picture of a cat (label), rather than any other thing.

It's called an artificial neural network, not because it artificially replicates what our brains do, but rather because it is inspired by the biological process of neurons in the brain. A neuron receives inputs with the dendrites (represented by the X, or feature vector in machine learning) and sends a signal to the axon terminal, which is the output (represented by the y, or label vector).

Here's a nice picture I borrowed from Wikipedia that illustrates a neuron, and its relationship to the inputs and outputs we see in machine learning problems:

So the idea is to do something similar using computers and math. The following little picture represents a single layer neural network, with an input layer that contains the features, a hidden layer that puts the inputs through some functions, and the output layer which spits out the answer to whatever problem we are trying to solve. What actually goes on behind the scenes is just numbers. I don't know if that needs to be said or not, but there aren't actually little circles attached to other circles by lines—this is just a visual way to represent the mathematical interactions between the inputs, hidden layer, and the output.

Inputs could be pixels in an image, then the hidden layer(s) use some functions to try to find out what arrangement of pixels are the ones that represent a cat (our target), and the output layer tells us whether a given arrangement of pixels probably represents a cat or not.

From what I've seen, *neural network* and *deep learning* are used mostly interchangeably, although a neural network is a kind of architecture that is used in the problem space of deep learning.

- Variations on neural networks with special abilities that come from each class of model's specific architecture:
- CNNS: convolutional neural networks—these are used for images
- RNNs: recurrent neural networks—used for something like text, where the order of input (words or characters) matters
- GANs: generative adversarial networks—networks that make something new out of the input, like turning an image into another image

The problems that are solved in the deep learning space are so called, because the networks used to solve them have multiple hidden layers between the input and output—they are *deep*. Each layer learns something about the data, then feeds that into the layer that comes next.

In a shallow network, like linear regression, for example, the only layer is linear, and contains a linear function. The model can learn to predict a linear relationship between input and output.

In deep learning there is a linear and a non-linear function, called an activation function, at work in each layer, which allows the network to uncover non-linear relationships in the data. Instead of just a straight line relationship between features and label, the network can learn more complicated insights about the data, by using the functions in multiple layers to learn about the data.

**Example**: In networks that are trained on image data, the earlier layers learn general attributes of image data, like vertical and horizontal edges, and the later layers learn attributes more specific to the problem the network as a whole is trying to solve. In the case of the cat or not-cat network, those features would be the ones specific to cats—maybe things like pointy ears, fur, whiskers, etc.

The exact architecture of the neural network will vary, depending on the input features, what problem is being solved, and how many layers we decide to put between the input and output layers, but the principle is the same: at each layer we have a linear function and an activation function whose output is fed into the layer that comes next, all the way until the final output layer, which answers the question we are asking about the data.

What's the point of the activation function, anyway? To answer that question, let's look at what would happen if we didn't have an activation function, and instead had a string of linear equations, one at each layer.

- We have our linear equation: y = wx + b

**Layer 1**

- Let's assign some values and solve for y:
- w=5, b=10, x=100: y = (5*100) + 10 → y = 510
- 510 is our output for the first layer

**Layer 2**

- We pass that to the next layer: 510 is now the input for this layer, so it's the new x value
- Let's set our parameters to different values: w = 4, b = 6, and now we have the equation:
- y = (4*510) + 6 → our new output is y = 2046

**But here's the thing**: instead of doing this two-layer process, we could have just set w and b equal to whatever values would let us get 2046 in the first place. For example: w=20, b=46, which would also give us y = 20 * 100 + 46 = 2046 in a single layer. Most importantly, we won't achieve a model that recognizes non-linear relationships in data while only using linear equations.

It doesn't matter how many layers of linear equations you have—they can always be combined into a single linear equation by setting the parameters w and b to different values. Our model will always be linear unless we introduce a non-linear function into the mix. That is why we need to use activation functions. We can string together multiple linear functions, as long as we separate each one by an activation function, and that way our model can do more complex computations, and discover more complicated relationships in the data.
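
You can verify the collapse yourself with the numbers from the example above:

```python
# two stacked linear layers, as in the walkthrough above
w1, b1 = 5, 10
w2, b2 = 4, 6
x = 100
two_layers = w2 * (w1 * x + b1) + b2
print(two_layers)  # 2046

# ...collapse into one layer with w = w2*w1 and b = w2*b1 + b2
w, b = w2 * w1, w2 * b1 + b2   # w=20, b=46
one_layer = w * x + b
print(one_layer)   # 2046 -- same answer from a single linear equation
```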

If you have a lot of data and a problem you want to solve with it, but you aren't sure how to represent the structure of that data, deep learning might be for you. Images, audio, and anything involving human language are likely culprits for deep learning, and each of those problems will have its own flavor of neural network architecture that can be used to solve it.

Last week we explored Decision Trees, and I said that the algorithm chooses the split of data that is most advantageous as it makes each branch in the tree. But how does the algorithm know which split is the most advantageous? In other words, what criterion does the algorithm use to determine which split will help it answer a question about the data, as opposed to other possible splits?

Entropy is this criterion.

Entropy measures randomness. But that might sound a little abstract if you aren't used to thinking about randomness, so instead let's shift our attention to something concrete: socks.

**1st Situation**: Imagine you have a laundry basket filled with 50 red socks and 50 blue socks. What is the likelihood that you will pull a red sock out of the basket? There is a 50/50 chance that it will be a red sock, so the chance is 0.5. When there is an equal chance that you will get a red sock or a blue sock, we say that the information is very random - there is high entropy in this situation.

**2nd Situation**: Now you divide those socks into two different boxes. You put the 50 red socks into a box, labeled Box 1, and the 50 blue socks into another box, labeled Box 2. The likelihood that you will pull a red sock out of Box 1 is 1.0 and the likelihood that you will pull a red sock out of Box 2 is 0. Similarly, the likelihood that you'll pull a blue sock out of Box 2 is 1.0 and the likelihood that you will pull a red sock out of that same box is 0. There is no randomness here, because you know whether you will get a red sock or a blue sock. In this situation, there is no entropy.

**3rd Situation**: But let's suppose that we split up the basket of socks another way - 50 socks in Box 1 and 50 socks in Box 2, but we kept the *distribution* of the socks the same - so there are 25 red socks in Box 1 and 25 blue socks. And in Box 2 there are also 25 red socks and 25 blue socks. In this situation, although we have divided the total amount of socks in half, the entropy is the *same* as the first situation, because the information of whether you will grab a red sock or a blue sock is just as unpredictable as the first example.

So we can say that entropy measures the unpredictability or randomness of the information within a given distribution.

Just so you know, this concept is also called impurity. In the 2nd situation we reduced the entropy, which gives us low impurity. In the 1st and 3rd examples we had high impurity.

A Decision Tree is tasked with reducing the impurity of the data when it is making a new branch in the tree. With those situations above, a better split of the data would be the kind seen in Situation 2, where we separated the blue socks from the red socks. A bad split would be Situation 3, because we didn't make it any easier for us to guess whether a sock would be red or blue.

Why do we want to reduce the randomness/impurity? Remember, the goal of a Decision Tree is to predict which class a given sample belongs to, by looking at all the features of that sample. So we want to reduce the unpredictability as much as possible, and then make an accurate prediction.

This is called information gain. We are measuring how much information a given feature gives us about the class that a sample belongs to.

- We have a sample
- We know the value of a feature of that sample (for example, in Box 1 or Box 2)
- Does knowing that feature reduce the randomness of predicting which class that sample belongs to?
- If it does, then we have reduced the entropy and achieved information gain

The term information gain is more or less self-explanatory, but in case it is less for you: it is telling us how much we learned from a feature. We want to find the most informative feature — the feature that we learn the most from. And when a feature is informative we say we have gained information from it, hence information gain.

I'm going to do my best to break down the mathy part now, for calculating entropy and information gain.

**Here is some notation for probabilities:**

- True (i.e. a sample belongs to a certain class): 1
- False (i.e. a sample doesn't belong to a certain class): 0
- A given sample: X
- Probability (likelihood) that given sample X is of a certain class (True):
*p*(X = 1) - Probability that X is not of a certain class (False):
*p*(X = 0)

Notation for **entropy**:

H(X) = −*p*(X=1) log_{2}(*p*(X=1)) − *p*(X=0) log_{2}(*p*(X=0))

- H(X): entropy of X
- Note: when *p*(X=1) = 0 or *p*(X=0) = 0 there is no entropy
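
That formula translates directly into a few lines of Python (a sketch, using the sock probabilities from earlier):

```python
import math

def entropy(p_true):
    # binary entropy H(X), with p(X=1) = p_true and p(X=0) = 1 - p_true
    if p_true in (0, 1):
        return 0.0  # no randomness means no entropy
    p_false = 1 - p_true
    return -p_true * math.log2(p_true) - p_false * math.log2(p_false)

print(entropy(0.5))  # 1.0 -- the 50/50 sock basket: maximum entropy
print(entropy(1.0))  # 0.0 -- Box 1 holding only red socks: no entropy
```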

Notation for **information gain**:

IG(X,a) = H(X) − H(X|a)

- IG(X,a): information gain for X when we know an attribute *a*
- IG(X,a) is defined as the entropy of X, minus the conditional entropy of X given *a*
- The idea behind information gain is that it is the amount of entropy that we *lose* once we know the value of the attribute *a*

**Specific conditional entropy**: for any actual value a_{0} of the attribute *a* we calculate:

H(X|a=a_{0}) = −p(X=1|a=a_{0}) log_{2}(p(X=1|a=a_{0})) − p(X=0|a=a_{0}) log_{2}(p(X=0|a=a_{0}))

- So let's imagine there are several different possible values of the attribute *a*, numbered as follows: a_{0}, a_{1}, a_{2}, ... a_{n}
- We want to calculate the specific conditional entropy for each of those values...

**Conditional entropy**: for all possible values of *a*

H(X|a)=∑_{ai}p(a=a_{i})⋅H(X|a=a_{i})

- Entropy of X given *a*
- We add together the specific conditional entropies for each attribute value a_{0} through a_{n}, each weighted by its probability p(a=a_{i})...
- ...that's what ∑_{ai} p(a=a_{i}) stands for
- ∑ means "sum"
- It is *conditional* because it is the entropy that depends on an attribute *a*

So here are the steps you can take with those equations to actually find the conditional entropy for a given attribute:

- Calculate the probability p(a=a_{0}) for your first possible value of *a*
- Calculate the specific conditional entropy H(X|a=a_{0}) for that same value
- Multiply the values from step 1 and step 2 together
- Do that for each possible value of *a*
- Add together all those weighted specific conditional entropy values you found for *a*
- ...the sum reached in step 5 is your conditional entropy

Then to calculate information gain for an attribute *a:*

- Subtract the conditional entropy from the entropy
- At this point you have found the value of your information gain with respect to that attribute *a*
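
Putting the pieces together with the sock boxes from Situation 2 (a sketch, with `entropy` implementing the formula above):

```python
import math

def entropy(p):
    # binary entropy for p(X=1) = p
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Sock Situation 2: Box 1 holds the 50 red socks, Box 2 the 50 blue ones
h_before = entropy(0.5)  # entropy before we know the box: 1.0
# conditional entropy: each box's entropy, weighted by the chance of picking it
h_given_box = 0.5 * entropy(1.0) + 0.5 * entropy(0.0)  # 0.0
info_gain = h_before - h_given_box
print(info_gain)  # 1.0 -- knowing the box removes all the randomness
```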

What you have now discovered is how random your data *really* is, if you know the value of a given attribute *a* (i.e. a given feature). Information gain lets us know how much knowing *a* reduces the unpredictability.

And this is what the decision tree algorithm does. It uses the entropy to figure out which is the most informative question to ask the data.

This awesome Twitter thread by Tivadar Danka gives a more detailed breakdown into the mathematical concept of entropy, and why there are all those logarithms in the equation. And while we're on the subject, I highly recommend that you follow him on Twitter for great threads on math, probability theory, and machine learning.

If you want a refresher on what logarithms are, check out this very no-nonsense explanation from Math Is Fun. I also highly recommend visiting this site before Wikipedia for all of your math basics needs. It is written with a younger audience in mind, which is perfect if your brain isn't used to all of the fancy math lingo and symbols yet.

**Citations**
In case you're wondering, my understanding of entropy was drawn from the book *Doing Data Science* by Rachel Schutt and Cathy O'Neil and the Udacity Machine Learning course lectures.

I hope you enjoyed this Machine Learning Log, and please share any thoughts, questions, concerns, or witty replies in the comments section. Thank you for reading!

The next morning you wake up. No headache, running shorts are clean, but it's raining, so you decide not to go for a run.

But if you had answered yes to all three questions, then you would be going for a run.

You can use a flow chart to represent this thought process.

That flow chart is a simple decision tree. Follow the answers and you will reach the conclusion about whether you will run or not tomorrow.

We can teach a computer to follow this process. We want the computer to categorize some data by asking a series of questions that will progressively divide the data into smaller and smaller portions.

We will get the computer to:

- Ask a yes or no question of all the data...
- ...which splits the data into 2 portions based on answer
- Ask each of those portions a yes or no question...
- ...which splits each of those portions into 2 more portions (now there are 4 portions of data)
- Continue this process until...
- All the data is divided
- Or we tell it to stop

This is an algorithm, and it is called **recursive binary splitting**. *Recursive* means it's a process that is repeated again and again. *Binary* means there are 2 outcomes: yes or no / 0 or 1. And *splitting* is what we call dividing the data into 2 portions, or as they're more fancily known, *splits*.

**How does the algorithm decide which way to split up the data?** It uses a cost function and tries to reduce the cost. Whichever split reduces the cost the most will be the split that the algorithm chooses.

**What is reducing the cost?** Briefly, cost is a measure of how wrong an answer is. In a decision tree, we are trying to gain as much information as possible, so a split that reduces the cost is one which will group similar data into similar classes. For example, say you are trying to sort a basket of socks based on if they are red or blue. A bad split would be to divide the basket of socks in half but keep the same ratio of red and blue socks. A good split would be to put all the red socks in one pile and all the blue socks in another. (A decision tree algorithm uses **entropy** to measure the cost of each split, but that discussion is beyond the scope of this article.)

We can use decision trees in classification problems (predicting whether or not an item belongs to a certain group) or regression problems (predicting a continuous value).

Then when we have some new data, we compare the features of the new data to the features of the old data. Whichever split a given sample from the new data matches up with, is the category that it belongs to.

- a given test sample belongs to the split where the training samples had the same set of features as that test sample
- For Classification problems: at the end the prediction is 0 or 1, whether the item belongs to a class
- For Regression problems:
- We assign a prediction value to each group (instead of a class)
- The prediction value is the target mean of the items in the group
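
A quick sketch of that regression rule, with made-up leaf targets:

```python
import numpy as np

# hypothetical targets of the training samples that landed in one leaf
leaf_targets = np.array([3.0, 4.0, 5.0])
prediction = leaf_targets.mean()
print(prediction)  # 4.0 -- every test sample reaching this leaf gets this value
```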

Now let's look at an actual dataset so we can see how a decision tree could be useful in machine learning.

We're going to look at the mushrooms dataset from Kaggle. We have over 8,000 examples of mushrooms, with information about their physical appearance, color, and habitat arranged in a table. About half of the samples are poisonous and about half of the samples are edible.

Our goal will be to predict, given an individual mushroom's features, if that mushroom is edible or poisonous. This makes it a **binary classification** problem, since we are sorting the data into 2 categories, or classes.

**By the way**, I don't recommend you use the model we produce to actually decide whether or not to eat a mushroom in the wild.

Here, you can find a Kaggle Notebook to go along with the example discussed in this article. I provided a complete walkthrough of importing the necessary libraries, loading the data, splitting it up into training and test sets, and making a decision tree classifier with the scikit-learn library.

At a certain point you have to stop dividing the data up further. This will naturally happen when you run out of data to divide. But that would be after we have a massive tree of over 8,000 leaf nodes - one for each sample in our training data! That would not be very useful, because we want to have a tree that generalizes well to new data. If we wait too long and let our algorithm split the data into too many nodes, it will overfit. This means it will understand the relationships between features and labels in the training data really well - too well - and it won't be able to predict the class of new data samples that we ask the model about.

Some criteria for stopping the tree:

- Setting the max depth: tell the algorithm to stop splitting once the tree reaches a certain number of levels
- Setting the minimum number of samples required to be at a *leaf* node
- Setting the minimum number of samples required to split an *internal* node - this is helpful if we want to avoid having a split for just a few samples, since this would not be representative of the data as a whole

For example, here is how we can set the max depth with the scikit-learn library:

```
from sklearn.tree import DecisionTreeClassifier

model_shallow = DecisionTreeClassifier(max_depth=4, random_state=42)
model_shallow.fit(X_train, y_train)
```

This will yield a tree that is at most 4 levels deep, no matter how much data is left to split.
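The other two stopping criteria map onto scikit-learn parameters as well; this sketch assumes the same `X_train` and `y_train` as in the notebook:

```python
from sklearn.tree import DecisionTreeClassifier

# Require at least 20 samples in every leaf,
# and at least 50 samples before an internal node may be split
model_pruned = DecisionTreeClassifier(
    min_samples_leaf=20,
    min_samples_split=50,
    random_state=42,
)
# model_pruned.fit(X_train, y_train)  # X_train / y_train as in the notebook
```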

A decision tree algorithm is a type of **greedy algorithm**, which means that it wants to reduce the cost as much as possible each time it makes a split. It chooses the *locally optimal solution* at each step.

This means the decision tree may not find the globally optimal solution--the solution that is best for the data as a whole. At each point where it needs to answer the question "is this the best possible split for the data?" it answers that question only for that one node, at that one point in time.

Left unchecked, the decision tree will learn the relationship between features and targets in the training data really well, but it won't generalize well to new data. This is called **overfitting**.

One way we can deal with this overfitting is to use a Random Forest, instead of a Decision Tree. A Random Forest takes a bunch of decision trees and then uses the average prediction from all of the trees to predict the class of a given sample.
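With scikit-learn, swapping the single tree for a forest is a one-line change (note that scikit-learn's classifier averages the trees' predicted probabilities rather than their raw class votes):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 decision trees, each grown on a random subset of samples and features
forest = RandomForestClassifier(n_estimators=100, random_state=42)
# forest.fit(X_train, y_train)  # same training data as the single tree
```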

Decision trees are fairly easy to visualize and understand. We say that they are *explainable*, because we can see how the decision process works, step by step. This is helpful if we want to understand which features are important and which are not. We can use the decision tree as a step in developing a more complicated model, or on its own. For example, we can use `feature_importances_` to decide which features we can safely trim from our model without it performing worse.
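Here is a tiny self-contained sketch of `feature_importances_` with made-up data, where the first feature fully determines the class and the second is pure noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = (X[:, 0] > 0.5).astype(int)  # only the first feature matters

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(tree.feature_importances_)  # nearly all importance lands on feature 0
```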

A Decision Tree is an excellent starting point for a classification problem, since it will not just give you predictions, but help you understand your data better. As such, it is a good choice for your baseline.

- **node**: the parts of the tree that ask the questions
- **root**: the first node - creates the initial split of data into 2 portions
- **branches** or **edges**: internal nodes - they come between the root node and the leaf nodes
- **decision node** or **leaf node**: when we reach the end of a sequence of questions, this is the node that gives the final answer (for example, of what class a sample belongs to)
- **split**: the portion of data that results from splitting

- **training data**: the data used to fit the model
- **validation data**: data used to fine-tune the model and make it better (we left out that step in the Kaggle notebook)
- **testing data**: the data used to test if the model predicts well on new information
- **instance / sample**: one example from a portion of your data - for example, a single mushroom
- **algorithm**: step by step instructions that we give to a computer to accomplish some task
- **baseline**: a simple model we train at the beginning stages of exploring our data to gain insights for improving our predictions later on (all future models will be compared to this one)

If you enjoyed this article, please take a look at the Kaggle notebook that I made to go with it. It is a beginner friendly example of using the Mushrooms dataset to build a decision tree, evaluate it, and then experiment a bit with the model.

Additionally, I'd love to get feedback about the format of breaking the general overview of a topic apart from the code notebook. I felt that both could stand on their own, so someone could go through the code example to see how it works, or someone could read this article. If you want to read both, hey that's cool too!

Thank you for reading!

(If you don't like personal anecdotes, skip to Tips 😉)

Okay so you're interested in machine learning and you ask Google "what do I need to know to start machine learning?"

"Learn calculus, probability, statistics, linear algebra, learn to code, and then you can start learning machine learning," Google tells you.

Your heart sinks.

Maybe you haven't touched math since high school. Being told you need to learn that amount of math just to get started might be enough to send you away in discouragement, never to revisit the idea of machine learning.

It did me at first.

"Stick to learning web dev. Everyone says it is where people without a tech background should go," I told myself again and again.

The thing is, the more I tried to learn web development, the less interested in it I became. While the more I thought about machine learning, the more I just wanted to find out what it was all about.

Several false starts at learning algebra (so I could learn precalculus, so I could learn calculus, so I could learn linear algebra) later, I realized it would be months and months before I could start actually playing around with machine learning models.

After one more lackluster attempt at building an ecommerce site with React, I finally just started over. "What's the worst that could happen? I only know a little about matrices, I have no exposure to calculus yet, but it's not going to hurt anything to just see what it's like." So I started studying machine learning, developing intuitions for the math concepts machine learning is built on as I needed them.

Okay, with that personal anecdote out of the way...

- Do *not* start by learning probability, statistics, linear algebra, and multivariate calculus. Machine learning is built on that mathematical foundation - so it absolutely is important (don't let anyone tell you otherwise) - but you don't need to start there. That is to say, they may be prerequisites in a course catalog, and topics that will help you understand machine learning algorithms more quickly, but if you are learning on your own and start there, you may never get to the point where you feel ready to begin with machine learning.
- Begin with an introduction to machine learning and when you come to a term you don't understand - for example vector, tensor, function, mean - look it up!
- Understand the **motivation** for using that particular element of math. For example, before learning what a derivative *does* - before looking at the mathematical formulas - find out *why* they are used in the first place.
- If you don't understand a mathematical formula, break it down into the smallest components that you *do* understand.
- Alternately, if you aren't able to break down the formula into smaller components, find a different representation of the concept. Look for a:
  - Code snippet
  - Picture
  - Video explanation
  - Worked out algebraic example
  - All of the above!

- The internet is your friend. If you don't understand one explanation that does not mean you are "not a math person." Instead it may mean you need to learn the concept from a different angle. There is no shame in seeking out another resource if the first one isn't serving you. Don't just keep smashing your head into a brick wall and hope you'll somehow get through it.

I also want to address the fact that many people consciously or subconsciously feel their self worth is tied to being good or not so good at mathematics. Or, at least, I suspect I'm not the only one.

When I can't remember how to do something in algebra, it still shatters my self esteem. This is for a variety of reasons - one of them being that I spent a lot of time (4 years?) thinking that if I couldn't solve algebra problems, I wouldn't do well on the SAT, and wouldn't be able to prove to people that I have good logical reasoning skills.

This could be you, if you've ever thought of yourself as "being bad at math" or "not a math person" or "more of a creative person" (or any other euphemisms for *bad at math*). If maybe you've heard the phrase "you should be able to figure this out with basic high school algebra," and you couldn't. Or in a math explanation someone said "it clearly follows," and it didn't.

You need to start shedding those identities right away. Start thinking of yourself as someone who is learning math. Someone who doesn't know everything yet, but is building their intuitions and is curious to discover more.

You are not unintelligent because you don't understand math. Math isn't obvious. It isn't something we intrinsically know. (Just ask any 5 or 6 year old!)

Math is a learned skill, and as such I believe anyone can learn it. I don't think there are people who are good at math and people who are bad at math. I think that the idea of "a math person" probably has more to do with how well that person learned math the way it was taught to them.

If your brain did well with how math was taught in your school, then you probably excelled in math subjects. But maybe you weren't so lucky. Maybe you've always felt some underlying inferiority because you didn't succeed at math in the classroom. Maybe you've been avoiding math for the rest of your life.

Well I'm here to tell you math is nothing to be afraid of. It is beautiful, it is a useful tool, and you can learn it too.

I got a D in precalculus because I was intimidated by my teacher, never went to office hours, and was so thoroughly confused I didn't even know what questions to ask. And I'm learning how to use partial derivatives in gradient descent from the internet. Will I ever be a calculus expert? Probably not, but that isn't preventing me from learning what I can.

Start with what you know, and build up to what you don't know one step at a time.

What are your thoughts about math? Share them with me in the comments! I'd love to have a conversation with you.

I will also link an article which helped me understand the math behind partial derivatives, which you can look at to fill in some details I won't be covering here.

The goal of gradient descent is to minimize the loss. In an ideal world we want our loss to be 0 (but keep in mind that this isn't realistically possible). We minimize the loss by improving the parameters of the model, which are the weight w and the bias b in linear regression. We improve those parameters either by making them larger, or smaller - whichever makes the loss go down.

Read my article on Linear Regression

Read my article on Mean Squared Error from last week

Gradient descent is an iterative process - this is just a fancy way to say that the process repeats over and over again until you reach some condition for ending it.

The condition for ending could be:

- We are tired of waiting: i.e. we let gradient descent run for a certain number of iterations and then tell it to stop
- The loss is minimized as much as we need for our problem: i.e. the loss is equal to or less than a certain number that we decide on
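Both ending conditions can be sketched in a single loop (the numbers and the stand-in update step here are made up for illustration):

```python
# Hypothetical stopping conditions
max_iterations = 10_000  # stop when we are tired of waiting
target_loss = 0.001      # ...or when the loss is low enough for our problem

loss = 1.0
iterations = 0
while iterations < max_iterations and loss > target_loss:
    loss *= 0.99  # stand-in for one real gradient descent step
    iterations += 1

print(iterations, loss)  # stops as soon as either condition is met
```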

This is where derivatives come in.

For a given function:

- It tells us how much a change in the weight will change the output of the function
- For example, for the MSE loss function: how much will changing w a little bit change the loss?
- Basically, the derivative tells us the slope of the line at a given point

A partial derivative is the derivative of a function that has more than one variable, taken with respect to one of those variables at a time. In the linear regression equation we have w and b, which both can change, so there are two variables that can affect the loss. We want to isolate each of those variables so that we can figure out how much w affects the loss and how much b affects the loss separately.

- So we measure the derivative, or the slope, one variable at a time
- Whichever variable we are *not* measuring, we treat as a constant - we hold its value fixed instead of letting it change
- First we calculate the derivative of the loss with respect to w
- And then we calculate the derivative of the loss with respect to b

Rather than illustrating the formula for partial derivatives of MSE here (which I am still learning to understand myself), I am going to include a link to a *very* helpful article that goes through the mathematical formula step by step for finding the partial derivatives of mean squared error. The author basically does what I was hoping to do in this article before I became a little overwhelmed by the amount of background I would need to provide.

Now that we have calculated the derivatives we need to actually use them to update the parameters w and b.

We will use something called the Learning Rate to tell us how big of a step to take in our gradient descent. It is called the learning rate, because it affects how quickly our model will learn the patterns in the data. What do we do with it? We use it to multiply the derivative with respect to w and b when we update w and b in each iteration of training our model.

So, in short, it's a number that controls how quickly our parameters w and b change. A lower learning rate will cause w and b to change slowly (the model learns slower), and a higher learning rate will cause w and b to change more quickly (the model learns faster).

Remember in my overview of linear regression article I discussed how after we find the loss we'll need to use that information to update our weight and bias to minimize the loss? Well we're finally ready for that step.

A quick summary before we get started with the code. We have a forward pass, where we calculate our predictions and our current loss, based on those predictions. Then we have a backward pass, where we calculate the partial derivative of the loss with respect to each of our parameters (w and b). Then, using those gradients that we gained through calculating the derivatives, we train the model by updating our parameters in the direction that reduces the loss. We use the learning rate to control how much those parameters are changed at a time in each iteration of training.

This is called the forward pass:

So we initialize our parameters

- we can start them off at 0
- or we can start them off at random numbers (but I've decided to start them at 0, to simplify the code)

We calculate linear regression with our current weight and bias

- We calculate the current loss, based on the current values for w and b

```
# import the Python library used for scientific computing
import numpy as np

# predict function, based on y = wx + b equation
def predict(X, w, b):
    return X * w + b

# loss function, based on MSE equation
def mse(X, Y, w, b):
    return np.average((Y - predict(X, w, b)) ** 2)
```

This part is called the backward pass:

- Using the current loss we calculate the derivative of the loss with respect to w...
- ...and with respect to b

```
# calculate the gradients of the loss with respect to w and b
def gradients(X, Y, w, b):
    w_gradient = np.average(2 * X * (predict(X, w, b) - Y))
    b_gradient = np.average(2 * (predict(X, w, b) - Y))
    return (w_gradient, b_gradient)
```
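If you want to convince yourself a gradient formula is right, you can compare it with a finite-difference estimate: nudge w a tiny bit and watch how the loss changes. This sketch is self-contained and uses made-up data; I write the derivative with the sign that makes `w -= gradient * lr` move the loss downhill:

```python
import numpy as np

def predict(X, w, b):
    return X * w + b

def loss(X, Y, w, b):
    return np.average((Y - predict(X, w, b)) ** 2)

# Made-up data and parameter values
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
w, b, h = 0.5, 0.1, 1e-6

# Analytic partial derivative of the loss with respect to w
analytic = np.average(2 * X * (predict(X, w, b) - Y))
# Finite-difference estimate of the same derivative
numeric = (loss(X, Y, w + h, b) - loss(X, Y, w - h, b)) / (2 * h)
print(abs(analytic - numeric))  # very close to zero
```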

- Then we update the weight and bias with the derivative of the loss in the direction that minimizes the loss, by multiplying each derivative with the learning rate
- Then we repeat that process as long as we want (set in the number of epochs) to reduce the loss as much as we want

```
# train the model
# lr stands for learning rate
def train(X, Y, iterations, lr):
    # initialize w and b to 0
    w = 0
    b = 0
    # empty lists to keep track of the parameters' current values and the loss
    log = []
    losses = []
    # the training loop
    for i in range(iterations):
        w_gradient, b_gradient = gradients(X, Y, w, b)
        # update w and b
        w -= w_gradient * lr
        b -= b_gradient * lr
        # record the parameters and recalculate the loss to see our progress
        log.append((w, b))
        losses.append(mse(X, Y, w, b))
    return w, b, log, losses
```
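Here is a self-contained run of the same training loop on made-up data that follows y = 2x + 1 exactly, so we know in advance what w and b should converge to:

```python
import numpy as np

def predict(X, w, b):
    return X * w + b

def gradients(X, Y, w, b):
    w_gradient = np.average(2 * X * (predict(X, w, b) - Y))
    b_gradient = np.average(2 * (predict(X, w, b) - Y))
    return w_gradient, b_gradient

# Toy data that follows y = 2x + 1 exactly
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2 * X + 1

w, b = 0.0, 0.0
for _ in range(5000):
    w_gradient, b_gradient = gradients(X, Y, w, b)
    w -= w_gradient * 0.01  # learning rate of 0.01
    b -= b_gradient * 0.01

print(round(w, 2), round(b, 2))  # close to 2 and 1
```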

**A parting note:** There are tricks to avoid using explicit loops in your code, so that the code will run faster when we start to train on very large datasets. But to give an idea of what is going on, I thought it made sense to visualize the `train` function as a loop.

I hope you enjoyed this overview of Gradient Descent. My code might not be very eloquent, but hopefully it gives you an idea of what's going on here.

If you like this style of building up the functions used in machine learning models a little bit at a time, you may enjoy this book, *Programming Machine Learning*, whose code I relied on in preparing this article.

Last week we left off our discussion of Linear Regression with recognizing the need for a loss function. After a first look at our prediction, we found that we calculated that a 100 year old violin would cost about $57, instead of the real world value of $8000. Clearly, there is some serious work to be done so that our predictions can become more accurate to the real world values. This process is called optimization. And the process begins with a loss function.

Linear regression is based on the equation y = wx + b

- We are predicting the y value (the violin's price) based on
- the x value - our feature (the violin's age)
- multiplied by the weight w
- and added to the bias b, which controls where the line intercepts the y-axis
- at first, w and b are initialized to random values (in other words, an uninformed guess)

Once we get our first prediction (using the random values for w and b that we chose) we need to find out how wrong it is. Then we can begin the process of updating those weights to produce more accurate values for y.

But first, since so much of machine learning depends on using functions...

- I like to think of functions as little factories that take in some numbers, do some operations to them, and then spit out new numbers
- the operations are always the same
- so, for example, if you put in 2 and get out 4 you will always get out 4 whenever you put in 2
- basically a function defines a relationship between an independent variable (x) and a dependent variable (y)
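The 2-in, 4-out example from the list can be written as a tiny Python function:

```python
# A function: same input in, same output out, every time
def double(x):
    return x * 2

print(double(2))  # 4
print(double(2))  # 4 again - the relationship never changes
```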

I'll use mean squared error in this article, since it is a popular loss function (for example, it is the error that ordinary linear regression in the scikit-learn machine learning library minimizes).

**Purpose**: What are we measuring? The error, also called loss - i.e. how big of a difference there is between the real world value and our predicted value.

**The math**:
We are going to...

- Subtract the predicted y value from the actual value
- Square the result

That would be it for our tiny example, because we made one prediction, so we have only one instance. In the real world, however, we will be making more than one prediction at a time, so the next step is:

- Add together all the squared differences of all the predictions - this is the squared error
- Divide the sum of all the squared errors by the number of instances (i.e. take the mean of the squared error)

**Bonus**: if you want to find the Root Mean Squared Error (an evaluation metric often used in Kaggle competitions), you will take the square root of that last step (You might use RMSE instead of MSE since it penalizes large errors more and smaller errors less).
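Those steps can be traced with a few lines of NumPy (the actual and predicted values below are made up):

```python
import numpy as np

# Made-up actual and predicted values for three instances
actual = np.array([800.0, 600.0, 1000.0])
predicted = np.array([750.0, 650.0, 900.0])

squared_errors = (actual - predicted) ** 2  # subtract, then square
mse = squared_errors.mean()                 # sum, then divide by n
rmse = np.sqrt(mse)                         # bonus: root mean squared error
print(mse, rmse)
```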

**The equation**:

The mathematical formula looks like this.

At first it might look intimidating, but I'm going to explain the symbols so that you can start to understand the notation for formulas like this one.

- *n* stands for the number of instances (so if there are 10 instances - i.e. 10 violins - *n* = 10)
- The huge sigma (looks like a funky E) stands for "sum"
- 1/*n* is another way of writing "divide by the number of instances", since multiplying by 1/*n* is the same as dividing by *n*
- Taken together, this part of the formula just stands for "mean" (i.e. average)
- Then Y stands for the observed values of y for all the instances
- And Ŷ (pronounced Y-hat) stands for the predicted y values for all the instances
- Taken together, this part of the formula just stands for "square of the errors"
- What do I mean by all the instances? That's if you made predictions on more than one violin at a time, and put all of those predictions into a matrix
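Written out with all of those pieces in place, the formula reads:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2
```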

The code: In the example below

- X stands for all the instances of x - the features matrix
- Y stands for all the instances of y - the ground truth labels matrix
- w and b stand for the same things they did before: weight and bias

```
# import the Python library used for scientific computing
import numpy as np

# last week's predict function
def predict(X, w, b):
    return X * w + b

# mean squared error loss function
def mse(X, Y, w, b):
    return np.average((Y - predict(X, w, b)) ** 2)
```

At this point we know how wrong we are, but we don't know how to get less wrong. That's why we need to use an optimization function.

This is the part of the process where the model "learns" something. Optimization describes the process of gradually getting closer and closer to the ground truth labels, to make our model more accurate. Based on the error, update the weights and bias a tiny bit in the direction of less error. In other words, you are making the predictions a little bit more correct. Then you check your loss again, and rinse and repeat until you decide you have minimized the error enough, or you run out of time to train your model - whichever comes first.

Now, I know I said I would discuss gradient descent this week, but I am starting to feel that doing it justice probably requires a whole post. Either this week's post will get too long, or I won't be able to unpack the details of gradient descent to a simple enough level. So, please bear with me and stay tuned for next week's Machine Learning Log, in which I *will* (promise!) discuss gradient descent in detail.

In this post I will give an explanation of linear regression, which is a good algorithm to use when the relationship between values is, well, linear. In this post we'll imagine a linear relationship where when the x value (or feature) increases, so does the y value (the label).

With enough labelled data, we can employ linear regression to predict the labels of new data, based on its features.

**Note**: This example will only deal with a regression problem with one variable

y = wx + b

- y is the dependent variable - the thing you are trying to predict
- x is the independent variable - the feature - when x changes it causes a change in y
- w is the slope of the line - in our machine learning example it is the weight - x will be multiplied by this value to get the value for y
- b is the bias - it tells us where our line intercepts the y-axis - if b = 0 then the line passes through the origin of the x and y axes, but this won't always fit our data well, which is why setting the bias is useful

We want to figure out what values to use for the weight and bias so that the predictions are most accurate when compared to the ground truth.

Remember my violin price predictions example from Machine Learning Log 3: a brief overview of supervised learning? In that article we talked about using different features to predict the change in the price of a violin, where the overall goal was to predict the price of a given violin.

To make it simpler for this article we are going to consider only one variable for our price prediction. We will predict the price of the violin only depending on the age of the violin. We'll pretend that there is a linear relationship such that when the age of the violin goes up the price also goes up (this is vaguely true in the real world).

So as we have it now:

- y = violin price
- x = violin age
- w = ?
- b = ?

We want to find out what values to use for the weight w and the bias b so that we can fit the slope of our line to our data. Then we can use this information to predict the price y from a new feature x.

Here's a little graph I drew to represent some possible data points for violins' prices and ages. You can see that when the age goes up so does the price. (This is fictional data, but violins do tend to appreciate in value.)

Let's say we have a 100-year-old violin that costs, I don't know, $8000. We have no idea what to use for w and b yet, so to make our prediction we'll start our weight and bias off at some random numbers:

Since the weight and bias are started off at random numbers, we're basically taking a stab in the dark guess at this point.
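A stab-in-the-dark prediction looks something like this (the random ranges here are my own assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng()
w = rng.random()  # random starting weight, somewhere in [0, 1)
b = rng.random()  # random starting bias, somewhere in [0, 1)

x = 100  # the violin's age
y_prediction = w * x + b  # y = wx + b
print(y_prediction)  # a number under 101 - nowhere near $8000
```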

In this case the result of running my little code above was `y_prediction = 56.75`, which is *very* wrong (since our ground truth is $8000). This isn't a surprise, since we set our weight and bias to random numbers to begin with. Now that we know how wrong our guess was, we will update the weight and bias to gradually get closer to the real world price of $8000. That is where a cost function and an optimizer come in.

In reality we would run the predictions with more than just one sample from our dataset of violins. Using just one violin to understand the relationship between age and price won't give us very accurate predictions, but I presented it that way to simplify the example.

This is where stuff really gets fun! In order to learn from our mistakes we need to have a way to mathematically represent how wrong we are. We do this by comparing the actual price values to the predicted values we got. We find the difference between those values. We use that difference to adjust the weights and bias to fit our values more closely. We rinse and repeat, getting incrementally less wrong, until our predictions are as close as possible to the real world values.

Tune in next week as I explain this process! I'll look at a cost function and describe the process of gradient descent. But I think that's enough for this week. I want to keep these posts fairly short so they can be written and consumed more or less within a Sunday afternoon. 😉

I am still learning and welcome feedback. As always, if you notice anything extremely wrong in my explanation, or have suggestions on how to explain something better, please let me know in the comments!
