Dummy Introduction to Machine Learning

When the uninitiated hear about machine learning, they tend to think about the nemesis of the Avengers: Ultron, the highly intelligent robot that humans accidentally created and that used Artificial Intelligence to move, learn, and think. Here I am going to present the boring version of Machine Learning instead, and although it is more boring than Ultron, it is still awesome to me. I am going to tell the story of Machine Learning in relation to the housing business.

Predict the Price of the House

If you learn Machine Learning from many sources, this is the favourite topic. Suppose you have data on the relation between the size of a house and its price:

size of the house (m2)    price
10                        100
20                        200
30                        290
40                        420
50                        505

Now you train your “robot” with this data so that later it can predict the price of a house with a different size. If I ask your “robot” what the price of a house is if its size is 100 m2, then intuitively, without any “robot”, anyone can answer: roughly it will be 1000. But to automate the task, we need to use a learning algorithm to find the hypothesis function. Then you feed the size of the house to this hypothesis function to get the price.

h_theta(x) = theta0 + theta1 * x

Here, x is the size of the house. So the question is: how do we choose theta0 and theta1 to get the optimal hypothesis function?
Looking at the data, we can answer that easily: theta0 is 0 and theta1 is 10.
That would be a good hypothesis function. The best hypothesis function is the one that minimizes the mean of the squared differences between the actual results and the results from our hypothesis function when we feed it the inputs from our samples. But we have to start somewhere. Basically, we just choose “random” theta0 and theta1 and a learning rate. With this hypothesis function, we “retest” the inputs from our samples: what is the mean of the squared differences between the actual results and the results from our hypothesis function? The ideal result would be 0, but that is not always possible. At least we must strive to get it as low as we can.

What do I mean by calculating the mean of the squared differences between the actual results and the results from our hypothesis function? Suppose we choose theta0 as 5 and theta1 as 3; then we have this function:

h_theta(x) = 5 + 3 * x

Let’s feed the input to this hypothesis function.

input    actual result    result from hypothesis function    difference
10       100              5 + 3 * 10 = 35                    -65
20       200              5 + 3 * 20 = 65                    -135
30       290              5 + 3 * 30 = 95                    -195
40       420              5 + 3 * 40 = 125                   -295
50       505              5 + 3 * 50 = 155                   -350

The mean squared difference is (-65 * -65 + -135 * -135 + -195 * -195 + -295 * -295 + -350 * -350) / 5 = 54000.
Why do we square the differences? To prevent positive and negative differences from cancelling each other out.
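To make the arithmetic concrete, here is a minimal Python sketch (the variable names are my own, purely illustrative) that computes this value:

sizes = [10, 20, 30, 40, 50]
prices = [100, 200, 290, 420, 505]

theta0 = 5
theta1 = 3

# mean of the squared differences between the actual prices and the
# predictions from h_theta(x) = theta0 + theta1 * x
mean_squared_difference = sum(
    ((theta0 + theta1 * x) - y) ** 2 for x, y in zip(sizes, prices)
) / len(sizes)

print(mean_squared_difference)  # 54000.0

Changing theta0 to 0 and theta1 to 10 should print 105.0, which is the value we will compute by hand below.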

Now, what you want is for the value of this function (usually called the cost function) to be as low as possible.

Consider our hypothesis function that we got from our intuition:

h_theta(x) = 0 + 10 * x

Let’s feed the input to this hypothesis function.

input    actual result    result from hypothesis function    difference
10       100              0 + 10 * 10 = 100                  0
20       200              0 + 10 * 20 = 200                  0
30       290              0 + 10 * 30 = 300                  10
40       420              0 + 10 * 40 = 400                  -20
50       505              0 + 10 * 50 = 500                  -5

The value of the cost function is (10 * 10 + -20 * -20 + -5 * -5) / 5 = 105. This is much closer to 0.

So how do we get these numbers, 0 and 10? That is what machine learning does.

If we are going to calculate it mathematically, here is the formula:

theta0 = theta0 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta0

the_derivative_of_squared_mean_error_with_twist_for_theta0 can be described with this small Python function:

def the_derivative_of_squared_mean_error_with_twist_for_theta0(theta0, theta1, inputs, actual_results):
    total = 0
    for x, y in zip(inputs, actual_results):
        difference = (theta0 + theta1 * x) - y
        total += difference
    return total / len(inputs)

theta1 = theta1 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta1

the_derivative_of_squared_mean_error_with_twist_for_theta1 can be described with this small Python function:

def the_derivative_of_squared_mean_error_with_twist_for_theta1(theta0, theta1, inputs, actual_results):
    total = 0
    for x, y in zip(inputs, actual_results):
        difference = (theta0 + theta1 * x) - y
        total += difference * x
    return total / len(inputs)

Choosing the learning rate is quite an art. A learning rate that is too big will make us overshoot and miss the minimum. A learning rate that is too small will make us slow in finding the minimum.

The idea is that feeding theta0 and theta1 into these update formulas gives us better theta0 and theta1. We then use these better values as the input for the next update to get even better theta0 and theta1, and so on, until we reach the optimal result.

So let’s try to find the optimum hypothesis function using the starting numbers: 5 and 3. We choose 0.001 as our learning rate.
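Using the two small functions above, a minimal Python sketch of this loop could look like the following. Note that theta1 is updated with the already-updated theta0, which is what the numbers below reflect:

sizes = [10, 20, 30, 40, 50]
prices = [100, 200, 290, 420, 505]

theta0, theta1 = 5, 3      # the starting values
learning_rate = 0.001

for iteration in range(15000):
    # update theta0 first ...
    theta0 = theta0 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta0(theta0, theta1, sizes, prices)
    # ... then theta1, using the new theta0
    theta1 = theta1 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta1(theta0, theta1, sizes, prices)

    # how good is the current hypothesis function?
    mean_squared_error = sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)
    print(theta0, theta1, mean_squared_error)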

The calculation would output:

theta0                    theta1                    mean squared error
5.208                     10.693760000000001        638.9666201600023
5.1849792                 9.925074623999999         106.11774910331778
5.18504198208             10.0019412781376          100.80010695760147
5.182798701753791         9.994321911133627         100.73779809847709
5.1807862457180285        9.995144221515096         100.72909361493737

These are the first 5 iterations, but the result is still not optimal. If we keep going down the rabbit hole:

… after 15000 iterations …

theta0                    theta1                    mean squared error
-5.267938224854745        10.280035008944099        78.0974389896887
-5.268071336898212        10.280038639212536        78.09740355791446
-5.26820442473769         10.280042268820877        78.097368139024
-5.268337488377579        10.28004589776924         78.09733273301296
-5.268470527822279        10.280049526057745        78.09729733987675

Here, we can see we are almost at the minimum.

With these latest values, the cost that we get is even better than the cost from the hypothesis we got from our intuition.
In other words, y = 0 + 10 * x, which we thought was optimal, is worse than the function we got from this formula: y = -5.268470527822279 + 10.280049526057745 * x.

Calculating manually is tiresome, so let's change strategy. We can use a programming language and a library to get the hypothesis function faster and better. Let's use scikit-learn and Python.

from sklearn import linear_model
import numpy as np

size = [[10], [20], [30], [40], [50]]
price = [100, 200, 290, 420, 505]

regr = linear_model.LinearRegression()
regr.fit(size, price)

# theta0 is the intercept and theta1 is the coefficient of the size
print("theta 0 is %.1f" % regr.intercept_)
print("theta 1 is %.1f" % regr.coef_[0])
print("squared mean error is %.1f" % np.mean((regr.predict(size) - price) ** 2))

The output:

theta 0 is -6.0
theta 1 is 10.3
squared mean error is 78.0

Next time, if we get new data, such as the size of a house, we can predict the price with this optimal hypothesis function, which is the result of machine learning. Let's say the size of the house whose price we want to predict is 75 m2. What do you think the price will be?

price = theta0 + theta1 * size
price = -6 + 10.3 * 75
price = 766.5
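We can also let scikit-learn do this arithmetic for us. Reusing the regr object fitted in the snippet above, something like this should print roughly the same number:

print(regr.predict([[75]]))  # approximately [766.5]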

We call this type of machine learning linear regression, and it is one of the easiest types. 🙂

Multivariate Linear Regression

But as we know, in the real world the price of a house does not depend on only one factor, which was the size in the previous example. There are many factors that can determine the price of a house, such as the number of bedrooms, the location, the age, the building material, etc.

size of the house (m2)    number of bedrooms    location         age        building material    price
10                        2                     senopati         5 years    wood                 300
20                        5                     pluit            2 years    concrete             400
20                        1                     senayan          1 year     bamboo               350
28                        3                     pluit            2 years    concrete             420
50                        10                    kelapa gading    8 years    concrete             705

The basic idea is the same. There is a formula to find the optimal hypothesis function, and scikit-learn is able to calculate it.
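To give a rough idea of how this looks in code, here is a minimal sketch (my own illustration; the categorical columns are one-hot encoded by hand for brevity) that fits the same LinearRegression class on the small table above:

from sklearn.linear_model import LinearRegression

# Feature columns: size, bedrooms, age,
# senopati, pluit, senayan, kelapa gading (location, one-hot encoded),
# wood, concrete, bamboo (building material, one-hot encoded)
X = [
    [10,  2, 5, 1, 0, 0, 0, 1, 0, 0],
    [20,  5, 2, 0, 1, 0, 0, 0, 1, 0],
    [20,  1, 1, 0, 0, 1, 0, 0, 0, 1],
    [28,  3, 2, 0, 1, 0, 0, 0, 1, 0],
    [50, 10, 8, 0, 0, 0, 1, 0, 1, 0],
]
price = [300, 400, 350, 420, 705]

regr = LinearRegression()
regr.fit(X, price)

print("theta 0 is %f" % regr.intercept_)
print("the other thetas are %s" % regr.coef_)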

Logistic Regression

Linear Regression is used to predict a result which is continuous. But sometimes you want to predict a result which is binary: “yes” or “no”. For example, in the housing situation: based on certain house characteristics (such as the number of bedrooms, location, age, building material, price) and potential buyer characteristics (income, age, marital status), would this buyer take a loan to buy the house? That way, we can “advertise” houses more effectively to certain users.
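scikit-learn has a LogisticRegression class for exactly this kind of yes/no prediction. Here is a minimal sketch, with made-up toy numbers purely for illustration:

from sklearn.linear_model import LogisticRegression

# Made-up toy data: [house price, buyer income, buyer age]
X = [
    [300, 50, 25],
    [400, 60, 30],
    [350, 20, 40],
    [705, 90, 35],
    [420, 30, 28],
]
# 1 means the buyer took a loan to buy the house, 0 means they did not
took_loan = [1, 1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(X, took_loan)

# Predict for a new (also made-up) buyer and house
print(clf.predict([[500, 70, 33]]))        # the predicted yes/no (1/0)
print(clf.predict_proba([[500, 70, 33]]))  # the probabilities behind it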

Neural Network

In our first example, we deal with only one factor or feature, which is the size of the house. But sometimes you have to deal with a large number of features, not just 10 or 20, but 50000 features. Using multivariate linear regression would not be efficient, so some smart people invented neural networks to deal with this. It is hard to imagine predicting the price of a house using 50000 features, but there is a more realistic example. Suppose you want to categorize pictures of houses automatically. If you get a picture of the front side of a house, you want to put it in the “Front Outside” category automatically. A picture of the kitchen? Put it in the kitchen category. A picture of the garage? Put it in the garage category.

Image recognition uses a lot of features. You want to train your machine to detect whether certain shapes in the pixels represent the front side of a house or the back side of a house. You need to feed it a lot of training samples and tell the machine whether each one represents the front side, the back side, or the inside of a house.
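As a very rough sketch of what that training could look like, here is a toy example using scikit-learn's small neural network class, MLPClassifier, on made-up random data with far fewer pixels than a real photo would have:

from sklearn.neural_network import MLPClassifier
import numpy as np

# Made-up toy data: each "picture" is flattened into a vector of 16 pixel
# intensities; a real photo would have thousands of such features.
rng = np.random.default_rng(0)
pictures = rng.random((30, 16))        # 30 pictures, 16 features each
categories = rng.integers(0, 3, 30)    # 0 = front side, 1 = kitchen, 2 = garage

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)
clf.fit(pictures, categories)

# Categorize a new picture
print(clf.predict(rng.random((1, 16))))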

Anomaly Detection

We have a series of data coming to us, and sometimes we want to check for outliers. The case in housing is to detect whether there is an anomaly in the price of houses in a certain area. Based on certain characteristics, does this price of a house make sense? Perhaps someone is rigging the price of the house. Or you can detect fraud in house-selling transactions.
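scikit-learn also provides outlier detectors, for example IsolationForest. Here is a minimal sketch on made-up data in which one house is suspiciously expensive for its size:

from sklearn.ensemble import IsolationForest

# Made-up toy data: [size of the house (m2), price]
houses = [
    [10, 100], [20, 200], [30, 290], [40, 420], [50, 505],
    [25, 900],  # suspiciously expensive for its size
]

detector = IsolationForest(contamination=0.2, random_state=0)
labels = detector.fit_predict(houses)  # -1 means outlier, 1 means normal
print(labels)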

Conclusion

So the age of Ultron is coming. I mean, the age of the machine is coming. If you want to succeed, for example with a property start-up, you certainly need machine learning. Your technology stack will be 10x better than the average.