When the uninitiated hear about machine learning, they think of Ultron, the nemesis of the Avengers: a highly intelligent robot, accidentally created by humans, that uses artificial intelligence to move, learn, and think. Here I am going to present the boring side of machine learning. Although it is more boring than Ultron, it is still awesome to me. I am going to tell the story of machine learning in relation to the housing business.

**Predict the Price of the House**

If you learn machine learning from many sources, this is the favourite topic. Suppose you have data relating the size of a house to its price:

| size of the house (m2) | price |
| --- | --- |
| 10 | 100 |
| 20 | 200 |
| 30 | 290 |
| 40 | 420 |
| 50 | 505 |

Now you train your “robot” with this data so that later it can predict the price of a house of a different size. If I ask your “robot” for the price of a house whose size is 100 m2, anyone can answer intuitively, without a “robot”, that it will be roughly 1000. But to automate the task, we need a learning algorithm to find the hypothesis function. Then you feed the size of the house to this hypothesis function to get the price.

*h theta (x) = theta0 + theta1 x*

Here, x is the size of the house. So the question is: how do we choose theta0 and theta1 to get the best hypothesis function?

Looking at the data, we can easily guess an answer: theta0 is 0 and theta1 is 10.

That would be a good hypothesis function. The best hypothesis function is the one that minimizes the mean of the squared differences between the actual results and the results of our hypothesis function when we feed it the inputs from our samples. But we have to start somewhere. Basically, we just choose “random” values for theta0 and theta1, plus a learning rate. With this hypothesis function, we “retest” the inputs from our samples. What is the mean of the squared differences between the actual results and the results of our hypothesis function? The ideal result would be 0, but that is not always possible. At the very least, we must strive for the minimum.

What do I mean by calculating the mean of the squared differences between the actual results and the results of our hypothesis function? Suppose we choose 5 for theta0 and 3 for theta1. We will have this function:

*h theta (x) = 5 + 3x*

Let’s feed the input to this hypothesis function.

| input | actual result | result from hypothesis function | difference |
| --- | --- | --- | --- |
| 10 | 100 | 5 + 3 * 10 = 35 | -65 |
| 20 | 200 | 5 + 3 * 20 = 65 | -135 |
| 30 | 290 | 5 + 3 * 30 = 95 | -195 |
| 40 | 420 | 5 + 3 * 40 = 125 | -295 |
| 50 | 505 | 5 + 3 * 50 = 155 | -350 |

The mean squared difference is (-65 * -65 + -135 * -135 + -195 * -195 + -295 * -295 + -350 * -350) / 5 = 54000.
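The calculation above can be sketched in a few lines of Python. This is only an illustration: the function names `hypothesis` and `mean_squared_error` are my own choices, not part of any library.

```python
# Sample data from the table above: (size in m2, price).
samples = [(10, 100), (20, 200), (30, 290), (40, 420), (50, 505)]


def hypothesis(theta0, theta1, x):
    """Predicted price for a house of size x."""
    return theta0 + theta1 * x


def mean_squared_error(theta0, theta1, data):
    """Mean of the squared differences between predictions and actual prices."""
    squared = [(hypothesis(theta0, theta1, x) - y) ** 2 for x, y in data]
    return sum(squared) / len(data)


cost = mean_squared_error(5, 3, samples)
print(cost)  # 54000.0
```

Changing the two theta arguments lets you score any other candidate line against the same samples.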

Why do we square the differences? So that positive and negative differences do not cancel each other out.

This measure of how far off the hypothesis function is (usually called the cost function) is what you want to make as low as possible.

Consider our hypothesis function that we got from our intuition:

*h theta (x) = 0 + 10x*

Let’s feed the input to this hypothesis function.

| input | actual result | result from hypothesis function | difference |
| --- | --- | --- | --- |
| 10 | 100 | 0 + 10 * 10 = 100 | 0 |
| 20 | 200 | 0 + 10 * 20 = 200 | 0 |
| 30 | 290 | 0 + 10 * 30 = 300 | 10 |
| 40 | 420 | 0 + 10 * 40 = 400 | -20 |
| 50 | 505 | 0 + 10 * 50 = 500 | -5 |

The cost function is (10*10 + -20*-20 + -5*-5) / 5 = 105. This is much closer to 0.

So how do we get these numbers, 0 and 10? That is what machine learning does.

If we calculate it mathematically, here are the formulas:

*theta0 = theta0 – learning_rate * the_derivatives_of_squared_mean_error_with_twist*

For theta0, the_derivatives_of_squared_mean_error_with_twist can be described with this pseudocode, where actual_result is the actual price that belongs to the current input:

    sum = 0
    for input in inputs:
        difference = (theta0 + theta1 * input) - actual_result
        sum += difference
    return (sum / number_of_inputs)

*theta1 = theta1 – learning_rate * the_derivatives_of_squared_mean_error_with_twist*

For theta1, the_derivatives_of_squared_mean_error_with_twist can be described with this pseudocode:

    sum = 0
    for input in inputs:
        difference = (theta0 + theta1 * input) - actual_result
        sum += difference * input
    return (sum / number_of_inputs)
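Here is a runnable sketch of one round of these two formulas, starting from the theta0 = 5 and theta1 = 3 used earlier. The learning rate of 0.001 and the function names are assumptions for illustration. I also assume that theta1's update uses the freshly updated theta0, because that ordering reproduces the numbers quoted later in this article; the more common textbook form computes both derivatives first and then updates both values at once.

```python
samples = [(10, 100), (20, 200), (30, 290), (40, 420), (50, 505)]


def gradient_theta0(theta0, theta1, data):
    """Mean of (prediction - actual) over all samples."""
    return sum((theta0 + theta1 * x) - y for x, y in data) / len(data)


def gradient_theta1(theta0, theta1, data):
    """Mean of (prediction - actual) * input over all samples."""
    return sum(((theta0 + theta1 * x) - y) * x for x, y in data) / len(data)


def update(theta0, theta1, data, learning_rate):
    """One update of theta0, then one update of theta1.

    Assumption: theta1's gradient is computed with the already-updated theta0.
    """
    theta0 = theta0 - learning_rate * gradient_theta0(theta0, theta1, data)
    theta1 = theta1 - learning_rate * gradient_theta1(theta0, theta1, data)
    return theta0, theta1


theta0, theta1 = update(5, 3, samples, 0.001)
print(theta0, theta1)  # approximately 5.208 and 10.69376
```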

Choosing the learning rate is quite an art. A learning rate that is too big makes us overshoot the minimum. A learning rate that is too small makes us slow to find it.
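To get a feel for this, the sketch below runs 20 updates on the house data with three different learning rates and compares the resulting cost. The three rates, the iteration count, and the helper names are arbitrary choices of mine for illustration, and the starting point is the theta0 = 5, theta1 = 3 from earlier (cost 54000).

```python
samples = [(10, 100), (20, 200), (30, 290), (40, 420), (50, 505)]


def cost(theta0, theta1):
    """Mean squared difference between predictions and actual prices."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in samples) / len(samples)


def run(learning_rate, iterations=20, theta0=5.0, theta1=3.0):
    """Run a few gradient descent updates and return the final cost."""
    m = len(samples)
    for _ in range(iterations):
        theta0 -= learning_rate * sum(theta0 + theta1 * x - y for x, y in samples) / m
        theta1 -= learning_rate * sum((theta0 + theta1 * x - y) * x for x, y in samples) / m
    return cost(theta0, theta1)


too_big = run(0.01)       # overshoots further on every step, so the cost explodes
reasonable = run(0.001)   # settles close to the minimum within a few steps
too_small = run(0.00001)  # moves in the right direction, but slowly

print(too_big, reasonable, too_small)
```

With these settings the too-big rate ends up with a cost far above the starting 54000, while the too-small rate has improved on 54000 but is still nowhere near the reasonable one.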

The idea is that feeding theta0 and theta1 into these formulas gives us a better theta0 and theta1. We then use these better values as the next input to get an even better theta0 and theta1, and so on, until we get the optimum result.

So let’s try to find the optimum hypothesis function starting from theta0 = 5 and theta1 = 3. We choose 0.001 as our learning rate.

The calculation outputs:

| theta0 | theta1 | mean squared error |
| --- | --- | --- |
| 5.208 | 10.693760000000001 | 638.9666201600023 |
| 5.1849792 | 9.925074623999999 | 106.11774910331778 |
| 5.18504198208 | 10.0019412781376 | 100.80010695760147 |
| 5.182798701753791 | 9.994321911133627 | 100.73779809847709 |
| 5.1807862457180285 | 9.995144221515096 | 100.72909361493737 |

These are the first 5 iterations, but the result is still not optimum. If we keep going down the rabbit hole:

… after 15000 iterations …

| theta0 | theta1 | mean squared error |
| --- | --- | --- |
| -5.267938224854745 | 10.280035008944099 | 78.0974389896887 |
| -5.268071336898212 | 10.280038639212536 | 78.09740355791446 |
| -5.26820442473769 | 10.280042268820877 | 78.097368139024 |
| -5.268337488377579 | 10.28004589776924 | 78.09733273301296 |
| -5.268470527822279 | 10.280049526057745 | 78.09729733987675 |

Here, we can see that we have almost reached the bottom of the cost function.

With these latest values, the cost we get is even lower than the cost of the hypothesis we got from our intuition.

In other words, *y = 0 + 10x*, which we thought was optimum, is worse than the function we got from these formulas: *y = -5.268470527822279 + 10.280049526057745 x*.
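The whole loop can be sketched as follows. It is my reconstruction under stated assumptions, not the exact script behind the tables: the function names are mine, and theta0 is updated before theta1's derivative is computed because that ordering matches the first rows of the table. The exact digits after 15000 iterations may differ slightly from the rows above depending on where the count starts.

```python
samples = [(10, 100), (20, 200), (30, 290), (40, 420), (50, 505)]


def mean_squared_error(theta0, theta1):
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in samples) / len(samples)


def gradient_descent(theta0, theta1, learning_rate, iterations):
    m = len(samples)
    for _ in range(iterations):
        # Assumption: theta0 is updated first and its new value is used for theta1.
        theta0 -= learning_rate * sum(theta0 + theta1 * x - y for x, y in samples) / m
        theta1 -= learning_rate * sum((theta0 + theta1 * x - y) * x for x, y in samples) / m
    return theta0, theta1


theta0, theta1 = gradient_descent(5.0, 3.0, 0.001, 15000)
learned_cost = mean_squared_error(theta0, theta1)
intuition_cost = mean_squared_error(0, 10)  # 105.0

print(theta0, theta1, learned_cost)
print(learned_cost < intuition_cost)  # True
```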

Calculating manually is tiresome, so let’s use a strategy. We can use a programming language and a library to get the hypothesis function faster and better. Let’s use scikit-learn and Python.

    from sklearn import linear_model
    import numpy as np

    size = [[10], [20], [30], [40], [50]]
    price = [100, 200, 290, 420, 505]

    regr = linear_model.LinearRegression()
    regr.fit(size, price)

    # intercept_ is theta 0 and coef_[0] is theta 1
    print("theta 0 is %f" % regr.intercept_)
    print("theta 1 is %f" % regr.coef_[0])
    print("mean squared error is %f" % np.mean((regr.predict(size) - price) ** 2))

The output:

    theta 0 is -6.000000
    theta 1 is 10.300000
    mean squared error is 78.000000

The next time we get new data, such as the size of a house, we can predict the price with this optimum hypothesis function, the result of machine learning. Let’s say the size of the house whose price we want to predict is 75. What do you think the price will be?

    price = theta 0 + theta 1 * x
    price = -6 + 10.3 * 75
    price = 766.5

We call this type of machine learning linear regression, and it is one of the easiest types. 🙂
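For a single feature, the optimum theta values can also be cross-checked without any library, using the standard closed-form least-squares formulas. This is a side sketch of mine rather than what scikit-learn does internally; the variable names are assumptions.

```python
sizes = [10, 20, 30, 40, 50]
prices = [100, 200, 290, 420, 505]

mean_size = sum(sizes) / len(sizes)
mean_price = sum(prices) / len(prices)

# Slope: co-variation of size and price divided by the variation of size.
covariance = sum((x - mean_size) * (y - mean_price) for x, y in zip(sizes, prices))
variance = sum((x - mean_size) ** 2 for x in sizes)
theta1 = covariance / variance

# Intercept: the fitted line passes through the point of means.
theta0 = mean_price - theta1 * mean_size

predicted = theta0 + theta1 * 75
print(theta0, theta1, predicted)  # approximately -6.0, 10.3 and 766.5
```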

**Multivariate Linear Regression**

But as we know, in the real world the price of a house does not depend on just one factor, such as the size in the previous example. Many factors can decide the price of a house, such as the number of bedrooms, the location, the age, the building material, etc.

| size of the house (m2) | number of bedrooms | location | age | building material | price |
| --- | --- | --- | --- | --- | --- |
| 10 | 2 | senopati | 5 years | wood | 300 |
| 20 | 5 | pluit | 2 years | concrete | 400 |
| 20 | 1 | senayan | 1 year | bamboo | 350 |
| 28 | 3 | pluit | 2 years | concrete | 420 |
| 50 | 10 | kelapa gading | 8 years | concrete | 705 |

The basic idea is the same. There is a formula to find the optimum hypothesis function, and scikit-learn is able to calculate it.
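As a rough illustration, here is a plain-Python sketch that fits one theta per feature by solving the least-squares "normal equations". It is a stand-in I wrote for this article, not scikit-learn's implementation. It uses only the numeric columns of the table (size, bedrooms, age); location and building material would first have to be turned into numbers, which I leave out. All function names are my own.

```python
# Numeric columns from the table above: size (m2), bedrooms, age (years).
features = [[10, 2, 5], [20, 5, 2], [20, 1, 1], [28, 3, 2], [50, 10, 8]]
prices = [300, 400, 350, 420, 705]


def solve(matrix, vector):
    """Solve matrix * t = vector by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    rows = [list(row) + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def fit(xs, ys):
    """Least-squares thetas from the normal equations (X^T X) theta = X^T y."""
    design = [[1.0] + [float(v) for v in row] for row in xs]  # leading 1 is for theta0
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(thetas, row):
    """theta0 + theta1 * x1 + theta2 * x2 + ..."""
    return thetas[0] + sum(t * v for t, v in zip(thetas[1:], row))


thetas = fit(features, prices)
residuals = [predict(thetas, row) - y for row, y in zip(features, prices)]
mse = sum(r * r for r in residuals) / len(residuals)
print(thetas)
print(mse)
```

With only five houses and four theta values, the fit follows the samples very closely, so treat the numbers as a demonstration of the mechanics rather than a usable price model.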

**Logistic Regression**

Linear regression predicts a result that is continuous. But sometimes you want to predict a result that is binary: “yes” or “no”. For example, in the housing situation, based on certain house characteristics (such as number of bedrooms, location, age, building material, price) and potential buyer characteristics (income, age, marital status), would this buyer take out a loan to buy the house? That way, we can “advertise” houses more effectively to certain users.
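A minimal sketch of the idea: logistic regression pushes the same kind of weighted sum through a sigmoid function, which turns it into a probability between 0 and 1, and we answer “yes” when that probability is at least 0.5. The weights, the two features, and the buyer below are hypothetical values I made up; in practice the weights would be learned from data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def probability_of_loan(thetas, features):
    """Logistic hypothesis: sigmoid(theta0 + theta1 * x1 + theta2 * x2 + ...)."""
    z = thetas[0] + sum(t * x for t, x in zip(thetas[1:], features))
    return sigmoid(z)


# Hypothetical, hand-picked weights (not learned from real data):
# intercept, buyer income, house price.
thetas = [-1.0, 0.8, -0.01]

buyer = [5, 420]  # hypothetical buyer with income 5 looking at a house priced 420
p = probability_of_loan(thetas, buyer)
answer = "yes" if p >= 0.5 else "no"
print(round(p, 3), answer)
```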

**Neural Network**

In our first example, we dealt with only one factor, or feature: the size of the house. But sometimes you have to deal with a large number of features, not just 10 or 20 but 50000. Using multivariate linear regression would not be efficient, so some smart people invented neural networks to deal with this. It’s hard to imagine predicting the price of a house using 50000 features, but there is a more realistic example. Suppose you want to categorize pictures of a house automatically. If you get a picture of the front of the house, you want to put it in “Front Outside” automatically. A picture of the kitchen? Put it in the kitchen category. A picture of the garage? Put it in the garage category.

Image recognition uses a lot of features. You want to train your machine to detect whether certain shapes in the pixels represent the front side or the back side of the house. You need to feed it a lot of training samples and tell the machine whether each one represents the front side, the back side, or the inner side of the house.
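To make this a bit more concrete, here is a toy sketch of how a small neural network turns pixel values into a category: one hidden layer, then a softmax that gives a probability per category. Everything in it is hypothetical: the 4-pixel “image”, the weights, and the function names are made up for illustration, and a real network would learn its weights from many labelled pictures.

```python
import math

categories = ["front outside", "kitchen", "garage"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias, for each unit in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def classify(pixels, hidden_w, hidden_b, output_w, output_b):
    hidden = [sigmoid(z) for z in layer(pixels, hidden_w, hidden_b)]
    probabilities = softmax(layer(hidden, output_w, output_b))
    best = max(range(len(categories)), key=lambda i: probabilities[i])
    return categories[best], probabilities


# Hypothetical weights for a 4-pixel "image"; a trained network would learn these.
hidden_w = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, 0.2, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-0.5, 0.7], [0.2, 0.3]]
output_b = [0.0, 0.0, 0.0]

label, probabilities = classify([0.9, 0.1, 0.4, 0.7], hidden_w, hidden_b, output_w, output_b)
print(label, probabilities)
```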

**Anomaly Detection**

We have a series of data coming to us, and sometimes we want to check it for outliers. In housing, the case is detecting whether there is an anomaly in the prices of houses in certain areas. Based on certain characteristics, does the price of this house make sense? Perhaps someone has rigged the price. You can also detect fraud in house sale transactions.
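One very simple way to flag such outliers, shown here only as a sketch, is to measure how many standard deviations each price is away from the average of its area. The prices below are made-up numbers, and the threshold of 2 is an arbitrary choice of mine; real anomaly detection would use more features and a better-tuned rule.

```python
import statistics

# Hypothetical prices per m2 for houses in one area (made up for illustration).
prices_per_m2 = [10.0, 10.5, 9.8, 10.2, 9.9, 30.0]


def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]


anomalies = find_anomalies(prices_per_m2)
print(anomalies)  # [30.0]
```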

**Conclusion**

So the age of Ultron is coming; I mean, the age of the machine is coming. If you want to succeed in, for example, a property start-up, you certainly need machine learning. Your technology stack will be 10x better than the average.