Dummy Introduction to Machine Learning

When the uninitiated hear about machine learning, they think of the nemesis of the Avengers, Ultron: the highly intelligent robot, accidentally created by humans, that used Artificial Intelligence to move, learn, and think. Here I am going to present the boring version of Machine Learning, and although it is more boring than Ultron, it is still awesome to me. I am going to tell the story of Machine Learning in relation to the housing business.

Predict the Price of the House

If you learn Machine Learning from many sources, this is the most popular topic. Suppose you have data on the relation between the size of a house and its price:

size of the house (m2)    price
10                        100
20                        200
30                        290
40                        420
50                        505

Now you train your “robot” with this data so that later it can predict the price of a house of a different size. If I ask your “robot” for the price of a house whose size is 100 m2, then intuitively, without any “robot”, anyone can answer: roughly 1000. But to automate the task, we need a learning algorithm to find the hypothesis function. Then we feed the size of the house to this hypothesis function to get the price.

h_theta(x) = theta0 + theta1 * x

Here, x is the size of the house. So the question is: how do we choose theta0 and theta1 to get the optimal hypothesis function?
Looking at the data, we can answer it easily: theta0 is 0 and theta1 is 10.
That would be a good hypothesis function. The best hypothesis function is the one that minimizes the mean of the squared differences between the actual results and the results from our hypothesis function when we feed it the inputs from our samples. But we have to start somewhere. Basically, we just choose "random" values for theta0 and theta1 and a learning rate. With this hypothesis function, we "retest" the inputs from our samples: what is the mean of the squared differences between the actual results and the results from our hypothesis function? The ideal result would be 0, but that is not always possible. At the very least, we must strive to make it as small as we can.

What do I mean by calculating the mean of the squared differences between the actual results and the results from our hypothesis function? Suppose we choose 5 for theta0 and 3 for theta1. Then we have this function:

h_theta(x) = 5 + 3x

Let’s feed the input to this hypothesis function.

input    actual result    result from hypothesis function    difference
10       100              5 + 3 * 10 = 35                    -65
20       200              5 + 3 * 20 = 65                    -135
30       290              5 + 3 * 30 = 95                    -195
40       420              5 + 3 * 40 = 125                   -295
50       505              5 + 3 * 50 = 155                   -350

The mean squared difference is (-65 * -65 + -135 * -135 + -195 * -195 + -295 * -295 + -350 * -350) / 5 = 54000.
Why do we square the differences? To prevent positive and negative differences from cancelling each other out.

Now, what you want is for the value of this function (usually called the cost function) to be as low as possible.

Consider our hypothesis function that we got from our intuition:

h_theta(x) = 0 + 10x

Let’s feed the input to this hypothesis function.

input    actual result    result from hypothesis function    difference
10       100              0 + 10 * 10 = 100                  0
20       200              0 + 10 * 20 = 200                  0
30       290              0 + 10 * 30 = 300                  10
40       420              0 + 10 * 40 = 400                  -20
50       505              0 + 10 * 50 = 500                  -5

The cost is (10 * 10 + -20 * -20 + -5 * -5) / 5 = 105. This is much closer to 0.
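
If you like, both cost values can be checked with a few lines of Python, a minimal sketch using the sample data above (the helper name mean_squared_error_cost is just for illustration):

# Cost (mean squared error) of a hypothesis h(x) = theta0 + theta1 * x
# over the sample data above.
def mean_squared_error_cost(theta0, theta1, sizes, prices):
    return sum(((theta0 + theta1 * x) - y) ** 2
               for x, y in zip(sizes, prices)) / len(sizes)

sizes = [10, 20, 30, 40, 50]
prices = [100, 200, 290, 420, 505]

print(mean_squared_error_cost(5, 3, sizes, prices))   # 54000.0
print(mean_squared_error_cost(0, 10, sizes, prices))  # 105.0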

So how do we get these numbers, 0 and 10? That is what machine learning does.

If we are going to calculate it mathematically, here are the update formulas:

theta0 = theta0 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta0

the_derivative_of_squared_mean_error_with_twist_for_theta0 (the "twist" is that the constant factor from differentiating the square is folded into the learning rate) can be described with this Python function:

def derivative_for_theta0(theta0, theta1, inputs, actual_results):
  # the mean of (hypothesis result - actual result) over all samples
  total = 0
  for x, y in zip(inputs, actual_results):
    total += (theta0 + theta1 * x) - y
  return total / len(inputs)

theta1 = theta1 - learning_rate * the_derivative_of_squared_mean_error_with_twist_for_theta1

the_derivative_of_squared_mean_error_with_twist_for_theta1 can be described with this Python function:

def derivative_for_theta1(theta0, theta1, inputs, actual_results):
  # the mean of (hypothesis result - actual result) * input over all samples
  total = 0
  for x, y in zip(inputs, actual_results):
    total += ((theta0 + theta1 * x) - y) * x
  return total / len(inputs)

Choosing the learning rate is quite an art. A learning rate that is too big would make us overshoot and miss the minimum. One that is too small would make finding the minimum slow.

The idea is that feeding theta0 and theta1 into these update formulas gives us better theta0 and theta1. These better values are then used as the input for the next update to get even better values, and so on, until we reach the optimal result.

So let’s try to find the optimal hypothesis function using the starting values 5 and 3. We choose 0.001 as our learning rate.
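
Here is a minimal sketch of that loop in Python, with the derivative functions above written inline. Note that it updates theta0 first and then uses the already-updated value when computing theta1's derivative, which is what the numbers below suggest; a textbook implementation would update both parameters simultaneously.

sizes = [10, 20, 30, 40, 50]
prices = [100, 200, 290, 420, 505]
n = len(sizes)

theta0, theta1 = 5.0, 3.0   # the "random" starting values
learning_rate = 0.001

for iteration in range(15000):
    # derivative of the cost with respect to theta0
    d0 = sum((theta0 + theta1 * x) - y for x, y in zip(sizes, prices)) / n
    theta0 = theta0 - learning_rate * d0
    # derivative with respect to theta1, using the updated theta0
    d1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(sizes, prices)) / n
    theta1 = theta1 - learning_rate * d1
    # mean squared error with the new parameters
    mse = sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(sizes, prices)) / n
    print(theta0, theta1, mse)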

The calculation would output:

theta0                  theta1                  mean squared error
5.208                   10.693760000000001      638.9666201600023
5.1849792               9.925074623999999       106.11774910331778
5.18504198208           10.0019412781376        100.80010695760147
5.182798701753791       9.994321911133627       100.73779809847709
5.1807862457180285      9.995144221515096       100.72909361493737

These are the first 5 iterations, but the result is still not optimal. If we keep going down the rabbit hole:

… after 15000 iterations …

theta0                  theta1                  mean squared error
-5.267938224854745      10.280035008944099      78.0974389896887
-5.268071336898212      10.280038639212536      78.09740355791446
-5.26820442473769       10.280042268820877      78.097368139024
-5.268337488377579      10.28004589776924       78.09733273301296
-5.268470527822279      10.280049526057745      78.09729733987675

Here, we can see we are almost at the optimum.

With these latest values, the cost is even lower than the cost of the hypothesis we got from our intuition.
In other words, y = 0 + 10x, which we thought was optimal, is worse than the function we got from this procedure: y = -5.268470527822279 + 10.280049526057745x.

Calculating this manually is tiresome, so let’s use a better strategy. We can use a programming language and a library to get the hypothesis function faster and better. Let’s use scikit-learn and Python.

from sklearn import linear_model
import numpy as np

size = [[10], [20], [30], [40], [50]]
price = [100, 200, 290, 420, 505]

regr = linear_model.LinearRegression()
regr.fit(size, price)

# intercept_ is theta0 and coef_ holds theta1
print("theta 0 is", regr.intercept_)
print("theta 1 is", regr.coef_[0])
print("squared mean error is", np.mean((regr.predict(size) - price) ** 2))

The output:

theta 0 is -6.0
theta 1 is 10.3
squared mean error is 78.0

Next time, when we get new data, such as the size of a house, we can predict its price with this optimal hypothesis function, the result of machine learning. Let’s say the size of the house whose price we want to predict is 75 m2. What do you think the price will be?

price = theta0 + theta1 * x
price = -6 + 10.3 * 75
price = 766.5
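
Equivalently, continuing from the scikit-learn snippet above, the fitted regr model gives the same answer:

# Predict the price of a 75 m2 house with the model fitted earlier.
print(regr.predict([[75]]))  # roughly [ 766.5]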

We call this type of machine learning linear regression, and it is one of the easiest types. :)

Multivariate Linear Regression

But as we know, in the real world the price of a house does not depend on a single factor, which in the previous example was the size. There are many factors that can determine
the price of a house, such as the number of bedrooms, the location, the age, the building material, etc.

size of the house (m2)  number of bedrooms  location       age      building material  price
10                      2                   senopati       5 years  wood               300
20                      5                   pluit          2 years  concrete           400
20                      1                   senayan        1 year   bamboo             350
28                      3                   pluit          2 years  concrete           420
50                      10                  kelapa gading  8 years  concrete           705

The basic idea is the same: there is a formula to find the optimal hypothesis function, and scikit-learn can calculate it.
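
As a rough sketch (not an exact pipeline), here is how the numeric columns from the table above could be fed to scikit-learn; the categorical columns such as location and building material would first have to be encoded as numbers (for example with one-hot encoding):

from sklearn import linear_model

# [size in m2, number of bedrooms, age in years] for each house above
features = [
    [10, 2, 5],
    [20, 5, 2],
    [20, 1, 1],
    [28, 3, 2],
    [50, 10, 8],
]
prices = [300, 400, 350, 420, 705]

regr = linear_model.LinearRegression()
regr.fit(features, prices)

# theta0 plus one coefficient per feature (theta1, theta2, theta3)
print(regr.intercept_, regr.coef_)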

Logistic Regression

Linear regression is used to predict a result that is continuous. But sometimes you want to predict a result that is binary: “yes” or “no”. For example, in the housing situation,
based on certain house characteristics (such as number of bedrooms, location, age, building material, price) and potential buyer characteristics (income, age, marital status),
would this buyer take out a loan to buy the house? That way, we can “advertise” houses more effectively to certain users.
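
A minimal sketch with scikit-learn's LogisticRegression, where the features and labels are made up purely for illustration (not real data):

from sklearn.linear_model import LogisticRegression

# Hypothetical rows: [house price, buyer's income, buyer's age]
X = [[300, 50, 25],
     [400, 40, 30],
     [350, 90, 45],
     [420, 30, 28],
     [705, 120, 50]]
# 1 = the buyer took a loan to buy the house, 0 = the buyer did not
y = [0, 0, 1, 0, 1]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Would a buyer with income 100 and age 40 take a loan for a 500 house?
print(clf.predict([[500, 100, 40]]))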

Neural Network

In our first example, we deal with only one factor, or feature: the size of the house. But sometimes you have to deal with a large number of features, not just 10 or 20, but
50000 features. Using multivariate linear regression would not be efficient, so some smart people invented neural networks to deal with this. It is hard to imagine predicting
the price of a house using 50000 features, but there is a more realistic example. Suppose you want to categorize pictures of houses automatically. If you get a picture of
the front of the house, you want to put it in the “Front Outside” category automatically. A picture of the kitchen? Put it in the kitchen category. A picture of the garage? Put it in the
garage category.

Image recognition uses a lot of features. You want to train your machine to detect whether certain shapes formed by the pixels represent the front of the house or the back of
the house. You need to feed it a lot of training samples and tell the machine whether each one represents the front, the back, or the inside of the house.
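
For a taste of what that looks like in code, here is a hedged sketch with scikit-learn's MLPClassifier (a small neural network). The "pixel" vectors and labels are made up purely for illustration; a real image classifier would use far more data and usually a dedicated deep-learning library.

from sklearn.neural_network import MLPClassifier

# Hypothetical flattened "pixel" feature vectors, one per picture
X = [[0.1] * 20,
     [0.9] * 20,
     [0.5] * 20]
y = ["front outside", "kitchen", "garage"]

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
clf.fit(X, y)

# Classify a new picture (again, made-up pixels)
print(clf.predict([[0.85] * 20]))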

Anomaly Detection

We have a series of data coming to us, and sometimes we want to check for outliers. A housing use case is to detect whether there is an anomaly in the prices of houses in certain
areas. Based on certain characteristics, does this price make sense? Perhaps someone is rigging the price of the house. Or you can detect fraud in house-sale transactions.
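
As a hedged sketch (assuming a reasonably recent scikit-learn), an off-the-shelf outlier detector such as IsolationForest can flag suspicious entries; the data here is made up for illustration:

from sklearn.ensemble import IsolationForest

# Hypothetical rows: [size in m2, price]
houses = [[10, 100],
          [20, 200],
          [30, 290],
          [40, 420],
          [50, 505],
          [15, 900]]   # a suspiciously expensive small house

detector = IsolationForest(contamination=0.2, random_state=0)
detector.fit(houses)

# predict() returns 1 for normal points and -1 for suspected outliers
print(detector.predict(houses))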

Conclusion

So the age of Ultron is coming; I mean, the age of machines is coming. If you want to succeed, for example, in a property start-up, you certainly need machine learning. Your technology
stack will be 10x better than the average.

Setup Django With Python 3 and Nginx in Vagrant

If you want to set up a Django project with Python 3 and Nginx in Vagrant, follow along. At first I used the Django one-click install image on DigitalOcean, but it came with Python 2. I was disappointed. So I plan to use Django with Python 3 on DigitalOcean, but first I have to set it up in Vagrant for development purposes. Here are the steps:

Install Vagrant

Go to the Vagrant homepage. Choose the installer file based on your OS. Once it is installed:

mkdir project_dir

cd project_dir

vagrant init

Then change the Vagrantfile. Make it like this:

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.network "private_network", ip: "192.168.33.10"

  config.vm.host_name = "project.dev"
end

Then do this to boot the Vagrant machine. I gave project.dev as this Vagrant machine’s hostname (set in the Vagrantfile above).

vagrant up

Then go inside the Vagrant machine:

vagrant ssh

Install Python

I decided not to use the Python 3 that comes with Ubuntu because its pip support is broken. There is a way to work around that problem, but I prefer to build Python from source. My friend does not recommend this solution, but I contribute to Python core regularly (https://hg.python.org/cpython/search/?rev=vajrasky&revcount=160), so I am comfortable with building Python from source. I even once debated a Python technical decision with a Python core developer. Of course, you are welcome to use the Python 3 from Ubuntu.

First, we install the dependencies for building Python from source.

sudo apt-get install build-essential libbz2-dev libncurses5-dev libreadline6-dev libsqlite3-dev libgdbm-dev liblzma-dev tk8.6-dev libssl-dev python3-setuptools

mkdir ~/download

cd ~/download

wget https://www.python.org/ftp/python/3.4.2/Python-3.4.2.tgz

tar -xvf Python-3.4.2.tgz

sudo mkdir /opt/python

sudo chown -R vagrant:vagrant /opt/python

cd Python-3.4.2

./configure --prefix=/opt/python

make

make install

As you can see, I installed a bunch of packages just to make sure the Python compilation runs 100% smoothly and all modules get built.

Setup Nginx

Install the Nginx server. It will be the gateway to the outside world: outsiders will be greeted by Nginx before they are served by the Django application.

sudo apt-get install nginx

Then change the Nginx config file.

sudo vim /etc/nginx/sites-available/default

Make it like this:

server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    root /usr/share/nginx/html;

    index index.html index.htm;
    server_name project.dev;

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_pass http://127.0.0.1:8000;
    }

    location /media  {
        alias /vagrant/project_project/project_project/media;
    }

    location /static {
        alias /vagrant/project_project/project_project/static;
    }

    keepalive_timeout 5;

    client_max_body_size 4G;
}

Then restart Nginx.

sudo service nginx restart

Setup Python Virtual Environment

I don’t want to pollute the system-wide directories when I install third-party Python libraries. Create a directory for our Python virtual environment.

sudo mkdir /opt/project_env

sudo chown vagrant:vagrant /opt/project_env

/opt/python/bin/pyvenv /opt/project_env

It is better to give ownership of the Python virtual environment directory to a user other than root, so that when we create the Python virtual environment we can avoid sudo. Sudo would use the system Python, not our custom Python.

Install Gunicorn and Django

Gunicorn will be the web server that runs our Django application.

First, activate Python virtual environment.

source /opt/project_env/bin/activate

Then install Gunicorn.

pip install gunicorn

And install Django.

pip install django

Deactivate the Python virtual environment.

deactivate

Setup PostgreSQL

This will be our database of choice.

sudo apt-get install libpq-dev postgresql postgresql-contrib

Then create the database and a database user.

sudo su - postgres

createdb project_db

createuser project_user

Go inside the PostgreSQL command prompt.

psql

Do this to set the password and grant permission.

alter user project_user with password 'penguinbercinta';

grant all privileges on database project_db to project_user;

\q

Return to the vagrant user.

exit

Then install the psycopg2 library inside the Python virtual environment.

source /opt/project_env/bin/activate

pip install psycopg2

Deactivate the Python virtual environment.

deactivate

Setup Django Application

Create the Django application.

source /opt/project_env/bin/activate

cd /vagrant

django-admin startproject project_project

Before you go on, you need to change the settings of the Django project so it uses PostgreSQL.

vim /vagrant/project_project/project_project/settings.py

Change the ‘DATABASES’ part so it looks like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'project_db',
        'USER': 'project_user',
        'PASSWORD': 'penguinbercinta',
        'HOST': 'localhost',
        'PORT': '',
    }
}

Then migrate the database.

cd /vagrant/project_project

python manage.py migrate

deactivate

You should see no errors.

Setup Supervisor

We will use Supervisor to run Gunicorn. This is to make sure we can monitor and restart Gunicorn easily. First, install Supervisor.

sudo apt-get install supervisor

Then add a Gunicorn config file for Supervisor.

sudo vim /etc/supervisor/conf.d/gunicorn.conf

Make it like this:

[program:gunicorn]
command=/opt/project_env/bin/gunicorn --bind 0.0.0.0 project_project.wsgi
directory=/vagrant/project_project
user=vagrant
autostart=true
autorestart=true
stderr_logfile=/var/log/gunicorn.err.log
stdout_logfile=/var/log/gunicorn.out.log

The --bind 0.0.0.0 option is to make sure we can access Gunicorn directly, without Nginx, from the host during development.

Restart the Supervisor.

sudo service supervisor restart

The End

To be able to use project.dev on your host (not in your Vagrant machine), you need to associate the Vagrant machine’s IP address, 192.168.33.10, with project.dev. On Mac or Linux (on the host, not in Vagrant), do this:

sudo vim /etc/hosts

Add this line:

192.168.33.10 project.dev

From your browser, open http://project.dev to see your Django application served through Nginx, and http://project.dev:8000 to see it served directly by Gunicorn.

You should see something like this:

Initial Django First Page

To install Django on IIS, you can take a look at a blog post on Toptal.