[Machine learning at a glance] 1.1 - So what is 'learning'? (1)



The most intuitive way to understand machine learning is to be a machine.

1.    So what is ‘learning’?

 

Imagine you are a machine.
Really try to think like one: one morning you woke up, and you were a machine.
 
As this machine, you have a mission: classifying spam emails.
Out of thousands of emails, you must pick out only the spam.


What should you do?


The first and easiest method you could come up with is to hunt down emails whose titles contain certain words. Let’s try it.


Question 1

You are a machine that follows the command below.


Command: Select the emails which include any of the following words.
                   { “discount”, “deal”, “free” }




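Result: emails (1), (2), and (3) are selected; email (4) is not. (The four email titles are listed under Question 3.)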




As you can see, there are some problems.

First, the business email (3) is classified as spam because of the word “Deal”.
However, if you delete “Deal” from the word list, the spam email (2) is no longer classified as spam.
Second, the spam email (4) isn’t classified as spam, because it contains none of the listed words.
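
The command in Question 1 can be sketched in a few lines of Python. This is only a minimal sketch, not the lecture's own implementation: the names `SPAM_WORDS`, `TITLES`, and `is_spam_v1` are made up for illustration, the four titles are the ones listed under Question 3, and matching is assumed to be a case-insensitive substring search.

```python
# Hypothetical sketch of the Question 1 command (a rule-based filter).
SPAM_WORDS = {"discount", "deal", "free"}

# The four email titles used in this lecture (see Question 3).
TITLES = [
    "Online Casino 800% return Free entrance limited time!",
    "50% OFF every products overnight shipping LAST Deal!",
    "Modifications of sales report related to Korea Deal",
    "Just 24 Hours Left to Get 90% Off Premium Upgrade!",
]


def is_spam_v1(title):
    """Select the title if it contains any listed word (assumed case-insensitive)."""
    lowered = title.lower()
    return any(word in lowered for word in SPAM_WORDS)


results_v1 = [is_spam_v1(title) for title in TITLES]

for number, (title, selected) in enumerate(zip(TITLES, results_v1), start=1):
    verdict = "spam" if selected else "not spam"
    print(f"({number}) {verdict}: {title}")
```

Under these assumptions the sketch selects (1), (2), and (3) and skips (4), which reproduces both problems described above.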

Now let’s try a more advanced way.


Question 2

You are a machine that follows the commands below.


Command 1: Select the emails which include any of the following words.
            {“discount”, “free”, “premium”}


Command 2: If you find “Deal” in the title, then follow these orders:

         I. Do not select the email if its title also includes any of the following words.
           {“Business”, “Report”, “Sales”}

        II. Otherwise, select the email.




Result: emails (1), (2), and (4) are selected; the business email (3) is not.



Yay! You have now classified all four emails correctly.
But do you really think these two commands will also classify thousands more emails correctly? Realistically, no.
 
For instance, if spam email (1) replaces “Free” with “for nothing”, it slips past this filter.
Meanwhile, another email titled “Modifications WRT Korea Deal” is wrongly classified as spam, because it includes “Deal” but none of the words in command 2-I.
 
In the real world, spammers keep studying updates to spam filters and adapting their methods so that their advertisements still get through.
Therefore, it doesn’t seem realistic to solve spam classification by piling up more and more commands with countless conditions.
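
To see how brittle the two commands are, here is a hedged Python sketch of Question 2. The function and variable names are hypothetical, matching is again assumed to be a case-insensitive substring search, and the "for nothing" title is simply email (1) with “Free” swapped out, as described above.

```python
# Hypothetical sketch of the two commands in Question 2.
SELECT_WORDS = {"discount", "free", "premium"}    # Command 1
BUSINESS_WORDS = {"business", "report", "sales"}  # Command 2-I


def contains_any(title, words):
    """Return True if the title contains any of the words (case-insensitive)."""
    lowered = title.lower()
    return any(word in lowered for word in words)


def is_spam_v2(title):
    # Command 1: select titles that contain a listed word.
    if contains_any(title, SELECT_WORDS):
        return True
    # Command 2: "Deal" means spam unless a business word is also present.
    if "deal" in title.lower():
        return not contains_any(title, BUSINESS_WORDS)
    return False


TITLES = [
    "Online Casino 800% return Free entrance limited time!",
    "50% OFF every products overnight shipping LAST Deal!",
    "Modifications of sales report related to Korea Deal",
    "Just 24 Hours Left to Get 90% Off Premium Upgrade!",
]
results_v2 = [is_spam_v2(title) for title in TITLES]

# The two failure cases from the text.
evasive_spam = "Online Casino 800% return for nothing entrance limited time!"
business_mail = "Modifications WRT Korea Deal"
print(results_v2, is_spam_v2(evasive_spam), is_spam_v2(business_mail))
```

The four original titles come out right, but the reworded spam is missed and the business email is flagged. Patching either case means adding yet another hand-written condition.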



 



This is like an inexperienced novice at work; let’s call him Mr. Kim.
His immediate superior has to assign him work with step-by-step instructions. When he encounters a situation that differs even slightly from the guidelines he was given, he runs to his superior and complains, “I don’t know how to do this, please solve it for me!”
Then the superior may think,


‘Why can’t he do this simple job by himself?’


Now let’s look at Mr. Yun, a.k.a. the troubleshooter, in the same situation.
The head of the department pats him on the shoulder and says,

“I totally believe in you, Mr. Yun.”

He always manages to deliver results before the deadline without being told how to do things step by step, as Mr. Kim had to be. We call this ability ‘flexibility’ or ‘applicability.’

 

Keeping this in mind, let’s return to the spam email problem.
How could Mr. Yun solve this problem by himself? Not by making a step-by-step list and modifying or adding conditions whenever an error occurs.

Finally, today’s subject, ‘learning’, can be illustrated clearly here.
Let’s stop thinking like a machine for now, and try the following question.




Question 3

As a human, use your own judgment to deal with the following case.


Find the spam emails in the [Data] section, and write their numbers in the [Label] section.


Data
(1) Online Casino 800% return Free entrance limited time!
(2) 50% OFF every products overnight shipping LAST Deal!
(3) Modifications of sales report related to Korea Deal
(4) Just 24 Hours Left to Get 90% Off Premium Upgrade!

Label

(There is no grading here, because you have just created the solutions yourself.)



What you just did is called “labeling”: attaching a label to each piece of data. More precisely, you marked the correct answer (spam or not spam) for each of the four email titles (the data).

Index | Email title | Label*
1 | Online Casino 800% return Free entrance limited time! | 1
2 | 50% OFF every products overnight shipping LAST Deal! | 1
3 | Modifications of sales report related to Korea Deal | 0
4 | Just 24 Hours Left to Get 90% Off Premium Upgrade! | 1

* 1: spam, 0: not spam
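
In code, labeled data like this is commonly stored as (data, label) pairs. The sketch below shows one possible layout in Python; the name `LABELED_EMAILS` and the list-of-tuples structure are assumptions for illustration, not something the lecture prescribes.

```python
# Each entry pairs a piece of data (the title) with its label.
# Label convention from the table: 1 = spam, 0 = not spam.
LABELED_EMAILS = [
    ("Online Casino 800% return Free entrance limited time!", 1),
    ("50% OFF every products overnight shipping LAST Deal!", 1),
    ("Modifications of sales report related to Korea Deal", 0),
    ("Just 24 Hours Left to Get 90% Off Premium Upgrade!", 1),
]

# Splitting the pairs makes it easy to hide the labels later on.
titles = [title for title, _ in LABELED_EMAILS]
labels = [label for _, label in LABELED_EMAILS]

spam_count = sum(labels)
print(f"{spam_count} of {len(labels)} titles are labeled as spam")
```

Keeping titles and labels in separate lists is a design choice that pays off in the learning step: predictions are made from `titles` alone, and `labels` are consulted only for grading.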


Let me tell you a story. Emily is working through the problems in her Korean-language workbook. Without the solutions, she cannot know whether her answers are correct. Only with the solutions can she grade herself, by comparing her ‘predicted answers’ to the ‘correct answers’ they list.

When her answer is correct, she happily moves on; when it is wrong, she thinks about why she got it wrong and adjusts her thinking on that point. This is her algorithm:

 

     Emily’s learning algorithm

1.    Hide the solutions.

2.    Solve the problems in her workbook. (Prediction stage)

3.    Compare her answers to the correct answers on the solution page.

4.    If an answer is correct, she keeps going;
     if it’s incorrect, she adjusts her thinking toward the correct answer.

5.    Repeat this process.



 This is the basic notion of learning.

If it’s right, keep going; if it’s wrong, adjust toward the right answer.

 

Now let’s map this process onto a machine. As an example, suppose you have the four pieces of labeled data from above.

When you put this labeled data into a ‘learning machine’, the machine executes the following process.

 

    Machine’s learning algorithm

1.    Set the labels (the solutions) aside.

2.    Make predictions based on the data.

3.    Compare the machine’s predictions to the correct answers given by the labels.

4.    If a prediction is correct, keep going;
     if it’s wrong, adjust the predictions toward the correct answer.

5.    Repeat this process.
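
The five steps above can be sketched in Python. The lecture does not name a specific algorithm, so this sketch swaps in the classic perceptron update as one possible way to "adjust toward the correct answer": every word gets a weight, and the weights of the words in a misclassified title are nudged up or down. All names (`tokenize`, `predict`, `train`) and the word-weight representation are assumptions for illustration.

```python
import re
from collections import defaultdict

LABELED_EMAILS = [
    ("Online Casino 800% return Free entrance limited time!", 1),
    ("50% OFF every products overnight shipping LAST Deal!", 1),
    ("Modifications of sales report related to Korea Deal", 0),
    ("Just 24 Hours Left to Get 90% Off Premium Upgrade!", 1),
]


def tokenize(title):
    """Split a title into lowercase words (letters only)."""
    return re.findall(r"[a-z]+", title.lower())


def predict(weights, bias, title):
    """Steps 1-2: predict from the title alone; the label is never seen here."""
    score = bias + sum(weights[word] for word in tokenize(title))
    return 1 if score > 0 else 0


def train(examples, max_rounds=10):
    weights = defaultdict(float)  # every word starts with weight 0
    bias = 0.0
    for _ in range(max_rounds):               # Step 5: repeat
        mistakes = 0
        for title, label in examples:
            prediction = predict(weights, bias, title)
            error = label - prediction        # Step 3: compare with the label
            if error != 0:                    # Step 4: wrong, so adjust
                for word in tokenize(title):
                    weights[word] += error    # move toward the correct answer
                bias += error
                mistakes += 1
        if mistakes == 0:                     # everything right: stop early
            break
    return weights, bias


weights, bias = train(LABELED_EMAILS)
predictions = [predict(weights, bias, title) for title, _ in LABELED_EMAILS]
print(predictions)
```

Notice that nobody wrote a rule about “Deal” or “Premium” here; the word weights came out of the compare-and-adjust loop. With only four examples this is a toy: it reproduces the four labels, but far more data would be needed before trusting it on new emails.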



Just like Emily, the machine learns by comparing its predictions to the labeled correct answers. In other words,


If it’s right, keep going; if it’s wrong, adjust toward the right answer.

This is machine learning.






Today you turned into a machine and tried to solve the spam classification problem. At first, you tried to write logical rules one by one for each case, which was Mr. Kim’s method. This is called the ‘rule-based’ method.
 

After running into many problems, you switched to the ‘learning’ method, which was troubleshooter Mr. Yun’s method.
For a problem like this, where the cases keep changing, machine learning wins decisively!
There are countless other cases out there where machine learning is making an enormous difference.



Can you tell what ‘learning’ means now?



In this lecture, you successfully ‘labeled’ data and began to understand a bit of the learning process. In the next lecture, we’ll learn about the process of learning and prediction, and about the training set and the test set.
 
After Chapter 1, ‘Big picture at a glance’, we’ll look into the algorithms of machine learning, using the same method as today: thinking like a machine and solving problems.
Thank you very much. See you in the next lecture.





Thanks to my English editor Emily Adam
