To make my point in a subtle way, I have posed these three challenges over the years...
Challenge: image recognition
Take a look at image test collections intended to test image recognition algorithms. For example, Caltech101 makes image recognition too easy by placing the object in the middle of the image. This means that the photographer gives the computer a hand. Also, there is virtually no background – it’s either all white or very simple, like grass behind the elephant. Now consider the size of the collection: 101 categories, most with about 50 images. So far this looks too easy.
Let’s look at the pictures now. It turns out that there is a problem at the opposite extreme: the task is in fact too hard! The computer is supposed to see that the side view of a crocodile represents the same object as the front view. How? By "training". Suggested number of training images: 1, 3, 5, 10, 15, 20, 30...
The idea “training” (or "machine learning") is that you collect as much information about the image as possible and then let the computer sort it out by some sort of clustering. One approach is appropriately called “a bag of words” - patches in images are treated the way Google treats words in text, with no understanding of the content. You can only hope that you have captured the relevant information that will make image recognition possible. Since there is no understanding of what that relevant information is, there is no guarantee.
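To make the “bag of words” idea concrete, here is a toy sketch. The three-patch codebook is made up for illustration; real systems learn a codebook of hundreds of patches by clustering. The point is visible in the code itself: the output is only a histogram of which patches occur, with all spatial arrangement – all “content” – discarded.

```python
# Toy "bag of visual words": describe a binary image only by how often each
# codebook patch occurs, discarding all spatial arrangement.
# The codebook below is invented for illustration.

CODEBOOK = [
    ((0, 0), (0, 0)),  # "word" 0: flat patch
    ((1, 1), (1, 1)),  # "word" 1: solid patch
    ((0, 1), (0, 1)),  # "word" 2: vertical edge
]

def nearest_word(patch):
    """Index of the codebook patch closest in Hamming distance."""
    def dist(a, b):
        return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return min(range(len(CODEBOOK)), key=lambda i: dist(patch, CODEBOOK[i]))

def bag_of_words(image):
    """Histogram of codebook words over all 2x2 patches of the image."""
    hist = [0] * len(CODEBOOK)
    for r in range(len(image) - 1):
        for c in range(len(image[0]) - 1):
            patch = ((image[r][c], image[r][c + 1]),
                     (image[r + 1][c], image[r + 1][c + 1]))
            hist[nearest_word(patch)] += 1
    return hist
```

The histogram records which patches occur, not where – exactly the loss of understanding described above. Whether the discarded arrangement happened to be the relevant information is anyone’s guess.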
Then how come some researchers claim that their methods work? Good question. My guess is that by tweaking your algorithm long enough you can make it work with a small collection of images. Also, just looking at color distribution could give you enough information to “categorize” some images – in a very small collection, with very few categories.
One principle I'd suggest: first solve the problem for black-and-white images. Then you can continue with a low-level analysis – find the objects, their locations, sizes, shapes, etc. Once you’ve got those abstract objects, you can try to figure out what those objects represent. In fact, the "recognition" task is often very simple. For example, for a home security system you don’t need to detect faces to sound the alarm. The right range of sizes will do. A moving object larger than a dog and smaller than a car will trigger the alarm. All you need is a couple of sliders to set it up.
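The “couple of sliders” really is the whole program. A minimal sketch, with hypothetical threshold values that would be calibrated per camera:

```python
# The "two sliders" alarm: classify a detected moving object purely by its
# pixel area -- no face detection involved.
# Both thresholds are hypothetical and would be tuned for a given camera.

DOG_AREA = 500      # slider 1: largest area still ignored (a dog or smaller)
CAR_AREA = 20000    # slider 2: smallest area dismissed as a vehicle

def should_alarm(object_area):
    """Trigger only for objects bigger than a dog and smaller than a car."""
    return DOG_AREA < object_area < CAR_AREA
```

No training, no feedback, and the behavior of the system is completely transparent to whoever moves the sliders.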
Challenge: learning addition
The approach to machine learning is generally as follows. Imagine you have a task that people do easily but you want the computer to do it. Suppose you don’t understand exactly how they do it – certainly not enough to be able to design an algorithm. Then you solve the problem in these three "simple" steps:
- You set up a program that supposedly behaves like the human brain (something you don’t really understand),
- you teach it how to do the task by providing nothing but feedback (because you don’t understand how it’s done),
- the program runs pattern recognition and solves the problem for you.
My thinking is, if you can teach a computer to recognize objects, you can teach it simpler things! How about teaching a computer how to add – based entirely on feedback.
This would be a better experiment than image recognition because
- it is simpler and faster,
- the feedback is unambiguous,
- the ability to add is easily verifiable.
Essentially you supply the computer with all sums of all pairs of numbers from 0 to, say, 99 and then see if it can compute 100 + 100.
Numerically this is of course easy, but symbolically, I believe, impossible.
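Here is the experiment as a sketch. The “learner” below does nothing but memorize the feedback it is given, as symbols. It reproduces every training sum perfectly, yet has nothing whatsoever to say about 100 + 100 – the feedback alone never supplies the idea of “carry” or “successor”.

```python
# A "learner" that does nothing but memorize feedback, symbol for symbol.
# Train it on the sums of all pairs 0..99, then ask about 100 + 100.

def train(limit):
    table = {}
    for a in range(limit + 1):
        for b in range(limit + 1):
            table[(str(a), str(b))] = str(a + b)   # stored as symbols only
    return table

def answer(table, a, b):
    return table.get((a, b))  # None: no pattern to fall back on

model = train(99)
print(answer(model, '57', '42'))    # '99' -- seen in training
print(answer(model, '100', '100'))  # None -- memorization generalizes to nothing
```

Any generalization beyond the table would have to come from structure built in by the programmer – interpolation, positional notation, a successor function – which is precisely the understanding the learner was supposed to acquire on its own.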
To see why, consider an even simpler task:
- Given a function with f('1','1') = '2', can the computer figure out that f('1', '2') = '3'?
Of course, if the computer is "aware" that these are numbers, a simple interpolation will produce the desired result. However, the computer will treat these as symbols, and where would the idea of “is bigger than” or “is the successor of” come from if not from the person who creates the model?
As for trying to imitate human vision, here are my thoughts. We simply don't know how a person looking at an apple forms the word ‘apple’ in his brain. This is also part of another pattern – trying to emulate nature to create new technology. The idea is very popular, but when has it ever been successful? Do cars have legs? Do planes flap their wings? Do ships have fins? What about the light bulb, the radio, the phone? One exception may be stereo vision...
Challenge: counting objects in an image
Let's consider a very simple computer vision problem: given an image, find out whether it contains one object or more.
The problem is of practical importance (see Image analysis examples).
Let’s assume that the image is binary so that the notion of “object” is unambiguous. Essentially, you have one object if any two 1’s can be connected by a sequence of adjacent 1’s. Anyone with a minimal familiarity with computer vision (some even without) can write a program that solves this problem. But that’s irrelevant here because the computer is supposed to learn on its own, as follows.
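For reference, this is the program anyone with minimal familiarity can write – the one the machine learner is supposed to discover on its own. A sketch using flood fill, with 4-directional adjacency (8-directional is a one-line change):

```python
# Counting objects in a binary image: two 1's belong to the same object
# when they can be joined through a chain of adjacent 1's.

def count_objects(image):
    rows, cols = len(image), len(image[0])
    seen = set()

    def flood(r, c):
        """Mark every 1 reachable from (r, c) as visited."""
        stack = [(r, c)]
        while stack:
            r, c = stack.pop()
            if (r, c) in seen or not (0 <= r < rows and 0 <= c < cols):
                continue
            if image[r][c] != 1:
                continue
            seen.add((r, c))
            stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])

    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and (r, c) not in seen:
                count += 1          # a 1 we haven't visited starts a new object
                flood(r, c)
    return count
```

The image contains one object exactly when this returns 1. Note how the solution hinges entirely on adjacency – the very notion that, as discussed below, a general-purpose learner is never told about.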
- You have a computer with a general purpose machine learning program (meaning that no person provides insight into the nature of the problem).
- Then you show the computer images one by one and tell it which ones contain one object and which ones more than one.
- After you’ve shown enough images, the computer will gradually start to classify images on its own.
First, why "gradually"? Why not give the computer all the information at once and it instantly becomes good at the task? Well, the problem is that this way you can’t keep tweaking the algorithm. But I think the main reason why machine learning is popular is that everyone likes to teach. It’s fun to see your child/student/computer learn something new and become better and better at it. This is very human – and also totally misplaced.
My second question is: this may work, but what would guarantee that it will? More narrowly, what information about the image should be passed to the computer to ensure that the computer will start to succeed more than 50% of the time, sooner or later?
Option 1: we pass some information. Well, what if you pass information that cannot possibly help to classify the images the way we want? For example, you may pass just the value of the (1,1) pixel, or the number of 1’s, or 0’s, or their proportion. Who will make sure that the relevant information (i.e., adjacency) isn’t left out? The computer doesn’t know what is relevant – it hasn’t learned “yet”. If it’s the human, then he would have to solve the problem first, if not algorithmically then at least mathematically. Then the point of machine learning as a way to solve problems is lost.
Option 2: we pass all information. For example, the computer represents every 100x100 image as a point in the 10,000-dimensional space and then runs clustering or another pattern recognition method. Will this work? Will the one-object images form a cluster? Or maybe a hyperplane? One thing is clear: these images will be very close to the rest, because the difference between an image with one object and one with two may be just a single pixel (in other words, the set is dense).
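Both failure modes are easy to exhibit on tiny images (the object counts in the comments can be checked by eye). Below, two images share the same number of 1’s, so the Option 1 feature cannot separate them, and flipping a single pixel turns a one-object image into a two-object image, illustrating the density point.

```python
# Option 1 failure: the "number of 1's" feature is identical on both images,
# though the first contains one object and the second contains two.

def count_ones(image):
    return sum(sum(row) for row in image)

one_object  = [[1, 1, 1],
               [0, 0, 0]]   # a single bar: one object
two_objects = [[1, 0, 1],
               [0, 0, 1]]   # an isolated pixel plus a corner piece: two objects

assert count_ones(one_object) == count_ones(two_objects)  # feature is blind

# Option 2 (density): a single pixel flip changes the class.
flipped = [row[:] for row in one_object]
flipped[0][1] = 0            # [[1, 0, 1], [0, 0, 0]] -- now two objects
```

In the 10,000-dimensional picture, `one_object` and `flipped` are neighboring points that must land in different classes; no amount of clustering will pull them apart.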
BTW, this “simple” challenge of counting the number of objects may be posed for texts instead of images. For example, the computer is given a text describing a street and asked: how many houses, windows, trees, people?
My conclusion is, don’t apply machine learning anywhere software is expected to replace a human and solve a problem.
So, when can machine learning be useful? In cases where the problem can’t be, or hasn’t been, solved by a human. For example, a researcher is trying to find the cause of a certain phenomenon and there is a lot of unexplained data. Then – with some luck – machine learning may suggest to the researcher a way to proceed. Pattern recognition is a better name for it then.
The problem of collecting enough information from the image is also shared by CBIR (Content Based Image Retrieval).
See also Bad math.