Imagine yourself as a teacher and consider this marking method. You give your students a list of a hundred questions to prepare for a test. The students then sit a test in which you ask five questions. The results are not up to your expectations. You see that almost everybody answered questions 1 and 3 more or less correctly, while most could not solve questions 2, 4, and 5. You give the answer sheets back to the students to review their performance. You then give another test with the same five questions. This time they do better than the first. Even on the questions they answered correctly the first time, they write better answers. Imagine repeating this ten times. By the tenth iteration they will likely have done significantly better than on the first attempt. Is this a good method to evaluate the students' understanding?
Now consider this scenario. You are a computer engineer and have built AI software that can say whether an input picture shows a cat or a dog. You show the software 100 pictures of dogs and 100 pictures of cats. After it has seen (processed) these pictures, you show it ten previously unseen pictures of cats and dogs. The new pictures are different from the old ones: in some, dogs are sitting under a tree, whereas there were no trees in the old pictures, and some contain previously unseen breeds. Naturally, the software fails on some of the new pictures, identifying dogs as cats and vice versa. You then tell the software that it failed on N% of the pictures (or succeeded on 100-N%). It makes some adjustments, and in the next iteration the performance might improve. Imagine repeating this ten times. By the tenth iteration the performance will likely be significantly better than on the first attempt. Is this a good method to evaluate the software's understanding?
Most, if not all, will say that the method used to evaluate the students is measuring something other than understanding. Of course, they came to the test prepared with some kind of understanding, but the adaptation does not rely on understanding. The method evaluates how well the students have adapted to their previous tests. Now reconsider the second scenario. The software is doing something very similar to what the students were doing: adapting to the percentage of labels (answers) it got wrong. Given enough time (iterations), the classification (labelling) software might eventually reach very high accuracy. This is called a feedback mechanism in engineering. If the software (or hardware, for that matter) can receive feedback, it can perform better the next time. Note that the software has never been shown the correct answers for the test pictures; all it knows is whether its performance improved or not.
The situation is like that of the student tests. The students have not seen the correct answers. They saw only their marks, and since the marks were less than full marks, they changed their answers in the next test. The software, likewise, does not need to understand what a dog is and what a cat is to perform well. All it is doing is adjusting its parameters (i.e. adapting) when it predicts wrongly. Accuracy therefore does not necessarily reflect understanding, and most researchers do not claim that it does. But the media and press do, and we get headlines about machines surpassing human intelligence. Here, we have to read "intelligence" as a synonym for some kind of accuracy metric. Accuracy measurement itself has a lot of problems, but that is for another post.
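To make this concrete, here is a minimal sketch of such blind adaptation. Nothing in it corresponds to any real system: the two "features", the linear threshold model, and the data are all made up. The only feedback the adaptation loop receives is a single accuracy number, yet that number climbs anyway.

```python
import random

random.seed(0)

# Made-up data: two "pixel features" per picture and a label,
# 0 for "cat" and 1 for "dog". Dog examples are shifted along the
# first feature so there is something learnable.
data = []
for _ in range(200):
    label = random.randint(0, 1)
    x1 = random.gauss(0, 1) + 1.5 * label
    x2 = random.gauss(0, 1)
    data.append(((x1, x2), label))

def accuracy(weights):
    """The single number fed back to the adaptation loop."""
    w1, w2, bias = weights
    correct = sum(
        (1 if w1 * x1 + w2 * x2 + bias > 0 else 0) == label
        for (x1, x2), label in data
    )
    return correct / len(data)

# Blind hill climbing: randomly nudge the parameters and keep the nudge
# only if the accuracy number goes up. No notion of "cat" or "dog" is
# involved anywhere in the update.
weights = [0.0, 0.0, 0.0]
best = accuracy(weights)
for _ in range(1000):
    candidate = [w + random.gauss(0, 0.1) for w in weights]
    score = accuracy(candidate)
    if score > best:
        weights, best = candidate, score

print(f"accuracy after blind adaptation: {best:.2f}")
```

The loop never learns what a cat or a dog is; it simply keeps whatever random tweak made the score go up, which is exactly the kind of adaptation an accuracy metric rewards.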
There has been tremendous improvement in the performance of commercial AI systems; the progress is there to see. Evaluating understanding, however, remains a difficult challenge. A child's understanding of a cat is different from an AI system's. AI vision systems have been fooled by the manipulation of one or a few pixels, invisible to the human eye. Change a few pixels and the system mistakes a cat for a bus. So what has it understood about what a cat is? What "true understanding" means is a matter of heated debate; there is no agreement. My worry is that, on the basis of impressive performance on a specific dataset, and training on ginormous datasets, some tech companies have hastened to make these systems commercially available. Adversarial attacks such as small changes to pixels can completely break a system. Clearly, human understanding is different from whatever a machine understands. If "true understanding" were to walk past us, we would not recognize it. When AI is responsible for deciding employment or sentencing in a court, people will ask for understanding to be specified in concrete terms. Right now, "understand" is misused as marketing terminology, and that is the noise we hear far more often. That noise blankets the good work some people are doing in trying to understand understanding.
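A postscript for the technically inclined: the pixel trick mentioned above has a very simple core. The sketch below is a toy illustration in the spirit of the fast gradient sign method; the six "pixels", the linear cat-versus-bus scorer, and the numbers are all invented, not taken from any deployed system.

```python
# Toy adversarial nudge: move each pixel about one grey level in the
# direction that pushes the score away from "cat". Invisible to a person,
# yet enough to flip this (made-up) classifier's answer.
image   = [0.51, 0.50, 0.50, 0.49, 0.50, 0.51]   # six hypothetical pixels in [0, 1]
weights = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0]      # toy linear scorer: score > 0 means "cat"

def score(pixels):
    return sum(w * p for w, p in zip(weights, pixels))

def label(pixels):
    return "cat" if score(pixels) > 0 else "bus"

# For a linear scorer, the gradient of the score with respect to each pixel
# is simply its weight, so stepping every pixel against the sign of its
# weight lowers the "cat" score as fast as possible.
epsilon = 1 / 255   # roughly one grey level
adversarial = [p - epsilon * (1 if w > 0 else -1) for p, w in zip(image, weights)]

print(label(image), round(score(image), 3))              # "cat"
print(label(adversarial), round(score(adversarial), 3))  # "bus", after a tiny change per pixel
```

Real attacks target far larger networks, but the principle is the same: a tiny, carefully chosen perturbation can push the input across the decision boundary even though, to a human, nothing about the picture has changed.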