# Translation of Andrew Ng's book "Machine Learning Yearning", Chapters 15-17

15 Simultaneous evaluation of several ideas during error analysis

Your team has several ideas for improving the cat detector in your application:

Fix the problem of your algorithm recognizing dogs as cats.

Fix the problem of your algorithm recognizing large wild cats (lions, panthers, etc.) as house cats (pets).

Improve the system's performance on fuzzy (blurry) images.

You can evaluate all of these ideas at the same time. I usually create a table and fill it in while looking through about 100 misclassified examples from the validation (dev) sample. I also jot down brief comments that help me remember specific examples later. To illustrate the process, here is a table you might produce from a small set of four misclassified validation examples:

| Picture | Dogs | Big cat | Fuzzy | Comments |
|---|---|---|---|---|
| 1 | x | | | Pitbull of unusual color |
| 2 | | | x | |
| 3 | | x | x | A lion; photo taken at the zoo on a rainy day |
| 4 | | x | | Panther behind a tree |
| Share (%) | 25% | 50% | 50% | |

Picture 3 in the table above falls into both the Big cat and the Fuzzy categories. Because one image can be assigned to several error categories, the percentages in the bottom row do not have to add up to 100%.
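The bookkeeping behind such a table can be sketched in a few lines of Python. This is only an illustrative sketch: the `tagged_errors` dictionary and the `category_shares` helper are hypothetical names, and the tags follow the four-image example, with image 2 assumed to be tagged Fuzzy so that the shares match the bottom row.

```python
from collections import Counter

# Hypothetical hand-tagged errors: image id -> set of error categories.
# Image 2 is assumed to be Fuzzy so that the shares match the example table.
tagged_errors = {
    1: {"Dogs"},
    2: {"Fuzzy"},
    3: {"Big cat", "Fuzzy"},
    4: {"Big cat"},
}

def category_shares(tags):
    """Return, per category, the percentage of misclassified images carrying that tag."""
    counts = Counter(cat for cats in tags.values() for cat in cats)
    total = len(tags)
    return {cat: 100.0 * n / total for cat, n in counts.items()}

shares = category_shares(tagged_errors)
for cat, pct in sorted(shares.items()):
    print(f"{cat}: {pct:.0f}%")

# The shares may sum to more than 100 because one image can carry several tags.
print("sum of shares:", sum(shares.values()))
```

Keeping the tags as a set per image makes it cheap to add a new category later and recompute the shares.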

Although you may start with a fixed set of error categories (Dogs, Big cat, Fuzzy), you may decide to add new ones while manually sorting the misclassified examples into them. For example, suppose you look through a dozen images and notice that the classifier makes many mistakes on Instagram pictures with color filters applied. You can go back, add an "Instagram" column to the table, and re-categorize the errors with this category in mind. By manually examining the examples the algorithm gets wrong and asking yourself how you, as a human, managed to label the image correctly, you will often spot new error categories and, possibly, be inspired to find new solutions.

The most useful error categories are those for which you have an idea for improving the system. For example, the "Instagram" category is most useful if you have an idea for how to remove the filters and recover the original image. But you do not have to restrict yourself to categories you already know how to fix; the goal of error analysis is to build your intuition about the most promising areas to focus on.

Error analysis is an iterative process. Do not worry if you start without any categories in mind. After looking at a couple of images, you will have a few ideas for error categories. After manually categorizing some images, you may think of new categories and re-examine the errors in light of them, and so on.

Suppose you finish error analysis on 100 misclassified examples from the validation sample and get the following:

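| Picture | Dogs | Big cat | Fuzzy | Comments |
|---|---|---|---|---|
| 1 | X | | | Pitbull of unusual color |
| 2 | | | X | |
| 3 | | X | X | A lion; photo taken at the zoo on a rainy day |
| 4 | | X | | Panther behind a tree |
| ... | | | | |
| Share (%) | 8% | 43% | 61% | |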

You now know that a project to stop dogs from being misclassified as cats will eliminate at most 8% of the errors. Working on the Big cat or Fuzzy categories could eliminate far more errors, so you might pick one of these two categories to focus on. If your team has enough people to pursue several directions in parallel, you can ask some engineers to work on big cats and others on fuzzy images.

Error analysis does not produce a rigid mathematical formula that tells you which task deserves the highest priority. You also have to weigh the progress you expect from working on each error category against the effort that work will require.
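To make the "at best" reasoning concrete, here is a small Python sketch of the ceiling each category puts on the possible improvement. The overall error rate of 10% is an assumption made purely for illustration (the text does not state it here), and `best_case_error` is a hypothetical helper name.

```python
def best_case_error(overall_error, category_share):
    """Error rate that would remain if every mistake in one category were fixed.

    overall_error  -- current error rate on the validation sample, e.g. 0.10
    category_share -- fraction of the errors that fall into the category, e.g. 0.08
    """
    return overall_error * (1.0 - category_share)

overall_error = 0.10  # assumed value, for illustration only
shares = {"Dogs": 0.08, "Big cat": 0.43, "Fuzzy": 0.61}  # from the table

# Print the categories from the largest to the smallest share of errors.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    remaining = best_case_error(overall_error, share)
    print(f"{name}: at best {overall_error:.1%} -> {remaining:.1%}")
```

Because one image can belong to several categories, these ceilings overlap and cannot simply be added together. They also say nothing about the effort each project would take.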

16 Cleaning up mislabeled examples in the validation and test samples

During error analysis, you may notice that some examples in your validation sample are mislabeled (assigned to the wrong class). By "mislabeled" I mean that the image was already labeled incorrectly by a human labeler before the algorithm ever saw it; that is, in the example (x, y), the wrong value was given for y. For example, some pictures that contain no cat may be labeled as containing a cat, and vice versa. If you suspect that the share of mislabeled examples is significant, add a category to keep track of them:

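| Picture | Dogs | Big cat | Fuzzy | Mislabeled | Comments |
|---|---|---|---|---|---|
| ... | | | | | |
| 98 | | | | X | Labeler missed a cat in the background |
| 99 | | X | | | |
| 100 | | | | X | Drawing of a cat (not a real cat) |
| Share (%) | 8% | 43% | 61% | 6% | |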

Is it worth correcting the mislabeled examples in your validation sample? Remember that the purpose of the validation sample is to help you evaluate algorithms quickly, so that you can tell whether algorithm A or B is better. If the mislabeled fraction of the validation sample prevents you from making such judgments, then it is worth spending time to fix the labels.

For example, suppose your classifier performs as follows:

Overall accuracy on the validation sample: 90% (10% overall error)

Error due to mislabeled examples: 0.6% (6% of the overall validation error)

Error due to other causes: 9.4% (94% of the overall validation error)

Here, the 0.6% error due to mislabeling may not be significant relative to the 9.4% of errors you could be improving. There is no harm in manually fixing the mislabeled examples in the validation sample, but it is not critical: it is probably fine not to know whether your system's true overall error is 9.4% or 10%.

Suppose you keep improving the cat classifier and reach the following performance:

Overall accuracy on the validation sample: 98% (2% overall error)

Error due to mislabeled examples: 0.6% (30% of the overall validation error)

Error due to other causes: 1.4% (70% of the overall validation error)

Now 30% of your error is due to mislabeled validation images, which adds significant error to your accuracy estimates. In this case, it is worth improving the quality of the labels in the validation sample. Fixing the mislabeled examples will help you find out whether your classifier's error is closer to 1.4% or to 2%, which is a significant relative difference.
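The arithmetic behind the two scenarios can be checked with a short Python sketch. The function name `mislabel_fraction` is hypothetical; the numbers are the ones used in the text.

```python
def mislabel_fraction(total_error, mislabel_error):
    """Fraction of the total validation error that is caused by wrong labels."""
    return mislabel_error / total_error

MISLABEL_ERROR = 0.006  # 0.6% of the validation examples are mislabeled

# First scenario: 10% overall error. Second scenario: 2% overall error.
for total_error in (0.10, 0.02):
    frac = mislabel_fraction(total_error, MISLABEL_ERROR)
    other = total_error - MISLABEL_ERROR
    print(f"overall error {total_error:.0%}: "
          f"{frac:.0%} of it is due to labels, {other:.1%} is due to other causes")
```

The absolute labeling error stays at 0.6% in both cases; only its share of the overall error changes as the classifier improves.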

It is not uncommon for mislabeled images in the validation or test sample to start drawing your attention only after your system has improved so much that the share of errors caused by mislabeled examples has grown relative to the overall error on these samples.

The previous chapter explained how to improve on error categories such as Dogs, Big cat and Fuzzy through algorithmic improvements. In this chapter, you learned that you can also work on the mislabeled category, by improving the labels of the data.

Whatever process you apply to fixing the labels of the validation sample, remember to apply it to the labels of the test sample too, so that your validation and test samples continue to come from the same distribution. Applying the same process to both prevents the problem discussed in Chapter 6, where your team optimizes the algorithm's performance on the validation sample only to realize later that it is being judged against a test sample that differs from the validation sample.

If you decide to improve label quality, consider double-checking both the labels of examples your system misclassified and the labels of examples it classified correctly. It is possible that both the original label and your learning algorithm were wrong on the same example. If you fix only the labels of examples your system misclassified, you may introduce bias into your evaluation. If you have 1,000 validation examples and your classifier is 98.0% accurate, it is easier to examine the 20 misclassified examples than all 980 correctly classified ones. Because it is easier in practice to check only the misclassified examples, bias does creep into some validation samples. This bias is acceptable if you are only interested in building a product or application, but it is a problem if you plan to use the result in an academic research paper or need a completely unbiased measure of test-sample accuracy.
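One possible compromise, which the text does not prescribe, is to re-check all misclassified examples together with a random sample of the correctly classified ones. The sketch below assumes predictions and labels are available as plain Python lists; `audit_sample` is a hypothetical helper name.

```python
import random

def audit_sample(predictions, labels, n_per_group, seed=0):
    """Pick indices to re-check from both misclassified and correctly classified examples."""
    wrong = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]
    right = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p == y]
    rng = random.Random(seed)  # fixed seed so the audit list is reproducible
    return (
        rng.sample(wrong, min(n_per_group, len(wrong))),
        rng.sample(right, min(n_per_group, len(right))),
    )

# Toy data matching the example: 1,000 validation examples, 98.0% accuracy.
labels = [1] * 1000
predictions = [0] * 20 + [1] * 980

wrong_ids, right_ids = audit_sample(predictions, labels, n_per_group=20)
print(len(wrong_ids), "misclassified and", len(right_ids), "correctly classified examples to re-check")
```

Sampling the correctly classified group only estimates how many hidden labeling errors it contains; it does not remove the bias completely.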

17 If you have a large validation sample, split it into two subsets, only one of which you look at

Suppose you have a large validation sample of 5,000 examples on which the error rate is 20%. Your algorithm thus misclassifies about 1,000 validation images. Manually examining 1,000 images takes a long time, so you might decide not to use all of them for error analysis.

In this case, I would explicitly split the validation sample into two subsets, one of which you look at and one of which you do not. You will overfit more rapidly to the portion that you examine manually. You can use the portion you do not examine manually to tune parameters.

Let's continue the example above, in which the algorithm misclassifies 1,000 of the 5,000 examples in the validation sample. Suppose you want to examine about 100 errors manually (10% of all the errors in the validation sample). You should randomly select 10% of the validation sample and put it into what we will call the "Eyeball validation sample" (Eyeball dev set), so named to remind ourselves that we look at these examples with our own eyes.

Translator's note: in my view, the term "Eyeball sample" does not sound good (especially in Russian). But with all due respect to Andrew (and given that I could not think of anything better), I will keep it.

(For a speech recognition project, in which you would be listening to audio clips, you might call this subset an "Ear validation sample" instead.) The Eyeball validation sample therefore has 500 examples, of which we would expect about 100 to be misclassified. The second subset of the validation sample, called the "Blackbox validation sample" (Blackbox dev set), will contain the remaining 4,500 examples. You can use the Blackbox validation sample to evaluate classifiers automatically by measuring their error rates. You can also use it to choose between algorithms or to tune hyperparameters. However, you should avoid looking at its examples with your own eyes. We use the term "Blackbox" because we will use this subset only as a "black box" (translator's note: that is, an object whose internal structure is unknown to us) to evaluate classifiers.
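The split described here can be sketched in Python as follows. This is a minimal sketch that assumes the validation examples can be referred to by integer ids; `split_dev_set` and its parameters are hypothetical names.

```python
import random

def split_dev_set(example_ids, eyeball_fraction=0.10, seed=0):
    """Randomly split validation example ids into an Eyeball part and a Blackbox part."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_eyeball = round(len(ids) * eyeball_fraction)
    return ids[:n_eyeball], ids[n_eyeball:]

eyeball, blackbox = split_dev_set(range(5000))
print(len(eyeball), len(blackbox))  # 500 4500

# With a 20% error rate, about 100 of the 500 Eyeball examples should be misclassified.
expected_eyeball_errors = len(eyeball) * 0.20
print(expected_eyeball_errors)
```

Storing the two id lists (or the seed) matters: the split has to stay fixed so that the Blackbox part is never inspected by accident.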

Why do we explicitly split the validation sample into an Eyeball validation sample and a Blackbox validation sample? Because you will gain intuition about the examples in the Eyeball validation sample, you will start to overfit to it faster. The Blackbox validation sample helps you detect this overfitting. If performance on the Eyeball validation sample improves much more rapidly than performance on the Blackbox validation sample, you have probably overfit to the Eyeball validation sample. In that case, you may need to discard it and build a new one, either by moving more examples from the Blackbox validation sample into the Eyeball validation sample or by acquiring new labeled data.

Thus, splitting the validation sample into an Eyeball validation sample and a Blackbox validation sample lets you tell when manual error analysis is causing you to overfit to the Eyeball portion of your data.
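A simple way to watch for this is to compare how much the error has dropped on each subset. The sketch below is only illustrative: the error histories are made-up numbers, `eyeball_overfit_gap` is a hypothetical helper, and the threshold is an arbitrary assumption rather than a value from the text.

```python
def eyeball_overfit_gap(eyeball_errors, blackbox_errors):
    """Improvement on the Eyeball subset minus improvement on the Blackbox subset.

    Each argument is a list of error rates measured after successive iterations.
    A large positive gap suggests overfitting to the Eyeball subset.
    """
    eyeball_gain = eyeball_errors[0] - eyeball_errors[-1]
    blackbox_gain = blackbox_errors[0] - blackbox_errors[-1]
    return eyeball_gain - blackbox_gain

# Made-up error rates over four development iterations.
eyeball_history = [0.20, 0.16, 0.12, 0.09]
blackbox_history = [0.20, 0.18, 0.17, 0.17]

gap = eyeball_overfit_gap(eyeball_history, blackbox_history)
THRESHOLD = 0.03  # arbitrary cut-off, chosen only for this illustration

if gap > THRESHOLD:
    print(f"gap of {gap:.2f}: probably overfitting to the Eyeball subset")
else:
    print(f"gap of {gap:.2f}: no clear sign of overfitting")
```

In practice the acceptable gap depends on the sizes of the two subsets, since a small Eyeball subset gives noisier error estimates.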