
#### Introduction

I read a lot of deep learning papers, typically a few per week; I’ve probably read several thousand in total. My general problem with papers in machine learning and deep learning is that they often sit in a strange no man’s land between science and engineering, which I call “academic engineering”. Let me describe what I mean:

- A scientific paper, IMHO, should convey an idea that has the ability to explain something: for example, a paper that proves a mathematical theorem, or one that presents a model of some physical phenomenon. Alternatively, a scientific paper can be experimental, where the result of an experiment tells us something fundamental about reality. Either way, the central point of a scientific paper is a relatively concisely expressible idea with some nontrivial universality (and predictive power), or some nontrivial observation about the nature of reality.
- An engineering paper shows a method of solving a particular problem. Problems vary with the application; sometimes they are quite narrow and specific, yet still useful to somebody somewhere. For an engineering paper, what matters is different than for a scientific one: the universality of the solution may not be of paramount importance. What matters is that the solution works, that it can be practically implemented given available components, that it is cheaper or more energy efficient than other solutions, and so on. The central point of an engineering paper is an application, and the rest is a collection of ideas that allow that application to be solved.

Machine learning sits somewhere in between. There are examples of clearly scientific papers (such as the paper that introduced backpropagation itself) and examples of clearly engineering papers, where a solution to a very particular practical problem is described. But the majority appear to be engineering papers that engineer for a synthetic metric on a more or less academic dataset. In order to show superiority, some ad hoc trick is pulled out of nowhere (typically of extremely limited universality), and after some statistically insignificant testing, victory is announced.

There is also a fourth kind of paper, which does contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact, the heavy artillery of “academic engineering” is rolled out again, so that overall the paper looks impressive.

This happens to be the case for “An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution”, a recent piece from Uber AI Labs, which I will dissect in detail below.

#### Autopsy

So let’s jump into the content. Conveniently, we don’t even need to read the paper (though I did): we can just watch the video the authors have uploaded to YouTube:

The presentation of the clip is a bit strange given the context. The blooper in particular seems a bit out of line, but whatever. Perhaps I just don’t get their sense of humor.

OK, so what do we have here? The central thesis of the paper is the observation that convolutional neural networks do not perform well on tasks that require localization, or more precisely, tasks where the output label is a more or less direct function of the coordinates of some input entity and not of any other property of that input. So far so good. Convnets are indeed not well equipped for that: the collapsing hierarchy of feature maps, modeled after the Neocognitron (itself modeled after the ventral visual pathway), was almost designed to ignore where things are. The authors then propose a solution: add the coordinates as additional input maps in a convolutional layer.
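The proposed layer can be sketched in a few lines of NumPy: build two extra channels holding each pixel’s x and y coordinate and concatenate them to the input before convolving. The function name and the [-1, 1] normalization are my own illustrative assumptions, not the authors’ exact implementation.

```python
import numpy as np

def add_coord_channels(batch):
    """Append coordinate channels: (N, H, W, C) -> (N, H, W, C + 2)."""
    n, h, w, _ = batch.shape
    # Coordinate grids scaled to [-1, 1] (normalization is an assumption here)
    ys = np.linspace(-1.0, 1.0, h).reshape(1, h, 1, 1)
    xs = np.linspace(-1.0, 1.0, w).reshape(1, 1, w, 1)
    y_chan = np.broadcast_to(ys, (n, h, w, 1))
    x_chan = np.broadcast_to(xs, (n, h, w, 1))
    # Any subsequent convolution now sees position as an ordinary feature
    return np.concatenate([batch, x_chan, y_chan], axis=-1)

images = np.zeros((4, 70, 100, 3))
out = add_coord_channels(images)
print(out.shape)  # (4, 70, 100, 5)
```

Note that nothing is learned here: the coordinate channels are a fixed, hand-engineered input, which is exactly the point made below.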

Now that sounds extremely smart, but what the authors are in fact proposing is something any practitioner in the field would take for granted – they are adding a *feature* that is more suitable for decoding the desired output. Anyone doing practical work in computer vision sees nothing extraordinary in adding features, though it is the subject of a heated, purely academic debate in some deep learning circles, where researchers totally detached from any practical application argue that we should only use learned features because *that way is better*. So perhaps it is not a bad thing that the deep learning crowd is beginning to appreciate feature engineering…

Anyway, so they added a feature: the explicit values of the coordinates. They then test the performance on a toy dataset they call “Not-so-Clevr”, perhaps to outrun critics like me. So are their experiments clever? Let us see. One of the tasks at hand is to generate a one-hot image from coordinates, or conversely to generate the coordinates from a one-hot image. They show that adding the coordinates to the convolutional net does improve performance significantly. Perhaps this would be less shocking if, instead of jumping straight to TensorFlow, they had sat down and realized that one can explicitly construct a neural net that solves the one-hot-to-coordinate association without any training whatsoever. For this incredible task I will use three operations: convolution, nonlinear activation, and summation. Luckily all of these are basic components of a convolutional neural net. Let’s see:

```python
import scipy.signal as sp
import numpy as np

# Fix some image dimensions
I_width = 100
I_height = 70

# Generate input image
A = np.zeros((I_height, I_width))

# Generate random test position
pos_x = np.random.randint(0, I_width - 1)
pos_y = np.random.randint(0, I_height - 1)

# Put a pixel in a random test position
A[pos_y, pos_x] = 1

# Create what will be the coordinate features
X = np.zeros_like(A)
Y = np.zeros_like(A)

# Fill the X-coordinate value
for x in range(I_width):
    X[:, x] = x

# Fill the Y-coordinate value
for y in range(I_height):
    Y[y, :] = y

# Define the convolutional operators
op1 = np.array([[0, 0, 0], [0, -1, 0], [0, 0, 0]])
opx = np.array([[0, 0, 0], [0, I_width, 0], [0, 0, 0]])
opy = np.array([[0, 0, 0], [0, I_height, 0], [0, 0, 0]])

# Convolve to get the first feature map DY
CA0 = sp.convolve2d(A, opy, mode='same')
CY0 = sp.convolve2d(Y, op1, mode='same')
DY = CA0 + CY0

# Convolve to get the second feature map DX
CA1 = sp.convolve2d(A, opx, mode='same')
CX0 = sp.convolve2d(X, op1, mode='same')
DX = CA1 + CX0

# Apply half rectifying nonlinearity
DX[np.where(DX < 0)] = 0
DY[np.where(DY < 0)] = 0

# Subtract from a constant (extra layer with a bias unit)
result_y = I_height - DY.sum()
result_x = I_width - DX.sum()

# Check the result
assert pos_x == int(result_x)
assert pos_y == int(result_y)
print(result_x)
print(result_y)
```

Lo and behold: one h