### Background

While tutoring at the 2019 Deep Learning Indaba, I got to know the multi-talented Cinjon Resnik, who is currently doing his PhD with Kyunghyun Cho at NYU.

After the Indaba, Cinjon invited me to join an experiment he is running in distributed teaching and learning called Depth First Learning (DFL). One innovation that particularly resonated with me was the effort DFL makes to plot a path through “paper space” as a way to explain a core idea or story. The story we chose as the backbone for our DFL was that of the Wasserstein GAN (WGAN), which, if you haven’t met it, is an insightful twist on the ordinary Generative Adversarial Net that draws on many interesting ideas from probability theory and optimal transport.

### Our DFL Journey

Depth First Learning is as much about generating learning materials as it is about actually teaching the participants. Once generated, the curriculum serves as a series of guideposts for future students to recapitulate the original group’s journey, with the difficulty and pacing of each step calibrated against the group’s own learning curve. The recipe is still evolving, but our case is fairly typical: a distributed team of about 7 people, with a variety of backgrounds and skill levels, meeting regularly over the course of a month. We focused on one paper each week, culminating in an hour-long Google Hangouts session at the end of the week in which we could review and discuss, working our way through any questions we had built up. The sessions were chaired and coordinated by James Allingham, and recorded for later transcription.

Each paper gave us the foundation and intuition to understand the next, until we were ready to tackle the final WGAN-GP paper. It amounted to a kind of protracted, Socratic journal club, spiced up by mystery guests who joined us in the Hangout sessions to explain finer details, including Martin Arjovsky (a first author on the WGAN papers) and the researchers Tim Salimans and Ishaan Gulrajani. A real treat!

### The Result

You can find the finished product here. In case I appear to be taking the credit, let me emphasize that James did all the hard work — I merely asked the occasional question, shared my intuitions where I felt they could aid the others, and helped to transcribe some of the video we recorded.

### Mathematical Intuition

One major function of learning experiences like DFL is to give one enough time to build intuition about the mathematics involved. To that end, I coded up some visualizations of how various divergence and distance measures behave for some simple distributions. In all cases we use a normal distribution as a “probe”: the density plots show the distance between this probe distribution and the target distribution, as a function of the probe’s $\mu$ and $\sigma$. The Laplace approximations to the modes of the target are shown as red dots. These Laplace approximations are instructive because each one is a local normal approximation to a single mode, so we might expect the distance to reach a local minimum when the probe coincides with one of them.
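
To make the red dots concrete, here is a minimal sketch of how a Laplace approximation to each mode could be computed. It is not the code behind the plots: the target (an equal mixture of $N(-2, 1)$ and $N(2, 1)$), the grid search for the modes, and the names `log_target` and `laplace_approximations` are all assumptions made for illustration. At each mode, the fitted normal takes the mode as its mean and $\sigma = \sqrt{-1 / (\log p)''}$ as its standard deviation.

```python
import numpy as np

# Hypothetical target (an assumption, not taken from the original plots):
# an equal mixture of N(-2, 1) and N(2, 1).
WEIGHTS = np.array([0.5, 0.5])
MEANS = np.array([-2.0, 2.0])
STDS = np.array([1.0, 1.0])


def log_target(x):
    """Log-density of the mixture at x (scalar or array)."""
    x = np.asarray(x, dtype=float)[..., None]
    comps = WEIGHTS * np.exp(-0.5 * ((x - MEANS) / STDS) ** 2) / (STDS * np.sqrt(2 * np.pi))
    return np.log(comps.sum(axis=-1))


def laplace_approximations(lo=-8.0, hi=8.0, n=16001):
    """Return a (mean, std) pair for a normal fitted at each mode found on a grid."""
    xs = np.linspace(lo, hi, n)
    h = xs[1] - xs[0]
    logp = log_target(xs)
    # A mode is an interior grid point that is higher than both of its neighbours.
    is_peak = (logp[1:-1] > logp[:-2]) & (logp[1:-1] > logp[2:])
    idx = np.flatnonzero(is_peak) + 1
    # Second derivative of log p at each mode, by central differences.
    curvature = (logp[idx + 1] - 2 * logp[idx] + logp[idx - 1]) / h**2
    # Laplace approximation: mean = mode, variance = -1 / curvature.
    return [(float(xs[i]), float(np.sqrt(-1.0 / c))) for i, c in zip(idx, curvature)]


for mean, std in laplace_approximations():
    print(f"mode at mu = {mean:.3f}, Laplace sigma = {std:.3f}")
```

For this well-separated mixture, each approximation lands very close to one of the component normals; with more overlap between the components, the modes move towards each other and the fitted widths change.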

Here are two-dimensional plots of the KL divergence (KLD) between a single “probe” normal distribution and a mixture of two normals. The plots show the divergence as a function of the “probe” Gaussian’s mean and variance. You can clearly see that for a wide probe Gaussian, the lowest divergence happens when the probe is half-way between the two modes, whereas for a narrow probe, there is a local optimum at each mode (this relates to the “mode-dropping” behavior to which traditional GANs are vulnerable):
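
For readers who want to reproduce the qualitative behavior, the following is a minimal sketch rather than the code that produced the plots. It assumes the divergence is taken in the direction KL(probe ‖ target), which is the direction consistent with the behavior described above (the text does not state it explicitly), and it uses a hypothetical target of equal parts $N(-2, 1)$ and $N(2, 1)$. The function names and the simple Riemann-sum integration are likewise illustrative choices.

```python
import numpy as np


def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))


def target_logpdf(x):
    """Hypothetical target (an assumption): equal mixture of N(-2, 1) and N(2, 1)."""
    return np.logaddexp(
        np.log(0.5) + normal_logpdf(x, -2.0, 1.0),
        np.log(0.5) + normal_logpdf(x, 2.0, 1.0),
    )


def kl_probe_to_target(mu, sigma, n=2001, width=8.0):
    """Approximate KL(probe || target) with a Riemann sum over mu +/- width * sigma."""
    x = np.linspace(mu - width * sigma, mu + width * sigma, n)
    log_q = normal_logpdf(x, mu, sigma)
    integrand = np.exp(log_q) * (log_q - target_logpdf(x))
    return float(np.sum(integrand) * (x[1] - x[0]))


# Slices through the (mu, sigma) surface at one narrow and one wide probe width.
mus = np.linspace(-5.0, 5.0, 101)
narrow = np.array([kl_probe_to_target(m, 0.5) for m in mus])
wide = np.array([kl_probe_to_target(m, 3.0) for m in mus])

print("narrow probe: lowest KL at mu =", mus[np.argmin(narrow)])
print("wide probe:   lowest KL at mu =", mus[np.argmin(wide)])
```

Evaluating `kl_probe_to_target` over a full grid of `mu` and `sigma` values gives a surface that can be drawn as a density plot. Working in log space with `np.logaddexp` avoids underflow when a wide probe puts mass far from both target modes.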