# project / soap

See project at github.com.

Soap (Second Order Activity Propagation) is a package for experimenting with feedback alignment and activity propagation in PyTorch. It was my project for the [IBRO-Simons Computational Neuroscience Summer School](http://imbizo.africa). For more information on feedback alignment, see "Random synaptic feedback weights support error backpropagation for deep learning" (Lillicrap et al., 2016).

The project started out as an attempt to get feedback alignment working on MNIST. Using a modern framework like PyTorch this turns out to be pretty easy to do, though slightly more verbose than I would expect. We can write an `FALinear` (feedback alignment linear) layer that behaves like `nn.Linear` for the forward pass, but uses a custom, random B-matrix to backpropagate gradients:

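    import torch
    from torch import nn
    from torch.autograd import Function
    from torch.nn import init


    class FALinearFunction(Function):
        """Linear map whose backward pass routes errors through B instead of W."""

        @staticmethod
        def forward(ctx, data_in, weight, bias, b_matrix):
            ctx.save_for_backward(data_in, weight, bias, b_matrix)
            # bias broadcasts over the batch dimension
            return data_in.mm(weight.t()) + bias

        @staticmethod
        def backward(ctx, grad_output):
            # NOTE: the body of backward was missing from this listing; what follows
            # is the standard feedback alignment rule, reconstructed.
            data_in, weight, bias, b_matrix = ctx.saved_tensors
            grad_input = grad_output.mm(b_matrix)      # B stands in for W here
            grad_weight = grad_output.t().mm(data_in)  # same as ordinary backprop
            grad_bias = grad_output.sum(0)
            return grad_input, grad_weight, grad_bias, None  # no gradient for B


    class FALinear(nn.Module):
        def __init__(self, num_in, num_out):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(num_out, num_in))
            self.bias = nn.Parameter(torch.empty(num_out))
            # B is a buffer, not a parameter, so the optimizer never updates it
            self.register_buffer('b_matrix', torch.empty(num_out, num_in))
            init.xavier_normal_(self.b_matrix)
            init.xavier_normal_(self.weight)
            init.zeros_(self.bias)

        def forward(self, data_in):
            return FALinearFunction.apply(data_in, self.weight, self.bias, self.b_matrix)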
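
For reference, here is the rule such a layer is meant to implement, written per layer. The notation is mine rather than the post's: $$h_l = f(a_l)$$ is the activity of layer $$l$$, $$a_l = W_l h_{l-1}$$ its pre-activation (bias omitted), $$\delta_l$$ the error signal at that layer, and $$\eta$$ the learning rate. Backpropagation sends the error back through the transposed forward weights:

$\delta_l = (W_{l+1}^T \delta_{l+1}) \odot f'(a_l)$

Feedback alignment replaces that transpose with a fixed random matrix and leaves the weight update itself unchanged:

$\delta_l = (B_{l+1} \delta_{l+1}) \odot f'(a_l), \qquad \Delta W_l = -\eta \, \delta_l h_{l-1}^T$

In this convention $$B_{l+1}$$ has the shape of $$W_{l+1}^T$$. The code stores `b_matrix` with the same shape as `weight` (`num_out` by `num_in`) and works on row-vector batches, so no explicit transpose is needed there.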

This layer does indeed train on MNIST, though more slowly than standard backpropagation:

I'd like to share my journey over the few days this project ran, because it took me far afield and led to some strong opinions (always a good outcome).

The immediate question is: can we improve on this? Although the two curves above look "close", drawing a horizontal line through both curves shows that FA takes roughly twice as long to reach a given loss as BP, at least on this particular run.
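
The "horizontal line" comparison can be made concrete with a few lines of Python. This is only a sketch: `steps_to_reach` and `slowdown` are hypothetical helpers, and the two curves are toy functions chosen for illustration, not the MNIST results from this post.

```python
def steps_to_reach(losses, target):
    """Return the index of the first step whose loss is <= target, or None."""
    for step, loss in enumerate(losses):
        if loss <= target:
            return step
    return None


def slowdown(losses_a, losses_b, target):
    """Ratio of the steps curve a needs to the steps curve b needs."""
    a = steps_to_reach(losses_a, target)
    b = steps_to_reach(losses_b, target)
    if a is None or b is None or b == 0:
        return None  # one curve never got there, or the ratio is undefined
    return a / b


# Toy curves for illustration only: the "FA" curve decays half as fast.
bp_curve = [1.0 / (1 + 0.10 * t) for t in range(200)]
fa_curve = [1.0 / (1 + 0.05 * t) for t in range(200)]

ratio = slowdown(fa_curve, bp_curve, target=0.3)
print(f"FA needs about {ratio:.2f}x as many steps as BP to reach loss 0.3")
```

Reading the ratio at several targets, and over several seeds, is a more honest summary than eyeballing one pair of curves.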

An observation by Muscovitz showed that we can improve performance a lot by breaking the rules a bit: instead of allowing no weight transport, we let a little information periodically "leak" into the $$B$$ matrix. In fact, it is literally one bit of information per weight: merely the sign of the transpose of the weight matrix. Moreover, this can be done every $$T$$ examples rather than every example (hence "slow"). We can write this as:

$B_t \leftarrow \operatorname{sign}(W_t^T) \mid t \in \{T, 2T, 3T, \ldots\}$
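
A minimal sketch of this update in PyTorch, assuming (as in the layer code earlier) that the feedback matrix is stored with the same shape as the weight, so the transpose in the formula disappears. The names `leak_sign` and `maybe_leak_sign` are hypothetical and not part of the package.

```python
import torch


@torch.no_grad()
def leak_sign(weight, b_matrix):
    """Overwrite b_matrix in place with the elementwise sign of weight."""
    b_matrix.copy_(torch.sign(weight))


def maybe_leak_sign(step, period, weight, b_matrix):
    """Apply the sign update only when step is a positive multiple of period."""
    if step > 0 and step % period == 0:
        leak_sign(weight, b_matrix)
        return True
    return False


# Schedule demo with T = 4; in real training the weights would change between steps.
torch.manual_seed(0)
weight = torch.randn(3, 5)
b_matrix = torch.randn(3, 5)
updated_steps = [t for t in range(1, 10) if maybe_leak_sign(t, 4, weight, b_matrix)]
print(updated_steps)
```

In a training loop the call would sit right after the optimizer step, once per layer, passing that layer's `weight` and `b_matrix`.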

We indeed get quite a jump in performance, to beyond that of ordinary backprop, which is surprising!

Unfortunately this single run is (unsurprisingly) misleading. There is a subtle effect here, which is that we happened to have picked a learning rate that was suboptimal for SGD. By controlling directly for learning rate we in fact find that the gains are much more modest. We can do this by changing how we update the $$B$$ matrix as follows:

$B_t \leftarrow |W_t| \operatorname{sign}(W_t^T) \mid t \in \{T, 2T, 3T, \ldots\}$

The performance gain largely disappears, and this form of FA is back to being worse than BP.
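
One plausible way to see why the scale of $$B$$ matters: a pure sign matrix has entries of magnitude 1, far larger than freshly initialised weights, so the error signal sent backwards is amplified much like a larger learning rate would amplify it. The sketch below compares norms. It assumes that $$|W_t|$$ denotes a single scalar magnitude (here the mean absolute weight), which the post does not pin down, so `scaled_sign` is a hypothetical helper, and the 10-by-784 shape is just an illustrative layer size.

```python
import torch
from torch.nn import init


def scaled_sign(weight):
    """Sign of each weight, rescaled by one scalar magnitude.

    Assumption: |W| is read as the mean absolute weight, which is only one
    possible interpretation of the formula in the text.
    """
    return weight.abs().mean() * torch.sign(weight)


torch.manual_seed(0)
weight = torch.empty(10, 784)    # illustrative shape, not taken from the post
init.xavier_normal_(weight)

plain_b = torch.sign(weight)     # entries are +1 or -1
scaled_b = scaled_sign(weight)   # entries are +m or -m, with m = mean |w|

# Frobenius norms relative to the forward weights
plain_ratio = (plain_b.norm() / weight.norm()).item()
scaled_ratio = (scaled_b.norm() / weight.norm()).item()
print(f"norm(sign(W)) / norm(W)     = {plain_ratio:.1f}")
print(f"norm(m * sign(W)) / norm(W) = {scaled_ratio:.2f}")
```

Under this reading, the rescaled matrix sends back an error signal on roughly the same scale as the forward weights, which removes the accidental change in effective step size.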

There is literally a confounder here, which we can express using this causal diagram:

NOTE: THIS BLOG POST IS UNFINISHED.