**Soap** (Second Order Activity Propagation) is a package for experimenting with feedback alignment and activity propagation in PyTorch. It was my project for the [IBRO-Simons Computational Neuroscience Summer School](http://imbizo.africa). For more information on feedback alignment, see Lillicrap et al., "Random synaptic feedback weights support error backpropagation for deep learning".

The project started out as an attempt to simply get feedback alignment working on MNIST. Using a modern framework like PyTorch this turns out to be pretty easy to do, though slightly more verbose than I would expect. We can write an `FALinear` layer that behaves like `Linear` for the forward pass, but uses a custom, random B-matrix to backpropagate gradients:

```python
import torch
from torch import nn
from torch.autograd import Function
from torch.nn import init


class FALinearFunction(Function):
    @staticmethod
    def forward(ctx, data_in, weight, bias, b_matrix):
        ctx.save_for_backward(data_in, weight, bias, b_matrix)
        return data_in.mm(weight.t()) + bias

    @staticmethod
    def backward(ctx, grad_out):
        data_in, weight, bias, b_matrix = ctx.saved_tensors
        # The input gradient flows through the fixed random b_matrix,
        # not through weight.t() as in ordinary backprop.
        return grad_out.mm(b_matrix), grad_out.t().mm(data_in), grad_out.sum(0), None


class FALinear(nn.Module):
    def __init__(self, num_in, num_out):
        super().__init__()
        self.weight = nn.Parameter(torch.Tensor(num_out, num_in))
        self.bias = nn.Parameter(torch.Tensor(num_out))
        self.register_buffer('b_matrix', torch.zeros(num_out, num_in))
        init.xavier_normal_(self.b_matrix)
        init.xavier_normal_(self.weight)
        init.zeros_(self.bias)

    def forward(self, data_in):
        return FALinearFunction.apply(data_in, self.weight, self.bias, self.b_matrix)
```
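To make the mechanics concrete outside of autograd, here is a minimal NumPy sketch of the same forward/backward arithmetic (variable names are illustrative; the only departure from ordinary backprop is that the input gradient flows through `B` instead of `W`):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_in, num_out = 4, 3, 2

W = rng.normal(size=(num_out, num_in))  # forward weights
B = rng.normal(size=(num_out, num_in))  # fixed random feedback matrix
b = np.zeros(num_out)

x = rng.normal(size=(batch, num_in))
y = x @ W.T + b                          # forward pass, as in FALinearFunction

grad_out = rng.normal(size=(batch, num_out))  # pretend upstream gradient

grad_in_fa = grad_out @ B                # FA: feedback through B, not W
grad_in_bp = grad_out @ W                # BP would transport W backwards
grad_W = grad_out.T @ x                  # the weight gradient is unchanged
grad_b = grad_out.sum(axis=0)
```

Note that only the signal sent to earlier layers changes; the layer's own weight and bias gradients are computed exactly as in backprop.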

This layer does indeed train on MNIST, though more slowly than standard backpropagation:

I'd like to share my journey over the few days this project ran, because it took me far afield and led to some strong opinions (always a good outcome).

The immediate question is: can we improve on this? Although the two curves above look "close", drawing a horizontal line through both curves shows that FA takes roughly twice as long to reach a given loss as BP --- at least on this particular run.

An observation by Muscovitz showed that we can improve performance a lot by breaking the rules a bit: instead of allowing *no* weight transport, we can allow a little bit of information to periodically "leak" into the B matrix. In fact, it is literally one **bit** of information per synapse: merely the *sign* of the transpose of the weight matrix. Moreover, this can be done every \(T\) examples rather than every example (hence "slow"). We can write this as:

\[B_t \leftarrow \operatorname{sign}(W_t^T), \quad t \in \{T, 2T, 3T, \ldots\}\]
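In code, with the `(num_out, num_in)` layout used above (where `b_matrix` already plays the role of \(W^T\), so no explicit transpose is needed), the periodic update is a one-liner. The loop below is a hypothetical sketch, not Soap's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                      # update period (illustrative value)
W = rng.normal(size=(5, 3))  # forward weights
B = rng.normal(size=(5, 3))  # feedback matrix, same layout as W

for step in range(1, 301):
    # ... forward pass, FA backward pass, SGD step on W would go here ...
    if step % T == 0:
        B = np.sign(W)       # leak only the sign of each forward weight
```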

We indeed get quite a jump in performance, to beyond that of ordinary backprop, which is surprising!

Unfortunately this single run is (unsurprisingly) misleading. There is a subtle effect here: we happened to have picked a learning rate that was suboptimal for SGD. Controlling directly for the learning rate, we in fact find that the gains are much more modest. We can do this by changing the update to the \(B\) matrix as follows:

\[B_t \leftarrow |W_t| \operatorname{sign}(W_t^T), \quad t \in \{T, 2T, 3T, \ldots\}\]
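As a sketch, and reading \(|W_t|\) as a scalar scale such as the mean absolute weight (this reading is my assumption; an elementwise magnitude times the sign would reconstruct \(W_t^T\) exactly and amount to full weight transport), the controlled update might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))

# ASSUMPTION: |W_t| is taken here to be a scalar magnitude (the mean
# absolute weight), so that B keeps the sign pattern of W but with a
# magnitude matched to W's overall scale.
scale = np.abs(W).mean()
B = scale * np.sign(W)
```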

The performance gain largely disappears, and this form of FA is back to being worse than BP.

There is a genuine confounder here, which we can express using this causal diagram:

**NOTE: THIS BLOG POST IS UNFINISHED.**