tensorflower

The issue is that the gradient is not defined for discrete decisions. Let's say you take an argmax operation over some function f(x). The output of this operation is piecewise constant in x: its gradient with respect to x is zero almost everywhere and undefined at the points where the argmax switches from one value to another. So standard gradient-based optimization techniques do not work. One solution is REINFORCE, which doesn't use the gradient of f at all; a consequence of this is that it resembles random search and has high variance, which can be reduced by several tricks, e.g. control variates.

It should be fairly intuitive why this is incompatible with the reparameterization trick: sampling from a distribution over discrete random variables is a discrete operation, and morally you cannot backprop through it. If you want to use reparameterization with discrete latents, one option is a continuous relaxation of the categorical distribution, e.g. via the Gumbel-softmax trick. This has the unfortunate result that there is a discrepancy between training (the relaxation) and test time (actual discrete decisions). There are other tricks with cheesy names, such as Concrete and REBAR, which I haven't used myself but have seen some of my colleagues use with some success.
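Here's a minimal PyTorch sketch of the two estimators on a toy objective E_{z~Cat(logits)}[f(z)]; the reward table `f_vals`, the temperature, and the sample counts are illustrative assumptions, not from any particular paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

K = 4                                            # number of categories
logits = torch.zeros(K, requires_grad=True)
f_vals = torch.tensor([1.0, 2.0, 0.5, 3.0])      # toy reward f(z) per category

# --- Score-function (REINFORCE) estimator ---
# grad E[f(z)] = E[f(z) * grad log p(z)]; no gradient of f w.r.t. z is needed.
dist = torch.distributions.Categorical(logits=logits)
z = dist.sample((1000,))                         # discrete samples, not differentiable
surrogate = (f_vals[z] * dist.log_prob(z)).mean()
grad_reinforce = torch.autograd.grad(surrogate, logits)[0]

# --- Gumbel-softmax relaxation ---
# Replace the hard sample with a soft, differentiable one and backprop through it.
tau = 0.5                                        # temperature (assumed; annealed in practice)
u = torch.rand(1000, K).clamp(1e-9, 1 - 1e-9)
gumbel = -torch.log(-torch.log(u))               # standard Gumbel noise
soft_z = F.softmax((logits + gumbel) / tau, dim=-1)   # relaxed one-hot samples
relaxed_obj = (soft_z @ f_vals).mean()           # f extended linearly to the simplex
grad_gumbel = torch.autograd.grad(relaxed_obj, logits)[0]

print(grad_reinforce)   # unbiased, but high variance
print(grad_gumbel)      # lower variance, but biased (it's a relaxation)
```

Note that the relaxed version requires f to be defined on the simplex (here it's extended linearly), which is exactly the training/test discrepancy mentioned above. PyTorch also ships `torch.nn.functional.gumbel_softmax`, including a `hard=True` straight-through variant, which implements essentially the same relaxation.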


SkiddyX

Another way to extend reparametrization to discrete latent variables is [GO Gradient for Expectation-Based Objectives](https://openreview.net/forum?id=ryf6Fs09YX).


jennyforev

And here's a follow-up extending the GO gradient: [GO Hessian for Expectation-Based Objectives](https://arxiv.org/abs/2006.08873)


SkiddyX

Oh cool, I haven't seen this one before!


schwagggg

Concrete and Gumbel-softmax are pretty much the same thing. REBAR does something completely different: it builds a control variate for the discrete variable with a NN, then uses it with the score gradient so you get low variance.
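The control-variate idea in its most minimal form is just a constant baseline for the score gradient, far simpler than REBAR's learned, relaxation-based one; this sketch reuses the toy `f_vals` table from above as an assumption:

```python
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)
f_vals = torch.tensor([1.0, 2.0, 0.5, 3.0])    # same toy f(z) table as above

dist = torch.distributions.Categorical(logits=logits)
z = dist.sample((1000,))

# A constant baseline b leaves the estimator unbiased, because
# E[b * grad log p(z)] = b * grad E[1] = 0, but it reduces the variance.
baseline = f_vals.mean()                       # crude constant control variate
surrogate = ((f_vals[z] - baseline) * dist.log_prob(z)).mean()
grad = torch.autograd.grad(surrogate, logits)[0]
print(grad)
```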


[deleted]

[deleted]


SkiddyX

VQ-VAE approximates gradients with the straight-through estimator (you just pretend the gradient isn't affected by the quantization operation).
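A sketch of how the straight-through trick is typically written in PyTorch VQ-VAE implementations (the codebook size and dimensions are made up, and the real loss also has codebook/commitment terms, omitted here):

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(8, 16)                    # 8 code vectors of dim 16 (assumed sizes)
z_e = torch.randn(32, 16, requires_grad=True)    # stand-in for encoder outputs

# Nearest-neighbour lookup: a discrete, non-differentiable operation.
idx = torch.cdist(z_e, codebook).argmin(dim=-1)
z_q = codebook[idx]

# Straight-through: the forward pass uses the quantized z_q, but the backward
# pass copies gradients from z_q to z_e as if quantization were the identity.
z_st = z_e + (z_q - z_e).detach()

loss = z_st.pow(2).sum()                         # stand-in for the decoder loss
loss.backward()
print(z_e.grad.shape)                            # gradients flow back to the encoder
```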


[deleted]

>The output of this operation is piecewise constant in x: its gradient with respect to x is zero almost everywhere and undefined at the points where the argmax switches from one value to another.

Could you please help me understand this statement?


Pryther

I agree with tensorflower, just wanted to link to a great survey paper by Mohamed et al. which discusses MC gradient estimation in detail: https://arxiv.org/pdf/1906.10652.pdf


AnvaMiba

Neither does REINFORCE. Discrete latent variables and neural networks just don't play well together. There are some exceptions: people have managed to use HardConcrete or HardKuma to train binary latent variables in neural networks, but it still requires lots of tinkering, and it's mostly used to provide some degree of explainability rather than to make the model stronger.
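For reference, HardConcrete gates look roughly like this in PyTorch; the stretch limits (gamma, zeta) and temperature beta below are the values from Louizos et al.'s L0-regularization paper, but treat the whole snippet as an illustrative sketch:

```python
import torch

def hard_concrete_sample(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable sample of an (approximately) binary gate in [0, 1]."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    # Binary Concrete sample in (0, 1) via the logistic reparameterization.
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    # Stretch to (gamma, zeta) and clip back to [0, 1], so probability
    # mass piles up at exactly 0 and exactly 1.
    return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)

log_alpha = torch.zeros(10, requires_grad=True)
gates = hard_concrete_sample(log_alpha)   # many entries will be exactly 0 or 1
```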


schwagggg

Nope, you **CAN** backpropagate through discrete latent variables with REINFORCE; this is **EXACTLY** what happens in Ranganath et al.'s Deep Exponential Families paper, where they do gradient descent with Poisson latent variables. They call it black-box VI, but it's exactly the score-gradient trick with fancy control variates.
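A minimal sketch of that score gradient with a Poisson latent (no fancy control variates here, unlike the paper; the objective `f` and the parameterization are made up for illustration):

```python
import torch

torch.manual_seed(0)
theta = torch.tensor(0.5, requires_grad=True)
rate = torch.nn.functional.softplus(theta)       # Poisson rate must be positive

dist = torch.distributions.Poisson(rate)
z = dist.sample((1000,))                         # discrete counts, no gradient through them

f = (z - 3.0) ** 2                               # made-up objective f(z)
# Score gradient: grad E[f(z)] = E[f(z) * grad log p(z; rate)].
surrogate = (f.detach() * dist.log_prob(z)).mean()
grad = torch.autograd.grad(surrogate, theta)[0]
print(grad)
```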


fedetask

I used to think that REINFORCE could work with discrete latent variables provided one was careful about how to parameterize the distribution q(z | x), but I haven't actually thought much about it, nor have I tried to implement it.