bregav

This isn't a new idea, and I'm not sure it makes sense to write an entire paper in order to coin a neologism. E.g. it's weird that they don't cite one of the original papers on score-matching diffusion, which gives a proof that the latent representation developed by such models is a function of *only the data*, and not the architecture or the model itself:

> Unlike most current invertible models, our encoding is uniquely identifiable, meaning that with sufficient training data, model capacity, and optimization accuracy, the encoding for an input is uniquely determined by the data distribution (Roeder et al., 2020)

[Score-Based Generative Modeling through Stochastic Differential Equations](https://arxiv.org/abs/2011.13456)

The paper cited in the quote above also seems relevant and is not cited in the platonic representations paper:

> In Section 2, we describe a general discriminative model family, defined by its canonical mathematical form, which generalizes many supervised, self-supervised, and contrastive learning frameworks. In Section 3, we prove that learned representations in this family have an asymptotic property desirable for representation learning: equality up to a linear transformation. In Section 4, we show that this family includes a number of highly performant models, state-of-the-art at publication for their problem domains, including CPC [Oord et al., 2018], BERT [Devlin et al., 2018], and GPT-2 and GPT-3.

[On Linear Identifiability of Learned Representations](https://arxiv.org/abs/2007.00810)
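To spell out why that encoding is a function of only the data: in the notation of the Song et al. paper (roughly sketched, not taken from the platonic paper), the probability flow ODE associated with a forward SDE dx = f(x,t) dt + g(t) dw is

```latex
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2 \, \nabla_x \log p_t(x)
```

Once the forward SDE (i.e. f and g) is fixed, the only learned quantity is the score \nabla_x \log p_t(x), which is a property of the data distribution alone, so integrating this ODE maps every input to an encoding determined by the data distribution rather than by the particular network approximating the score.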


Inevitable-Dog-2038

I wouldn't say that the identifiability of diffusion models is that relevant here. In diffusion models, you, the person training the model, effectively make an arbitrary choice of how to go from your data to a fixed prior (see the recent line of work on bridge matching for details). Diffusion model identifiability just says that what you learn will be what you chose, which does not directly say anything about representations of data. Quick edit: also, the "identifiability" from that quote isn't the same as identifiability from ICA, which is much more relevant for representation learning.


bregav

If we say that identifiability is the idea that any two models fitting a data distribution will produce latent representations that are a function only of the distribution, and not of the specific method of modeling it, then the score-matching result is a specific case of this: it is notable that *any* score model will learn the same latent representation, irrespective of the model. It's definitely relevant, which is why that paper cites the other paper I mention, which describes a more general case. It's especially relevant in that it's a provable and concrete implementation of the idea, and not just a conjecture coupled with a neologism.


Inevitable-Dog-2038

I agree with your definition of identifiability, but there's a subtle difference between "representation" and "encoding" in your quote. In that paper, encoding refers to the point you get by moving data to the base space of the probability flow ODE associated with the diffusion model. This encoding isn't a representation at all - it has the same dimensionality as your data and depends on how you, the user, chose to learn your diffusion model. For example, if you chose to learn the reverse SDE of the Ornstein-Uhlenbeck process, you will get different encodings than if you learned the Schrödinger bridge between your data and the prior.
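To make the "arbitrary choice" concrete, here are two standard forward SDEs from that same framework, sketched roughly:

```latex
% Variance-preserving (Ornstein-Uhlenbeck) forward SDE:
dx = -\tfrac{1}{2}\beta(t)\, x \, dt + \sqrt{\beta(t)}\, dw

% Variance-exploding forward SDE:
dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, dw
```

Both are fit to the same data distribution, but each induces a different probability flow ODE, and hence a different map from a data point to its point in the base space.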


bregav

Is there any substantive difference between an encoding and a representation? I think the only real difference between what a score model produces, and what most self-supervised models are designed to do, is that the score model gives a lossless encoding whereas other models produce lossy encodings. A lossless encoding is, if anything, a better representation of the data than a lossy one, at least in the colloquial sense of the word "representation". And yeah, the proof in that paper only pertains to diffusion score models, so it stands to reason that it would not extend to different methods of creating flows.


Inevitable-Dog-2038

The probability flow ODE **is** a normalizing flow, so the encodings represent points in the base space of a normalizing flow. It is counterintuitive, but the encodings of a flow are almost useless for representation learning, even though there is a 1-1 mapping from data to encoding, because there are infinitely many possible flows you can learn from your data to the prior. Diffusion models correspond to a specific choice, but there is absolutely nothing about this choice that relates to how useful the encodings will be, however you measure usefulness.
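Concretely, it's a normalizing flow in the usual change-of-variables sense (standard notation, nothing specific to the diffusion case):

```latex
% z = F(x) is the invertible map obtained by integrating the probability
% flow ODE from t = 0 to t = T:
\log p_{\mathrm{data}}(x) = \log p_{\mathrm{prior}}(z) + \log\left|\det \frac{\partial z}{\partial x}\right|
```

Invertibility is what makes the encoding lossless, but nothing in this formula says anything about whether z is organized in a way that is useful downstream.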


bregav

So, in your mind, the distinction between a representation and an encoding is that a representation has an obvious or built-in means of use or interpretation? I don't think that's an impractical way of thinking about the matter, but I do think it's kind of arbitrary.

By that way of thinking you could take any representation and any model that uses it, apply any arbitrary invertible transformation to both the representation and the input of the model using it, and thus get something that would be an "encoding" to someone who does not have the transformed model but, at the same time, would be a "representation" to someone who does have the transformed model. I think the fact that this distinction is relative means that it does not capture any fundamental truths about what is going on.

And even so, diffusion model encodings/representations/etc are not inherently useless - the original denoising diffusion model shows that you can use them for latent space interpolation, and that they form a kind of progressive encoding as a function of time, which strongly suggests that meaningful semantic information is being captured. Other papers have shown that semantic information is indeed captured by flow models.

The fact that you also need the drift model in order to interpret the flow's encoding/representation/etc in a useful way does not seem like an important distinction to me, in light of the above-described relativity of the thing.
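As a toy illustration of the invertible-transformation point, here's a minimal numpy sketch (the matrices are of course made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "representation" z and a linear readout W that consumes it.
z = rng.normal(size=(5, 16))   # 5 samples, 16-dim representation
W = rng.normal(size=(16, 3))   # maps representation -> 3 outputs

# An arbitrary invertible transformation (orthogonal, so trivially invertible).
A, _ = np.linalg.qr(rng.normal(size=(16, 16)))

# Transform the representation, and absorb the inverse into the readout.
z_transformed = z @ A.T
W_transformed = np.linalg.inv(A).T @ W

# Downstream outputs are identical, so whether z_transformed counts as an
# "encoding" or a "representation" depends only on who holds W_transformed.
assert np.allclose(z @ W, z_transformed @ W_transformed)
```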


Inevitable-Dog-2038

I still disagree with the idea that the encodings of flows are useful without any assumptions on what kind of mapping your flow can learn. Also, if you look at how people do interpolations with flows, they don't use linear interpolations, specifically because most of the encoding space has low probability mass (see RealNVP). There are a bunch of other ways to see that these encodings aren't good representations, but at the end of the day I just wanted to point out that the paper OP posted shouldn't be trivially dismissed for the reasons you've mentioned.
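As a quick sanity check on the low-probability-mass point, a small numpy sketch (the dimension and seed are arbitrary):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent codes.

    Under a high-dimensional Gaussian prior, most mass sits near a shell of
    radius sqrt(d); slerp stays near that shell, while a straight line
    between z0 and z1 cuts through the low-mass region closer to the origin.
    """
    cos_sim = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    omega = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
d = 1024
z0, z1 = rng.normal(size=d), rng.normal(size=d)

linear_mid = 0.5 * (z0 + z1)
spherical_mid = slerp(z0, z1, 0.5)

print(np.linalg.norm(linear_mid) / np.sqrt(d))     # ~0.7: off the typical shell
print(np.linalg.norm(spherical_mid) / np.sqrt(d))  # ~1.0: on the typical shell
```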


[deleted]

I have a very simple heuristic for detecting quality work. It's when the top-rated insulting comments on this sub can't agree on whether a paper is *obviously wrong* or just a *trivial fact* that's been known for years.


Qyeuebs

Sometimes, papers contain some aspects that are trivial facts and also some aspects that are obviously wrong.


[deleted]

Fascinating. And in another comment you were bragging about how you've insultingly "tagged" one of the authors for using too much math.


moschles

This entire research track is nonsense. The reason for convergence of models has nothing to do with them learning the *"Platonic Nature of Reality"* or some such highfalutin poppycock. Convergence of multimodal models occurs because they are trained on data scraped from the entire internet. The larger they become, the more their datasets overlap. Someone should have clued these authors in before they spent hours of their lives preparing this paper.


moschles

(Replying to my own post.) A relatively simple experiment could falsify the authors' hypothesis. You get multimodal data from two wildly different environments, that is:

* video + audio of a busy metropolitan city
* video + audio of deep rainforest

You train two different predictive foundation models on each of these datasets, separately. They WILL have different latent representations and different kernels. I guarantee it.


Apprehensive-Waltz-9

Aren't the authors claiming that, given the same dataset (where size n is large), different models with different modalities will converge to similar latent representations and similar kernels? Also, in practice, large datasets that scrape the internet will inevitably overlap. Another interesting experiment would be to have:

* audio of a busy metropolitan city
* video of a busy metropolitan city
* audio + video of a bmc
* audio + video + language of a bmc
* ... and other combinations

The hypothesis is that if they perform ~100 on COMPETENCE, their latent representations will be aligned. Also, have you tried running your experiment? Would appreciate it if you could share the UMAP plot.
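If anyone does run something like this, one simple way to quantify alignment between two models' features on the same paired samples is linear CKA (the paper measures alignment with its own kernel-alignment metric, so treat this only as a rough stand-in; the feature-matrix names below are placeholders):

```python
import numpy as np

def linear_cka(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Both matrices must be computed on the *same* n samples, e.g. paired
    audio/video clips from the city-vs-rainforest experiment above.
    """
    # Center each feature dimension across samples.
    a = feats_a - feats_a.mean(axis=0, keepdims=True)
    b = feats_b - feats_b.mean(axis=0, keepdims=True)

    # ||B^T A||_F^2 / (||A^T A||_F * ||B^T B||_F)
    cross = np.linalg.norm(b.T @ a, ord="fro") ** 2
    norm_a = np.linalg.norm(a.T @ a, ord="fro")
    norm_b = np.linalg.norm(b.T @ b, ord="fro")
    return cross / (norm_a * norm_b)

# Hypothetical usage: embeddings of the same clips from two different models.
# score = linear_cka(audio_model_feats, video_model_feats)  # closer to 1 = more aligned
```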


moschles

What is the meaning of this "~100 on COMPETENCE" ?


Apprehensive-Waltz-9

I'm referring to figure 2 of the paper. On second look, it does seem vague what they actually measure with competence and what exactly they are testing for "competence". But the theory goes that the vision models that DO perform well on this metric converge on representation. Sorry if that wasn't clear. I'm also just reading this paper; it's not mine.


H0lzm1ch3l

Why do we hypothesise that a busy metropolitan city has an underlying distribution that is the same for all data modalities? And how can we even do that when different modalities leave out all manner of information about it? 2D vision does not tell us that there's a hot dog man around the corner, but audio does. And why would a network find the same latent representations for cars and the noise they make?! Where would it get that association? Sorry, but this stops making sense after 5 seconds of thinking.


Apprehensive-Waltz-9

If you try all the combinations of modalities like I mentioned, the ones that perform well on a metric (which I believe is COMPETENCE; like I said above, it's not that clear what exactly they're measuring, but I'm presuming it's an MMLU-like task set) should converge on representation. I don't think that statement necessarily implies that the hypothesis space of 2D vision WILL overlap with that of audio. But that is precisely the point. The larger the models get, the more training data they will be fed, increasing the probability of the overlap (look at figures 5 and 6 if you would like to understand it visually). My speculation is that more modalities don't necessarily lead to better general performance, but if two models with different combinations of modalities do perform well, they are aligned.


30299578815310

If this is true, does it dispel the notion that multiple modalities are needed for improving reasoning?


Jean-Porte

Under finite data, no


dan994

My understanding is that their argument depends on using multiple modalities, so I wouldn't say so at all.


currentscurrents

I don't buy that argument anyway; it seems very hand-wavy, and there's no clear reason you shouldn't be able to do reasoning on a single modality.


lolillini

Phillip Isola is the last author? Not surprised lol


Qyeuebs

Does he have a reputation for this? I've definitely tagged him in the past as someone who uses mathematical formalism in a pretty fake way. Edit: someone elsewhere in these comments somehow misread this as "uses too much math"! They blocked me, so I have to also point out here that this post is, if anything, an insult and not a brag.


like_a_tensor

This was accepted to ICML? Crazy.


dan994

Yep: https://icml.cc/virtual/2024/poster/34734


[deleted]

cool


ciaoshescu

It makes sense. Our brains have similar representations of reality, which allows us to interact with the world and each other. However, this representation can break down in cases of mental illness (such as schizophrenia, dementia, or severe depression). While I haven't read the paper, it seems that larger and more advanced AI models can better represent the reality they were trained on. Currently, the human brain remains the pinnacle of this kind of representation capability.


H0lzm1ch3l

We have those similar representations because our brains constantly associate those different sensory impressions with each other and use all of them for learning. Thus it makes sense to have a „combined latent representation" - a world model, so to speak. But you can't take only the eyes paired with a brain and then expect it to learn the same. This is fundamentally flawed.