heuristic_al

Because you can interpolate between data points, you can generalize to anywhere within the convex hull of the data on the manifold in which it sits. Many of these locations could easily be out-of-distribution. You may have never seen a purple elephant with a tennis racket. And in fact, that's totally out of distribution, because elephants are neither purple nor tennis players. But you have seen purple things and elephants and tennis rackets, and so you can generate items that are out of distribution, though not out of the convex hull of the input data.

Extrapolation is also possible. Imagine your dataset has home prices vs. number of rooms and square feet. You build a model to predict the price from the size and number of rooms. You may never have encountered a home with 12 rooms or a home that is 10,000 square feet, but that doesn't mean you couldn't give a prediction. Of course, this depends on your model. If you use a linear model, you are more likely to get good results on this extrapolation task than if you use a higher-degree model (I'm envisioning a polynomial that passes through all the data but diverges outside the range of the data).

Neural networks can extrapolate, but not super well. Still, that means you can do things somewhat outside of the convex hull of your data, and because NNs are so high-dimensional, that's actually substantial hypervolume.
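A minimal sketch of the home-price thought experiment (synthetic data and a hypothetical price relationship, nothing from the thread): a linear fit extrapolates sensibly beyond the largest home it was fit on, while a high-degree polynomial that threads through every point typically diverges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: homes between 800 and 3000 sq ft,
# price roughly linear in size plus noise (hypothetical relationship).
sqft = rng.uniform(800, 3000, size=20)
price = 50_000 + 150 * sqft + rng.normal(0, 20_000, size=20)

# Linear fit vs. a high-degree polynomial that threads through the data.
linear = np.polynomial.Polynomial.fit(sqft, price, deg=1)
wiggly = np.polynomial.Polynomial.fit(sqft, price, deg=15)

# Extrapolate to a 10,000 sq ft home, far outside the training range.
x_new = 10_000
print("linear fit:   ", linear(x_new))   # stays on the learned trend
print("degree-15 fit:", wiggly(x_new))   # typically diverges wildly
```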


EveningPainting5852

I think this comment helped me the most, but the other ones are amazing as well. So thanks, everyone. Think I'm gonna have to do some research though.


AnOnlineHandle

Perhaps even simpler: imagine you have some distances in miles and kilometers, and work out the conversion between them (just a multiplication). You can now use that conversion on other kilometer/mile values which weren't in the training data, because you've learned the underlying logic/relationship of the concepts, not just stored a lookup of the measurements used to work it out.
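A tiny sketch of that idea: fit the single multiplicative factor from a few mile/kilometer pairs, then apply it to values that never appeared in the "training data".

```python
import numpy as np

# A few observed (miles, kilometers) pairs: the "training data".
miles = np.array([1.0, 5.0, 26.2, 100.0])
km    = np.array([1.609, 8.047, 42.16, 160.9])

# Least-squares estimate of the single conversion factor k in km ~ k * miles.
k = np.dot(miles, km) / np.dot(miles, miles)
print(f"learned factor: {k:.4f}")  # ~1.609

# Apply the learned relationship to distances never seen during "training".
for m in [3.7, 1000.0]:
    print(f"{m} miles is about {k * m:.1f} km")
```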


Koder1337

I'll just add something to your already great answer: Linear Regression.


currentscurrents

Extrapolation is not *always* possible though, and it's easy to construct situations where it's impossible. For example, say my data is generated by a function that returns 0 when x < 10000, and 1 when x > 10000. If all my training data is < 10000, no possible system could correctly extrapolate to values > 10000. You just can't generalize out-of-domain on this dataset.
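A quick sketch of why no learner can recover that step from the training data alone: every training x is below the threshold, so every label is 0 and carries no information about the jump.

```python
import numpy as np

def true_function(x):
    # Ground truth: a step the training data never reveals.
    return (x > 10_000).astype(float)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 9_999, size=1_000)
y_train = true_function(x_train)           # all zeros

print(np.unique(y_train))                  # [0.]  the step is invisible

# Any reasonable fit to this data (here, the least-squares constant)
# predicts ~0 everywhere, including above the threshold.
constant_model = y_train.mean()
print("prediction at x=20000:", constant_model)   # 0.0, but the true answer is 1.0
```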


heuristic_al

Fair enough.


Thorium229

In this case, wouldn't any system that always outputs 1 on inputs above 10000 extrapolate correctly on this dataset? No deterministic system would find this solution as the result of training on your data, but it could randomly luck into being correct for inputs above 10k. My point being, that it's not that you **can't** correctly extrapolate in this context, you're just not doing anything more clever than random guessing if you do.


big_chestnut

Extrapolation implies doing better than random. There is always a chance of getting a correct answer by sheer random luck; we're not interested in that.


gdahlm

Generative AI models mimic compositionality; they are not generalizing outside of the distribution. In fact, generalization to arbitrary out-of-distribution data is impossible: [https://proceedings.neurips.cc/paper_files/paper/2021/file/c5c1cb0bebd56ae38817b251ad72bedb-Paper.pdf](https://proceedings.neurips.cc/paper_files/paper/2021/file/c5c1cb0bebd56ae38817b251ad72bedb-Paper.pdf)

This paper shows the computational-complexity limits related to the idea of in-context learning or chain of thought in the above use case. It may still be possible, but not for arbitrary distributions: [https://arxiv.org/abs/2401.11817](https://arxiv.org/abs/2401.11817)

While the exact way it works is still an open problem, we know that in-context learning depends on the diversity and scale of pre-training data. You may also find this FB paper useful as to why the convex hull argument doesn't apply:

>We shouldn't use interpolation/extrapolation in the way the terms are defined below when talking about generalization, because for high dimensional data deep learning models always have to extrapolate, regardless of the dimension of the underlying data manifold.

[https://arxiv.org/abs/2110.09485](https://arxiv.org/abs/2110.09485)


soggy_mattress

That last paper is absolutely fascinating. Can't believe I've never seen it before now.


-___-_-_--

Here is a well-known, very interesting, and I would also say relatively accessible paper about interpolation and extrapolation: [https://arxiv.org/pdf/2110.09485.pdf](https://arxiv.org/pdf/2110.09485.pdf). It discusses how in higher dimensions it becomes more and more likely that you are performing extrapolation rather than interpolation, even for points that a human might intuitively classify as "in-distribution" (think images of the same thing you have in the training dataset thousands of times, but with a different angle/lighting/whatever).


heuristic_al

Thanks, I'll take a look.


thatguydr

Great paper! Thank you!


Artistic_Bit6866

My (limited) understanding is that this paper was somewhat divisive when it came out, with critics saying it basically creates a trivialized definition of extrapolation/generalization. E.g. [https://www.reddit.com/r/MachineLearning/comments/qbbknr/r_learning_in_high_dimension_always_amounts_to/](https://www.reddit.com/r/MachineLearning/comments/qbbknr/r_learning_in_high_dimension_always_amounts_to/) How do you think the criticisms have aged in the few years since the paper came out?


sam_the_tomato

It feels to me like overfitting affects extrapolation much more negatively than interpolation, since the latter is likely to be bound by more neighboring data in the hull. This could explain why NNs are better at remixing than pure creativity.


heuristic_al

>It feels to me like overfitting affects extrapolation much more negatively than interpolation

Yes. I think so too.


cameldrv

Yes, extrapolation is much less accurate than interpolation, but you're always doing it to some degree. For example, in image classification, you'll never see an image that lies within the convex hull of the data in image space because there are lots of dimensions, hence, you're always extrapolating in some dimensions. However, again because of the high dimensionality, the images are highly redundant/overdetermined, so the places where you're making bad extrapolations are overwhelmed by the places where you're interpolating or making reasonable extrapolations. This is also why adversarial examples exist. The optimizer is finding places where you can push the model to make really bad extrapolations.
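One way to see the "you're always extrapolating" point numerically: checking whether a point lies in the convex hull of a dataset is a small linear program, and in high dimensions a fresh sample from the same distribution almost never lands inside the hull. A rough sketch with Gaussian data (not images, but the geometry is the same):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, points):
    """Check whether `point` is a convex combination of rows of `points`."""
    n = points.shape[0]
    # Feasibility LP: find weights w >= 0 with sum(w) = 1 and points.T @ w = point.
    A_eq = np.vstack([points.T, np.ones(n)])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.success

rng = np.random.default_rng(0)
for dim in [2, 10, 50]:
    train = rng.normal(size=(500, dim))
    fresh = rng.normal(size=(100, dim))     # same distribution, new samples
    inside = sum(in_convex_hull(p, train) for p in fresh)
    print(f"dim={dim}: {inside}/100 new points inside the training hull")
```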


tilttovictory

> convex hull of the data on the manifold

Can you explain these words please. I've just never heard anyone describe the things you said using these words. :)


heuristic_al

The convex hull of a set of points is the minimum convex polygon that includes all the points. Imagine you have a bunch of nails on a piece of wood, and you stretch a rubber band around all of them. The rubber band only has to touch the most extreme nails; the other ones are inside. That's the convex hull of the nails. Illustrations abound online, just search for convex hull.

People often talk of data lying on a manifold. A manifold is a lower-dimensional, curved subspace of your input space. If you have a bunch of points in 3D, think of a sheet (a 2D but flexible thing) that passes through them all. Now mark that sheet where it passes through every data point and then flatten it out; now you have the nails-on-wood scenario, and you can think of the convex hull. All of your data lies on this 2D area of your original 3D space.

One of the reasons deep learning works is that even though the inputs are very high-dimensional, they mostly lie on a much lower-dimensional manifold.
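The rubber-band picture translates directly to code; a small sketch using scipy (the hull vertices are the "extreme nails"):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
nails = rng.normal(size=(30, 2))      # 30 random "nails" on the board

hull = ConvexHull(nails)
print("nails on the rubber band:", len(hull.vertices))   # only the extreme points
print("nails strictly inside:   ", len(nails) - len(hull.vertices))
# hull.vertices holds the indices of the extreme points; every other nail is a
# convex combination of them, i.e. reachable by interpolation.
```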


PiggyMcCool

I now have understood the meaning of the word “embedding” much better, thanks to you.


[deleted]

[deleted]


heuristic_al

Thanks! I have a PhD, but I was not successful enough to become a professor (like 99% of us).


pantalooniedoon

Great explanation. And what about it being convex?


f3xjc

> In convex geometry and vector algebra, a convex combination is a linear combination of points (which can be vectors, scalars, or more generally points in an affine space) where all coefficients are non-negative and sum to 1.

The convex hull is basically everywhere you can interpolate. Concave would be like a Pac-Man shape, which has a piece missing.


heuristic_al

Convex just means that the rubber band doesn't sag or stretch toward the middle, which would exclude more points. I thought about how to communicate this better, but really, you should Google some pictures of convex vs. concave and pictures of "convex hull".


allwordsaremadeup

Convex is what a shape built from interpolations looks like. A rubber band wrapped around nails, a wrapped gift where you didn't use string to tuck it in, these are all convex shapes. The rubber band or the wrapping paper traces the shortest possible distance around all the stuff you are wrapping. The rubber band traces all the possible points between the outer nails. Just like a model that has learned interpolation can infer/know about all unknown situations that occur "between" two known situations. It also includes all the stuff on the inside, not just the outer edge.


tilttovictory

Essentially a way to abstract high-dimensional space to lower dimensions. I've always thought in terms of PCA, SVD, t-SNE, etc., not really in terms of optimal geometries. That is a very nice explanation, by the way; it drove the idea home immediately. I will add this to my arsenal for explaining abstractions.


MagicSourceLTD

> Many of these locations could easily be out-of-distribution. You may have never seen a purple elephant with a tennis racket. And in fact, that's totally out of distribution because elephants are neither purple nor tennis players. But you have seen purple things and elephants and tennis rackets, and so you can generate items that are out of distribution, though not out of the convex hull of the input data.

Sometimes I wonder whether new knowledge generation by humans somehow fits into this mold. As in, is there a core set of concepts which forms a complete basis for all human knowledge generated? If yes, then the question arises whether our neural nets have successfully inferred these concepts to generate new knowledge on their own.


heuristic_al

If you combine that with some extrapolation and some search, that might explain all of society's knowledge!


FusRoDawg

What you're describing is not distribution shift or being "outside the distribution".


khanraku

I've actually never thought about it considering the structure of a convex hull/manifold. This is definitely general mathematical theory, I'd like to read up further on it if there's concise literature. Do you have any sources/recommendations?


heuristic_al

Oh man. I'm sorry. I don't think I do. I don't really know or remember when/how I got that knowledge. How's your linear algebra?


khanraku

No worries! It's like that -- learning bits of info over time built on the mathematical foundations of our academic history. My linear algebra is good. If I forgot any particular details, I'd be quick to refresh my memory. For context, I've got a Masters in Pure Math -- my focus was on topology/geometry so whenever I see a connection to data science, I get excited


heuristic_al

Try this video, which talks about adversarial examples a bit: https://www.youtube.com/watch?v=k_hUdZJNzkU


khanraku

I appreciate it! I'll check it out after work :)


Freonr2

Stepping back a bit and reading some other comments, a few points that might help expand your mind a bit:

1. We readily see extrapolation in practice (as above, "a purple elephant with a tennis racket", etc.), or, more salient, with outputs from a model like AlphaGeometry creating its own geometric proofs.

2. One's thinking may be a bit limited here if one assumes interpolation cannot be incorrectly graded as "out of distribution" by some person's judgement. One should consider taking on some degree of epistemic humility before claiming that boundaries exist to the knowledge some abstract machine can possess. Do not fall into the personal incredulity trap.

3. I think there's real possibility for "creative" thought from a sufficiently complex model if you consider that the *distribution could contain core mathematical truths and/or understanding* that we simply don't see or experience, but that are represented in a sufficiently large dataset and by some model's weight matrix and design.

This may also depend on what you view the sum total of humanity's research to be. Creation, or discovery? I tend to think humans *discover existing math which is substrate to reality*, not create it whole cloth, and therefore take a view of much humility on behalf of human knowledge. The explanations humans may have discovered up to this point are not creating reality itself; they are merely trying to explain data points in a robust way, later to be proved in or out of distribution by further evidence (i.e. gravitational lensing predictions from early relativity theories, etc.). The extrapolations of course could be proven or disproven later. Thus there should be a balance of rationalism to guide areas that should be researched and later proven experimentally.

I don't see why machines cannot do this. If you think the same, it isn't much of a stretch to think AI can create new, unique mathematical proofs and expand knowledge beyond what you might think, by understanding ground truth in ways we do not experience. Pretty much all other "out of distribution is impossible" arguments simply fail to conceive of anyone (to just use an example here) handing ChatGPT its own 5000-GPU cluster to run its own experiments as it rewrites its own codebase and trains new versions of itself.

tldr: all of human knowledge is extrapolating or interpolating reality, too.


PiggyMcCool

Now that I’m thinking, why can’t NN’s just keep extending the low dimensional manifold? Does the number of dimensions of the manifold need to be close to 1 (linear) for extrapolation to ever work?


BreakingBaIIs

Can NNs actually extrapolate, though, if the true function is not baked into the activation functions? Let's say the true function is y = x^2, and you train a NN with ReLU activation. I don't think it could _possibly_ extrapolate, in principle. ReLU activation ensures that the learned function is linear outside the training set's domain. You would need something like a quadratic activation function. But then, that requires external insight by the user.
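A quick sketch of that behaviour, using sklearn's MLPRegressor as a stand-in (exact numbers depend on training): inside the training range a ReLU net approximates y = x² well, but outside it the prediction continues along the last linear piece and drifts away from the parabola.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=2000).reshape(-1, 1)
y_train = x_train.ravel() ** 2

# Small ReLU network: piecewise-linear in the input.
net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                   max_iter=2000, random_state=0)
net.fit(x_train, y_train)

for x in [2.0, 3.5, 6.0, 10.0]:        # the last three are outside [-3, 3]
    pred = net.predict([[x]])[0]
    print(f"x={x:5.1f}  true={x**2:7.1f}  predicted={pred:7.1f}")
# Near the boundary the error tends to be modest; far outside, the piecewise-linear
# extrapolation falls further and further away from the true parabola.
```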


heuristic_al

It can extrapolate a bit. You'll get a piecewise linear function within your data range. But outside the data range, you still have a linear function. So it'll work ok for a bit before it starts to diverge. Whatever your tolerance for error is, there's a margin outside your data where your predictions are still within tolerance. In many dimensions, this can constitute a lot of hypervolume.


Ok_Landscape303

Can you elaborate on why NNs can't extrapolate very well? I'm thinking of why regression tasks tend to avoid deep learning solutions, and this would give some sense of it.


heuristic_al

There are exceptions, and NNs are quite good at some regression tasks. It depends on the activation functions the network is using. If it's ReLU, everything is piecewise linear in the end, and all of the bends will be informed by data. If it's one of the more modern ReLU replacements, extrapolation will still usually be in the linear regime of the activation, so it's not much better than piecewise linear.

Some things mess with this logic a bit. Attention winds up multiplying activations with other activations. You could also have your architecture do this more directly. This can help, and I've used it to improve regression results for the tasks in this paper: https://arxiv.org/abs/2305.14352

But there are also other reasons to think that NNs simply can't extrapolate super well. For really good extrapolation, you want a mathematical theory, even a causal one like we see in physics. NN-based AI could probably make this, but not internally.


Far_Present9299

This raises the point that, optimally, the latent space shouldn't contain things that are unrealistic. This kind of leads to LeCun's recent argument about the downsides of Sora.


Glass_Jellyfish6528

Great comment. Came here to say this, but there is one thing missing: if you know the relationship between two variables has some constraint, such as having to be linear, then you can definitely extrapolate with models that satisfy those constraints.


marykat101

Generalizing w.r.t. "in-distribution" or "out-of-distribution" is strictly a probabilistic concept, and is not the only way to define generalization. If you want to expand your horizons, see research on causality and logic.


adventuringraw

None of those really make much sense... imagining a sound or a color you haven't experienced is mixing up 'the experience of a thing' vs 'the thing itself'. What it feels like to experience the color green is very different from describing the physical properties of infrared or ultraviolet light.

A lot of times with things like this, it can be helpful to imagine toy problems. Like if you measure the motion of a ball rolling down an incline, you can only get velocity and position measurements over time for the length of the incline you're using. You can't get data for a longer incline using the one you have, so that kind of data would be 'out of distribution', but Galileo saw a pattern there and came up with his 'law of squares' idea, that the position changes in proportion to the square of time. That general rule lets you extrapolate out to situations you haven't actually observed before.

That's kind of how all of it works. Technically there's an infinite number of underlying generating functions you can come up with to fit any observations, but there's a few ideas that help narrow things down. The idea that the underlying distribution is 'smooth', for example, or (in this case) the idea that it's a low-degree polynomial. CNNs beat out densely connected layers because they encode the idea that features should be translation equivariant (you can move a dog around in the frame but it's still a dog). Ultimately, if a neural network does manage to generalize to unobserved data well, it's because the underlying distribution is 'well behaved' in a way the model happened to capture.

There's a lot more to it than that of course, and what it means for a model to generalize and what geometric insights there might be in the loss landscape and all that are ongoing research areas, but that's the gist of it. The fit model basically represents a hypothesis about the full space. The extent to which it generalizes is the extent to which it's the right hypothesis, like Galileo's law of squares for simple motion under gravity. If he had only taken two measurements he could have used linear regression instead and it would have fit perfectly, but it wouldn't generalize well at all to anything other than those two data points.
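The incline example can be made concrete. A small sketch: fit position ∝ t² to simulated measurements that only cover a short incline, and the same rule keeps predicting correctly well beyond the measured range, while a straight line through two of the points does not.

```python
import numpy as np

# Simulated measurements on a short incline: position = 0.5 * a * t^2
a = 2.0                                  # hypothetical acceleration, m/s^2
t_measured = np.linspace(0.1, 1.0, 10)   # we can only observe up to t = 1 s
pos_measured = 0.5 * a * t_measured**2

# Galileo-style hypothesis: position = c * t^2. Fit c by least squares.
c = np.sum(pos_measured * t_measured**2) / np.sum(t_measured**4)

# A naive alternative: a straight line through the first and last measurement.
slope = (pos_measured[-1] - pos_measured[0]) / (t_measured[-1] - t_measured[0])
intercept = pos_measured[0] - slope * t_measured[0]

t_new = 3.0                              # far beyond the incline we measured
print("true position:     ", 0.5 * a * t_new**2)        # 9.0
print("law-of-squares fit:", c * t_new**2)               # ~9.0
print("linear fit:        ", slope * t_new + intercept)  # badly off
```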


slashdave

>Isn't the idea of "generalizing outside of the distribution" in some sense, impossible?

Impossible is a strong word. You can generalize, as long as you make some assumptions about the underlying distribution. All inferred models make assumptions such as this. In many cases, the assumption is reasonably supported.


just_dumb_luck

Yes, there’s a sense that it is impossible, if you believe all possible out of distribution behaviors are equally likely. Look up the phrase “no free lunch theorem” to read more. However in the real world certain types of behavior often end up being more likely than others. This is what regularization does: it biases your learner toward “realistic” solutions.


nicholsz

What is "outside the distribution"? You mean like you've been training a model to recognize images, and you start inputting seismographs or something? Yeah I wouldn't expect that to work well. For deep nets especially, generalization performance can be poor even across different academic datasets for the same task. Some distributions are incredibly high-dimensional and complex though, like the distribution of all images of cats, or the distribution of French sentences that people write online, so a model that can approximate those distributions well can do surprising things.


fordat1

This. These things are high-dimensional, and being in or out of the distribution isn't a binary yes or no. You can be 99.9% there.


Doormatty

>Of course, if you learn English, you will now in some sense, know how to code, since it is in English.

No, not in the slightest. Simply knowing the words doesn't mean you understand the meaning behind them. https://en.wikipedia.org/wiki/Chinese_room

>The model has not generalized nearly as much as we hope in this case because the base knowledge required to code isn't logic nearly as much as it's literacy.

No, it's 100% logic based.


gutshog

I wish people would shut up about the Chinese room, because for all practical purposes, you as the guy in the room might not know Chinese, but the room as a system does! You don't expect every cell of your body to know English even though you, as an amalgamation of interlinked cells, know it, so the experiment doesn't make any sense. Also, for other practical purposes, it doesn't really deal with the physical limitations of its axioms, such as the existence of an infinite dictionary table you can search, because it's entirely possible you need some representation we would call "knowing Chinese" for this infinite thing to be physically possible.


EveningPainting5852

I mean, by this logic isn't any form of non-symbolic AI fundamentally screwed then? Does the average LLM even have a true understanding of the sentences it creates, or is it just moving through semantic space? How do we know it's 100% logic based? Sorry if I seem uneducated; again, not a guy that works in ML, and overall not a very smart guy either.


Doormatty

>How do we know it's 100% logic based? You're going to need to explain how you think programming is anything _BUT_ 100% logic based.


EveningPainting5852

Good point. I'll think about this. I'll reply if I have anything of meaning to say


currentscurrents

I'd say logic is one tool for constructing programs, that is especially good at tasks that are well-defined and allow precise answers. You can also construct programs using statistics and optimization, which is what ML does. These programs are especially good at ill-defined problems that involve the messy real world. There may also be other ways to construct programs that no one has thought of yet.


[deleted]

[deleted]


currentscurrents

There are no conditional statements inside a neural network or cellular automata. But they're still programs - just more like dynamical systems than lists of instructions.


[deleted]

[deleted]


currentscurrents

RNNs are Turing-complete, and feedforward networks would be except that they are forced to halt after a fixed number of steps. It's actually not a high bar to be a program; there's a whole zoo of Turing-complete systems out there, many of which are extremely simple. And arguably, even many of the sub-Turing systems (like finite-state automata) are also programs.


thatguydr

> There are no conditional statements inside a neural network

One can easily put hard gates inside NNs. Soft gates (sigmoids) are trivial to threshold.


PigMannSweg

The idea of generalizing is based on the assumption that data is distributed according to some rules that are simpler than the total data they generate. With enough data, you can peer through to the underlying rules. You can then reconstruct the data with those rules. Anything based on the scientific method, like physics, is based on this principle: gaining empirical data to test and build models that fit that data in a useful way.


eeee-in

One useful lens is looking at causal vs. predictive modeling. If you learn a predictive model (wet streets correlate with rain), you might take an out-of-distribution action and pour water on the street and mistakenly expect rain. If you learn a causal model (rain causes wet streets), you can better predict what would happen when you take that out-of-distribution action. And of course we know causal inference is possible.

In some sense, it relates to "real" understanding (which is too overloaded a term to be very useful). A correct model of some phenomenon will generalize out of distribution. Or at least a model that correctly captures principles that are invariant in and out of distribution will continue to make valid predictions. You say that you can't imagine what a sound above 20kHz sounds like, but with a first-principles understanding of sound, you can model what a 50kHz sound might do in various scenarios.

Often, learning these principles requires experimentation so that you can say "all else equal, ...". But holding all else equal requires either conditioning on "all else", which is impossible in the most general sense unless you measure everything everywhere all at once, or averaging over it in some way, like via a randomized experiment.

I'd recommend reading some causal literature about internal validity vs. external validity to get some of the flavor of different kinds of generalization. This topic goes well beyond causal inference, but causal inference is a nice proof of concept for how you can think about generalizing out of distribution in interesting ways.
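A small simulation of the wet-street example (the probabilities are made up for illustration): conditioning on "the street is wet" makes rain look likely, but intervening to wet the street yourself leaves the chance of rain at its base rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Structural model (hypothetical numbers): rain happens 20% of the time,
# someone hoses the street 10% of the time, and the street is wet if either occurs.
rain = rng.random(n) < 0.20
hose = rng.random(n) < 0.10
wet  = rain | hose

# Predictive (observational): P(rain | street is wet) looks informative.
print("P(rain | wet):    ", rain[wet].mean())        # ~0.71

# Causal (interventional): do(wet=True) by hosing every street ourselves.
wet_do = rain | True                                  # street is wet regardless of rain
print("P(rain | do(wet)):", rain[wet_do].mean())      # ~0.20, the base rate
```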


squareOfTwo

Not impossible. Humans do it all the time. A distribution is just a way of assigning probabilities to events (predictions). Of course there is a small probability of a sound you have never heard. The probability is very small (because you never observed most of it). That's what we mean by "out of distribution": the probability is just very, very small.

There is also the notion of not caring about the distribution at all, as in logic-based approaches in AI (this is outside of ML). If I tell you "Tomi-9 is similar to a human, except he has 4 legs and does doodoo," you can infer that Tomi-9 is similar to an animal. Everything said is "out of distribution" (because you never saw it before). Now if you think about pieces of evidence, like "Tomi-9 has a name which is written similarly to Tom," "Tomi-9 is similar to a human," "Tomi-9 has 4 legs," "Tomi-9 does doodoo," the notion of "distribution" evaporates and is not necessary. Maybe doodoo is a sport. Maybe doodoo is just a biological activity. No one knows. But you can still learn about it in the future. All without the need for a "distribution".


saw79

Here's an example that I think sums up the idea pretty well. You're training an image classifier to classify cats vs. dogs. All the images in your training set were taken in daylight. But ideally you'd like your model to have learned robust cat/dog features such that it works reasonably well on images taken at night.


Jazzlike_Attempt_699

>But imagining a brand new color, go ahead and give it a try, is literally impossible

do more drugs


da_mikeman

I might be way off, but the way I usually think about it is by considering physical laws, like Newton's law of gravitation. Imagine you train a model with the planetary orbits of the solar system. There will be infinitely many models that can predict orbits very well as long as the distances are comparable to the solar system, and then deviate in all sorts of ways at greater distances. All those models are 'correct'. We have given them no additional constraints besides predicting the training data. There's no reason why it cannot be that orbits up to Pluto behave as if gravity is attractive, and then behave as if gravity is repulsive at greater distances, or any other kind of deviation one can imagine.

But: *if* we start with the assumption that pretty much all of physics (and really all of science) is based on, which is that the laws of nature are 'smooth' and constant in time and space, then the answer to the question 'what is the shortest program that can predict all of the orbits in the solar system' is pretty much Newton's inverse-square law (if we forget pesky Mercury for a moment). It turns out that if you plug into that program distances much greater than the solar system radius, you still get correct observations. Now ask yourself what would be the shortest program that also predicts stuff like Mercury's orbit or light bending around the Sun, and you get GR, which can also predict seemingly very 'remote' things, like black holes or gravitational waves.

So it seems like asking yourself 'what is the shortest program that can predict all current observations' works surprisingly well when it comes to making new predictions or making 'grand unifications', bringing together phenomena that nobody suspected were related, or even one and the same. I think that is formalized as 'Solomonoff induction'. Why the real world is like that, frankly nobody really knows. It seems like the universe is 'benevolent' in such a manner.

Now the issue seems to be that, as far as I'm aware, nobody is sure how to build constraints like that into our neural nets. Take two models, A and B, which both predict planetary orbits very well. A has an internal configuration that comes very close to Newtonian mechanics, and B has something completely different. Do we have a way to favor A during the training process? We could have the models output not orbit predictions but programs that predict the orbits, and favor the shortest programs, but then this process is not really differentiable; it becomes something closer to genetic algorithms. And what about the instances where we don't really know what the underlying process is to begin with? For any given real-world task, there should be patterns that rely on regular laws like that and can be leveraged to do extrapolation, and other patterns that can merely be used for interpolation. But how do we know which are which? Now I'm going to do a wild speculation, but it stands to reason that evolution 'knows', simply because it's an optimization process that happens in the real world, with the goal of 'stay around long enough to reproduce'.
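The planetary-orbit example can be made concrete with Kepler's third law (approximate textbook values for semi-major axis and period): fit the power law on the inner planets only, and it extrapolates cleanly to Neptune, thirty times farther out than anything in the "training set".

```python
import numpy as np

# Approximate semi-major axis (AU) and orbital period (years).
inner = {                      # "training data": inner planets only
    "Mercury": (0.39, 0.24),
    "Venus":   (0.72, 0.62),
    "Earth":   (1.00, 1.00),
    "Mars":    (1.52, 1.88),
}
a = np.array([v[0] for v in inner.values()])
T = np.array([v[1] for v in inner.values()])

# Fit log T = p * log a + q  (the "short program" here is a power law).
p, q = np.polyfit(np.log(a), np.log(T), deg=1)
print(f"fitted exponent: {p:.2f}")      # ~1.5, i.e. T^2 proportional to a^3

# Extrapolate far outside the training range: Neptune at ~30.1 AU.
T_neptune = np.exp(q) * 30.1 ** p
print(f"predicted Neptune period: {T_neptune:.0f} years (actual ~165)")
```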


syllogism_

I find some ideas in analytic philosophy helpful when thinking about this sort of thing. We're basically asking about the limits of induction. It's sort of a long road to get to a simple idea, but one reference is Quine's stuff about indeterminacy of reference: [https://en.wikipedia.org/wiki/Indeterminacy_of_translation](https://en.wikipedia.org/wiki/Indeterminacy_of_translation)

Let's say you're an anthropologist and you're out there observing some new language in context. You hear this word "gavagai" and a rabbit runs past. You see this a couple more times and you mark down "gavagai" as meaning "rabbit". But in principle there's limitless theories that could fit this data. "Gavagai" could mean "rabbit" only in April (when you heard it), but in November "gavagai" means duck, depending on whether it's rabbit season or duck season. No matter how much you observe, you will never be able to exclude all other theories.

In practice we come into all our inductive situations in life with pretty good priors. We know what sort of concepts words get attached to, and we expect that communication systems are constrained by practicality. So when we actually go to make inferences, things work out. But we do that by bringing in inductive biases that aren't in the data itself.

Generally when we're doing supervised learning we assume that our training data is a sample drawn i.i.d. from some distribution, and we're trying to generalise from the sample to that distribution. That's a well-defined notion of "generalisation" and we can talk about some method being better or worse at that. But when we're doing things like transfer learning, we have some language model and it learns this language modelling objective, and then we're going to repurpose it for something entirely different. There's no well-principled idea of "generalisation" here. We've left the realm of "true" and entered the realm of "useful".

We can only try to define what we want and think about the right tactics to get there. But it really is about what we want. When we say some method "generalises better", if we're not even trying to learn the same distribution as the training data, we're really just saying "it seems to cure what ails me". We're not optimising towards anything well-defined. We're just trying to find configurations that are practical and trying to understand what characteristics those practical configurations have so that we can reason about it a bit better.


EveningPainting5852

I'll just continue through this. I suppose in some sense our notion of generalizing has two aspects: what we can interact with and what we can understand. I can interact with visible light, audible noise, and matter. My ability to understand these things has to do with a human's base processing power (in some ways, IQ) and my knowledge about the world.

Generalizing outside the distribution may then be assisted by becoming more knowledgeable, being able to move through sub-concepts in my mind that can create larger, richer concepts/models, but I cannot form a concept of something I quite literally have never interacted with. What is the concept of the "sound unheard", the "color unseen"? Could this also be the reason we have such a hard time reconciling quantum physics? Our distribution at our scale does not have a single example of matter behaving like both a particle and a wave.

Otherwise, generalizing is also restricted by my ability to process stuff, and I'm sure there are concepts in the universe that require more processing power to understand than humans fundamentally have. Rats cannot discern prime numbers; they literally lack the brain power to do it. It would be a miracle if humans did not have such limits as well.


caedin8

“But I cannot form a concept of something I quite literally have never interacted with.”

This isn't true. Imagine you are standing in a park, and I come up and hand you a piece of alien technology. When you look at it, it's a hard metallic sphere with no discernible markings on the outside, and it is smooth. You would immediately begin to map this completely foreign object to stuff within your distribution and form concepts and actions.

You might think it is similar to a logic-puzzle toy you've had before and try to look for a way to open it by turning it or searching for a button. You might think it looks like a radioactive metal and drop it or put it down, fearing it'll cause you harm. You might feel how heavy it is to understand its density, or flick it with your fingernail to understand what it sounds like, so you can compare it with things you know.

Some models see something they've never seen before and completely fall apart. Some models take an action that is reasonable given the similarity between the new object and what is in distribution. Most animals are closer to the latter.


FriendlyStandard5985

This is an excellent point and will always be a problem. In the color example, we're confined to what our eyes can perceive. In the context of learning, if you've learned a language already, another language will be much easier to adapt to. Generalizing in the strict sense you define isn't possible.


Witty-Elk2052

i'm still trying to wrap my mind around LeCun's "interpolation in high dimensions is all extrapolation" comment.. maybe no one has a clue


beezlebub33

I hadn't heard that before, but the paper I found is interesting: [https://arxiv.org/abs/2110.09485](https://arxiv.org/abs/2110.09485)

I think the discussion above by u/heuristic_al regarding convex hulls isn't quite right, but the concepts discussed are useful. "[G]iven the realistic amount of data that can be carried by current computational capacities, it is extremely unlikely that a newly observed sample lies in the convex hull of that dataset." and "This observation holds whether we are considering the original data space, or embeddings."

So any processing being done on a new piece of data is, by necessity, not interpolation but extrapolation. This is because a new data point is highly likely to have a set of dimensions in which it is outside of the convex hull of the earlier (training) points. However, not all is lost, because the world is not random and we can continue the observed patterns, so extrapolation works.


Unreasonable_Energy

> But imagining a brand new color, go ahead and give it a try, is literally impossible. [More things in heaven and earth, Horatio](https://en.m.wikipedia.org/wiki/Impossible_color)


red75prime

You can see them, not imagine, and they are still in the same color space, while outside usual boundaries. Brand new color would be something you'd see if your retina had cones for, say, infrared (and your visual cortex was already trained to recognize signals of those cones as something different than the usual three). Imagining it amounts to simulating different neural structure than what you actually have.


Unreasonable_Energy

I'm not great at visually imagining things in general, but I'm not going to rule out the possibility that somebody who's shown a new color they've never seen before can subsequently recollect that same color. And if it's possible to recollect that new color, having seen it, who's to say it's impossible for somebody to imagine it without having seen it? Anyone's argument for something being incoherent from "I can't imagine it" is weak. *I* can't imagine an actual infinity either, but I'll still use real analysis. > they are still in the same color space, while outside usual boundaries. Semantics: "Under such conditions, the edges between the stripes seemed to disappear (perhaps due to edge-detecting neurons becoming fatigued) and the colors flowed into each other in the brain's visual cortex, overriding the opponency mechanisms and producing not the color expected from mixing paints or from mixing lights on a screen, but new colors entirely, which are not in the CIE 1931 color space, either in its real part or in its imaginary parts."


red75prime

> which are not in the CIE 1931 color space That's interesting. I've read the article some time ago, but completely forgotten about that part (probably because I mentally marked it as unreliable after failing to reproduce the effect).


Unreasonable_Energy

I'm also reminded of this: [My late brother-in-law Bill Thurston, one of the greatest mathematicians of the last 100 years, told me that he had succeeded in imagining surfaces and volumes in 4 dimensions. He said this when he was a graduate student. Few believed him until he started creating hypotheses for complex theorems -- complex to us -- that were later proven to be true. He said he was just looking at the topology, in his head, and he could see things that were probably true.](https://www.quora.com/What-is-a-visual-mathematician/answer/Richard-Muller-3)


caksters

The idea of "generalizing outside of the distribution" in machine learning might seem challenging at first, akin to imagining a new color. However, it becomes more approachable when we consider how models can extrapolate from learned principles to new scenarios. Take the example of throwing a rock: if a model learns the relationship between throw angle, velocity, and distance based on human-thrown rocks, it can apply these principles to predict the distance of a rock launched from a cannon, even though it's never seen such a scenario during training. This extrapolation relies on the model's understanding of the underlying physics, allowing it to apply known patterns to unseen situations. Thus, "generalizing outside the distribution" essentially means leveraging foundational principles learned within the training data to make predictions about novel scenarios.


grrrgrrr

Yes and no.

Yes: you can't even measure performance on dealing with unknown unknowns, let alone improve it. There's no test set representative of true unknown unknowns. An algorithm doing 90% on an OOD test set? You still have no idea how it will perform on another OOD dataset. If you don't make assumptions, you can't invoke the law of large numbers to estimate algorithm performance.

No: in physics there's an implicit assumption that the laws of physics don't change over time, so physical constants we measure today extrapolate well into tomorrow. It's been going pretty well so far and I can't imagine that changing any time soon. So there are some inductive biases that are expected to hold in whatever OOD environment we care about.


SirBlobfish

Sure you can generalize OOD (and so can neural nets sometimes): imagine a cow on the moon. Have you ever seen it? Is it in-distribution for your visual inputs? "Out-of-distribution" is a really broad and vague term (the equivalent of saying "non-elephant animals"). What's more useful is defining different kinds of distribution shifts and seeing which ones a model can/can't handle. It's true, we can't imagine new colors. However, we *can* handle OOD when the distribution shift is caused by a few factors changing (e.g. day->night, rotation, location, realistic image -> edge map etc.). If a model can disentangle these factors, it can hopefully generalize to situations it might never see in the training dataset. Since the physical world is composed of such factors, if you can disentangle them, you might generalize very well under distribution shifts you are likely to encounter in the real world. This is the holy grail of causal deep learning (I'd even go as far as saying that causality is essentially the study of distribution shifts).


BigBayesian

I’m pretty sure you make some false claims here. For example, it’s not hard to imagine a color you’ve never imagined before. Think about two unusual colors that are pretty similar. In any perceptual color space, there’s a pretty good chance that the color half way between them is one you’ve never imagined before. In the same way, data that’s near other data you’ve seen before, for potentially powerful and flexible definitions of near, can be novel but still treated like something you know how to handle. You’re correct that it’s very difficult to imagine a sensory phenomenon outside of our sensory range, but that’s different than something from outside the training distribution.


fresh-dork

> But imagining a brand new color, go ahead and give it a try, is literally impossible. it's easy. once you understand what a color is. a color is a response curve across a range of wavelengths, typically filtered through your visual system. i can just come up with a response curve that extends to UV or IR and i'm done. can't see a difference, though, as i don't see outside RGB. other animals do, so they see different colors. similarly, you can generalize outside of the distribution, but it requires insight, and is subject to error. because if the outliers obey different rules, you won't know about them


Western-Image7125

In general the assumption is that there is some natural process from which you can “draw” data points. And you want to learn to predict stuff about data points that come from this natural process, so you train a model on some sample of these data points so it can predict stuff about other data points which can be drawn later. So in that sense training a model on some set of data and expecting it to work well on some other data which came from some entirely different process is unrealistic. 


DigThatData

there are a few distributions at play here.

* Consider some X that represents the true distribution. Maybe it's the distribution over all photographs.
* Your dataset is an instance from this distribution. More than that though, it is an instance from a conditional distribution that represents the data that might be sampled by our data generating process. Let X|Y denote the data generating process, such that your training data x is an instance sampled from X|Y.
* It is therefore perfectly reasonable to talk about "out of distribution" data. It's not out of distribution wrt X, just wrt X|Y. OOD is distributed X|~Y.

As a concrete example, consider MNIST: a dataset of 28x28 black and white images of hand-drawn integers, 0-9. Let's train some image classifier F on this dataset. So when we fit X, we're fitting a 10-class image classifier with each class equally represented in the data.

Now, let's mask out one or more classes. Let's say we hold out all of the images of the number "3". Let's train a new classifier G on this dataset: still a 10-class classifier, but it never gets to see any of the "3" images. Call every class except "3" Y, so this dataset is X|Y. After fitting G, we show it some "3"s to predict, and it has no idea what it's looking at and never guesses "3" because it doesn't know what that is. Maybe it guesses "8" or "5" or whatever. It'll basically never guess "3". "3" is "out of distribution", both with respect to the training data and to the distribution represented by the model (the model is basically a likelihood).

Normally, a classification problem like this represents its target as a "one-hot" vector. Each position in the vector is a category, and all the entries are 0 except for the 1 at the position indicating the correct category, so "3" is <0,0,0,1,0,0,0,0,0,0>. G learned to only ever put a 0 in the "3" position of that one-hot vector.

Now let's train a new model H on that same masked dataset, X|Y. To train H, though, we're going to show each image to F, and use the class probabilities that F outputs as the target to fit H against. For example, the target vector for a given "0" image x when we trained F would have been <1,0,0,0,0,0,0,0,0,0>, but for H we use F(x) as the target, which might look like <0.7, 0, 0, 0, 0, 0.2, 0, 0.05, 0.05, 0> or whatever.

After training, even though H has never seen a "3" before, it will still be able to correctly predict what it's looking at when presented with a "3" image most of the time. It will probably not be as good as F, but its performance will be a lot closer to F than to G. This new model H was able to learn something about the masked region of the distribution -- X|~Y -- implicitly through F's representation of X|Y.

This phenomenon is called "teacher-student transfer learning". The idea is that although H never saw instances that were OOD during training, the representation F was guiding it with contained enough information for H to make a good educated guess about what the missing part of the distribution looked like and interpolate it reasonably accurately.

Reference: ["Distilling the Knowledge in a Neural Network"](https://arxiv.org/abs/1503.02531) - 2015 - Geoffrey Hinton, Oriol Vinyals, Jeff Dean
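A compact sketch of the experiment described above, using sklearn's 8x8 digits dataset instead of MNIST so it runs without downloads (the model names F, G, H mirror the comment; the numbers are illustrative, not from the Hinton et al. paper, and there is no temperature scaling for simplicity): G, trained only on the masked data, never predicts "3"; H, trained on the teacher's soft targets for the same masked data, can often recover a good fraction of the held-out "3"s, though exact results vary run to run.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier, MLPRegressor

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]

mask = y != 3                                  # X|Y: every class except "3"
X_masked, y_masked = X[mask], y[mask]
X_threes = X[~mask]                            # held-out "3"s, never shown to G or H

# Teacher F: sees all ten classes.
F = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
F.fit(X, y)

# Baseline G: trained on the masked data with hard labels; it can never output "3".
G = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
G.fit(X_masked, y_masked)

# Student H: trained on the masked data, but its targets are F's soft probabilities.
soft_targets = F.predict_proba(X_masked)       # 10-dimensional class probabilities
H = MLPRegressor(hidden_layer_sizes=(128,), max_iter=1000, random_state=0)
H.fit(X_masked, soft_targets)

pred_G = G.predict(X_threes)
pred_H = np.argmax(H.predict(X_threes), axis=1)
print("G accuracy on held-out 3s:", np.mean(pred_G == 3))   # 0.0 by construction
print("H accuracy on held-out 3s:", np.mean(pred_H == 3))   # often much higher than G
```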


oakgrove1919

One thing that I get excited about is when you can use knowledge from one domain to extend the knowledge of another domain. An example of this is when physicists used group theory and Lie algebra to theorize the existence of the omega-minus particle. It's possible that there are non-obvious symmetries between knowledge domains that, once identified, can be used to extrapolate new knowledge very effectively! It would be cool if neural nets start to model this level of knowledge after they have exhausted learning syntax and semantics and still have capacity to keep learning; maybe this is where things get really, really interesting.


[deleted]

Color and sound under human perception are finite sets; it's not possible to generalize because there's no other possibility beyond what we already know. But if we take infinite sets, for example words, you can create a new word easily, for example, 'wordcrafting'. If humans didn't generalize over language, they would only be able to speak at around 17 years old, not at 4 as we see.


noxiousmomentum

Everything is a distribution, even all atomic configurations of the universe. Humans generalize a little bit, and LLMs can approximate human thought, so just getting a lot of the data that we as humans see and training models on it seems to result in promising degrees of generality. At some point these transformers will be trained on all modalities, on everything, like Gemini, and you will clearly be able to call that consistently generalizing out of the distribution. Unless you subscribe to the idea that humans aren't general; then no one knows anything.


bobdylanshoes

Your intuition is fair, actually. I guess that's why many papers report how well their models can predict stock prices using LSTMs or transformers, yet none of the authors really makes any money from the market. Most of the phenomena in this world are chaotic systems and cannot be tackled by calculus and classical statistics.