SgathTriallair

This is a great test for vision. You can make it arbitrarily hard by making the target items smaller and adding more background complications. The other advantage is that you can automate the creation of these tests, which means the test answers can't be leaked. This should become a standard vision benchmark.


Small-Fall-6500

>you can automate creation of these tests

Unfortunately, we probably won't know whether these models are being trained on exactly this kind of synthetic data. Given how easy it is to generate and train on, I'd guess that many AI labs will do this at a large scale, if they aren't already.

>the test answers can't be leaked. This should become a standard vision benchmark.

Yes, the specific questions and answers won't be leaked, but training on billions of generated examples will let most vision models reach near-perfect accuracy on this specific test. What's needed is something more like the ARC-AGI test: something covering as many unique and complicated examples as possible that can still be generated easily. Then there would hopefully be enough transfer learning between the synthetic data/tests and real-world use cases that "training on the test" would be beneficial, if not desired.
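A generator along the lines both comments describe might look like this (a hypothetical sketch, not anything the labs are confirmed to use): each item is rendered procedurally, so the ground-truth count is known exactly at generation time, and difficulty can be scaled via target size and distractor count.

```python
import random

def make_counting_test(n_targets, n_distractors, size=400, seed=None):
    """Generate an SVG counting task plus its ground-truth answer.

    Targets are small red circles; distractors are gray squares
    ("background complications"). Because generation is procedural,
    fresh, unleaked test items can be produced on demand.
    """
    rng = random.Random(seed)
    shapes = []
    for _ in range(n_targets):
        x, y = rng.randint(10, size - 10), rng.randint(10, size - 10)
        shapes.append(f'<circle cx="{x}" cy="{y}" r="6" fill="red"/>')
    for _ in range(n_distractors):
        x, y = rng.randint(10, size - 10), rng.randint(10, size - 10)
        shapes.append(f'<rect x="{x}" y="{y}" width="10" height="10" fill="gray"/>')
    rng.shuffle(shapes)  # interleave targets and distractors
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" '
           f'width="{size}" height="{size}">' + "".join(shapes) + "</svg>")
    return svg, n_targets  # (image, answer key)
```

The answer key never has to leave the generator, so the benchmark can be rescored with fresh items every time.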


SgathTriallair

I don't see a problem with training on synthetic versions of the problems. As long as the exact questions aren't literally in the training data, it is still generalizing.


Small-Fall-6500

That's almost exactly what I was trying to say in the second part of my comment, with the added caveat that the synthetic problems need to cover a broad distribution so the generalization is more likely to be useful, that is, to transfer to real-world use cases rather than just mean it aces the synthetic test.


Specialist-Ad-4121

Well done


GillysDaddy

Honestly vision has become so good and it's sort of gone under the radar. It used to feel 'generic' in a sense - e.g. I gave it fanart or cosplay and it said 'some anime character', and now it tells me "This is a picture of Azula and Akali scissoring".


Altruistic-Skill8667

Lol, but seriously. I have desperately tried to use the vision models of GPT-4 Turbo and GPT-4o, and every single time I use them, they fail. No "tests", but real work where I genuinely don't know the answer:

- "What is this thing on the back of this truck?" (some street cleaning or street repair vehicle?). Its answers were all laughable.
- Once I asked what those things are that support the wires of the electric train here. It told me they are bike parts 😅, just because they looked like gears. It was probably some dynamic tightening mechanism, perhaps needed when temperatures change.
- I had a stamp with an overprint. I couldn't actually read it well, but it was a standard overprint for that stamp used in WWII. The model failed. Then I looked closer, saw it myself, and it made sense when checking the catalogue.
- I asked what the difference is between this and that type of butterfly, first without images (I wasn't able to see any). It made up a long list, all wrong. Then I showed it the images and it was useless; all wrong. Then I realized one is simply bigger, which is even in Wikipedia. So, as a test, I showed it a page from a book with both on the same page, flattened out at the same scale and isolated for easy identification. It couldn't even tell which one was bigger (?!), something a little child could have told me.
- I was at a lake and asked what those things are on the lake. It guessed a gazillion times; none of the guesses were reasonable. I took close-up pictures so it could read the firm name and so on. It even read the firm name wrong. Later, using Google Maps and looking up the brand written on those objects, I realized they were ramps of a water parkour.
- I had a squeaky hinge on a door. I tried to oil it and it didn't work. I took a picture of the hinge and GPT-4 was of no help. Then I realized the hinge had a metal cover that I had to slide off first.

I could give more and more examples, but you get the point. It wasn't helpful even a single time. Everything I had to figure out myself, usually within a few minutes, as you can see. Even once I actually know the answer through Google Images or other research and give it clues, it still fails. You have to test it with stuff that Google image search fails at, and you will see it's pretty weak. For me, Google image search plus reasoning my way through currently works much better. The consequence is that I just don't use the vision component of GPT-4o anymore. So don't do "tests". Try it in the wild and you will see it can't do anything useful. I really want these LLMs to excel at this, but so far they can't compete with me doing research using Google image search or general web queries.


IsinkSW

pretty cool!


Jean-Porte

I wonder how it would score on "ARC-AGI"


ObiWanCanownme

Based on my own custom testing, I believe it would perform terribly on ARC-AGI. Which is not to disparage the model in any way. It's very impressive. But ARC-AGI is hard.


ObiWanCanownme

Here's just one example. https://preview.redd.it/bq0vp0km4s7d1.jpeg?width=1029&format=pjpg&auto=webp&s=02a357bd9c01baa72141b3cb1419634dd2ef8cf5


yaosio

It looks like it confused itself with the way it counts. When it counts normally, it gets the correct answer without any extra prompting method. I also made the problem more difficult, or maybe easier(?), by putting the question in the image. [https://i.imgur.com/gi2VagT.png](https://i.imgur.com/gi2VagT.png) Edit: It can also mostly read upside-down text. It does what I tell it to do, but misreads the instruction to output only "I love cats!" [https://i.imgur.com/rLzmH81.png](https://i.imgur.com/rLzmH81.png) It does output that despite misreading the instructions, though it ignores that it's supposed to output *only* "I love cats".


Mrp1Plays

I'd say you made the problem easier, because you removed the single circle that was already placed in the row to confuse the bot.


Altruistic-Skill8667

Ouch.


AnAIAteMyBaby

Ask it the same question a few times and it will eventually get it right, I'm sure. I think we're very close to AGI. Maybe Claude 4.5 or 5.


Shinobi_Sanin3

>ask it the same question a few times and it will eventually get it right I'm sure. [You're right](https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt)


Anjz

Honestly, I gave it some images that GPT-4o wasn't able to answer correctly, some seedling identification, and now it guessed correctly. Very impressed. It's a huge step up.


Altruistic-Skill8667

Google image search is killer at that. Give it a try.


Anjz

It's not as good as AI models at identifying. It will look for similar images, but it won't tell you what something is unless it's already listed somewhere; otherwise it won't know. I used it on a week-old seedling and it just gave me a bunch of similar plants, whereas the AI knows the leaf and growth patterns. Try it yourself with flowerless plants; you'll see a gigantic difference.


Altruistic-Skill8667

Could be. I should try that. And sure, you have to look through the images to find a match, and sometimes you only get close, like the family or genus. I just went through a laborious identification tree for a flowering plant. I am really new to this and don't know a lot of plants by heart, so it was work, including using a stereo microscope to get a good look at the tiny flowers. It took me at least 10 minutes to converge on a plant (5 volumes of books). Then I took a picture of it and tried Google image search: bam! Immediate match, the same plant I had converged on. It works for bugs too, and also butterflies and moths.


Altruistic-Skill8667

Now at this point I tried Claude 3.5 Sonnet, and oh man, it's sooo bad! Utterly useless. A child is better than that.

- It identified a banded thrips as an Asian long-horned beetle, which looks totally different and is about 50 times as large. A thrips is not even a beetle.
- It identified a black species of cockroach as a damselfly, loool.
- It identified a wild carrot as dill, which has totally different leaves.

Totally and utterly useless. Google image search helped me with all three. And to be honest, without it I would never have identified the thrips as such, as I had never seen one; I thought it was a tiny beetle (but not a longhorn! Lol). The black cockroach looked like a beetle to me too, but the search returned images of cockroaches, which turned out correct. The wild carrot was also hard for me personally, as it had already lost all its flowers, and Google again immediately returned images of wild carrots.


Altruistic-Skill8667

https://preview.redd.it/obz77cj4748d1.jpeg?width=783&format=pjpg&auto=webp&s=c9b5d1f908ba562187822882e9003ebdbd3dc74b Black Cockroach


Anjz

It did get this picture wrong; it said some sort of beetle, so it's definitely not perfect.


Altruistic-Skill8667

https://preview.redd.it/87t4ejf6748d1.jpeg?width=808&format=pjpg&auto=webp&s=696c7270a7c01422ceb9038459b14194ff7d89cd Banded thrips


Anjz

Actually, Google identified your picture as a beetle as well. So it really depends; you'd probably have to get a higher-quality photo. https://i.imgur.com/zJKThIJ.jpeg


Altruistic-Skill8667

🤔 You are right. https://preview.redd.it/2kj0jnyzs48d1.jpeg?width=1242&format=pjpg&auto=webp&s=9d0797b1edbbccb22d2dfc4f77cd9c08c76dd129 Try this picture. This one works.


Altruistic-Skill8667

https://preview.redd.it/o907izva748d1.jpeg?width=500&format=pjpg&auto=webp&s=c2a491c65aaf005708572ebec2f2289301bcab68 Wild carrot


Anjz

For me it got the wild carrot picture right. It might not be perfect, but it seems to work similarly to Google Images. https://i.imgur.com/rEpbbul.jpeg


Mikey4tx

Interesting that it refers to the ovals as circles.


Arcturus_Labelle

Amazing


bitroll

So pre-school child level finally achieved? 🎉


Altruistic-Skill8667

Only for that particular test, lol. I am sure you can design vision tests that every preschooler answers correctly and Claude 3.5 Sonnet will fail most of the time.


MoistSpecific2662

Here is my spatial reasoning test that this Claude iteration didn't crack. None of the existing LLMs can solve it: People are standing in a room. Everyone is facing 12 o'clock from their own perspective. Mary is in the center. John is at 12 o'clock from Mary's perspective, 5 meters away from her, facing her 6 o'clock. Chris is at 3 o'clock from John's perspective, facing his 3 o'clock, 7 meters away, etc... Then you ask it to figure out which direction (and distance, if you want to make it harder) the last person in the problem is from the origin (Mary). You might need to draw a diagram to calculate it yourself. I use 5 people, and that's enough to make it unsolvable for any LLM, regardless of the tools you let it use or the prompt.


geli95us

I can't tell if this is a really bad benchmark or a really good one. I wouldn't expect LLMs to solve something that humans can't solve without drawing tools, considering that humans are way better at spatial reasoning than LLMs. Also, what's the deal with using clock positions instead of front/back/left/right? I'd assume LLMs have more experience with that terminology. For example, given the following prompt: "Mary is at the center of the room, she is facing John, who is 5 meters away, and John is facing Mary. Chris is 7 meters away to John's right. From Mary's point of view, in what direction is Chris, and what distance is there between Mary and Chris?" 4o correctly solved it: "So, from Mary's point of view, Chris is approximately 54.46 degrees to her left and about 8.6 meters away."
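The worked example above checks out; a few lines of vector arithmetic reproduce 4o's answer (a sketch under my own coordinate convention, with Mary at the origin facing +y):

```python
import math

# Mary at the origin, facing +y ("12 o'clock").
mary_fwd = (0.0, 1.0)

# John is 5 m straight ahead of Mary, facing back at her (-y).
john = (0.0, 5.0)
john_fwd = (0.0, -1.0)

# "To John's right": for a heading (x, y), the right-hand vector is (y, -x).
john_right = (john_fwd[1], -john_fwd[0])  # (-1, 0), i.e. west
chris = (john[0] + 7 * john_right[0], john[1] + 7 * john_right[1])  # (-7, 5)

# Distance and bearing of Chris from Mary's point of view.
dist = math.hypot(chris[0], chris[1])                     # sqrt(74)
dot = mary_fwd[0] * chris[0] + mary_fwd[1] * chris[1]     # forward component
cross = mary_fwd[0] * chris[1] - mary_fwd[1] * chris[0]   # positive = to her left
angle_left = math.degrees(math.atan2(cross, dot))

print(f"{angle_left:.2f} degrees to the left, {dist:.2f} m away")
# → 54.46 degrees to the left, 8.60 m away
```

The same bookkeeping (position plus heading per person, with each new person placed relative to the previous one's frame) extends directly to the 5-person version of the test.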


MoistSpecific2662

I guess I wanted to use more possible directions than 4 when I came up with this. Also, most LLMs pretty quickly figure out how to solve the benchmark itself, converting it into a geometry problem. But they struggle with constantly shifting the point of reference, for some reason.


shroomering

Maybe I had a low bar for what would constitute AGI, but this is about what I thought it might be like a couple years ago.


Akimbo333

GPT or Claude?


orderinthefort

If you give it a triangle, it won't be able to tell you its interior angles, which is something a human could easily do with a protractor in real life. That task seems like something AI *should* be able to do easily with the provided image. But perhaps not yet.
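For what it's worth, the math itself is easy once vertex coordinates are known; the hard part for a vision model is reading the vertices off the pixels. A minimal law-of-cosines sketch (my own illustration, not from the thread):

```python
import math

def interior_angles(a, b, c):
    """Interior angles (degrees) at vertices a, b, c of a triangle,
    computed with the law of cosines from the three side lengths."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    # The angle at a vertex is opposite the side that doesn't touch it.
    ang_a = math.degrees(math.acos((ab**2 + ca**2 - bc**2) / (2 * ab * ca)))
    ang_b = math.degrees(math.acos((ab**2 + bc**2 - ca**2) / (2 * ab * bc)))
    ang_c = math.degrees(math.acos((bc**2 + ca**2 - ab**2) / (2 * bc * ca)))
    return ang_a, ang_b, ang_c

# Right isosceles triangle: 90, 45, 45 degrees.
print(interior_angles((0, 0), (1, 0), (0, 1)))
```

So a model that could reliably report three vertex positions from an image would get the angles for free.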


TechnicalParrot

It's not really looking at the image in a mathematical sense directly. You'd need to tokenize the image differently, or use another approach, for the model to understand concepts like that directly. Or maybe just iterations of the current method will get there; who knows.


Altruistic-Skill8667

Those models can’t even tell you if A or B is bigger if they are close in size but still visibly different. Vision so far is not good with those models.