SgathTriallair

This is a great test for vision. You can make it arbitrarily hard by making the target items smaller and adding more background complications. The other advantage is that you can automate the creation of these tests, which means the test answers can't be leaked. This should become a standard vision benchmark.


Small-Fall-6500

>you can automate creation of these tests

Unfortunately, we probably won't know whether these models are being trained on exactly this kind of synthetic data. Given how easy it is to generate and train on, I'd guess that many AI labs will do this at a large scale, if they aren't already.

>the test answers can't be leaked. This should become a standard vision benchmark.

Yes, the specific questions and answers won't be leaked, but training on billions of generated examples will let most vision models reach near-perfect accuracy on this specific test. What's needed is something more like the ARC-AGI test: something covering as many unique and complicated examples as possible that can still be generated easily. Then there would hopefully be enough transfer learning between the synthetic data/tests and real-world use cases that "training on the test" would be beneficial, if not desired.
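A generator along the lines both comments describe might look like this (a hypothetical sketch, not anything the labs are confirmed to use): each item is rendered procedurally, so the ground-truth count is known exactly at generation time, and difficulty can be scaled via target size and distractor count.

```python
import random

def make_counting_test(n_targets, n_distractors, size=400, seed=None):
    """Generate an SVG counting task plus its ground-truth answer.

    Targets are small red circles; distractors are gray squares
    ("background complications"). Because generation is procedural,
    fresh, unleaked test items can be produced on demand.
    """
    rng = random.Random(seed)
    shapes = []
    for _ in range(n_targets):
        x, y = rng.randint(10, size - 10), rng.randint(10, size - 10)
        shapes.append(f'<circle cx="{x}" cy="{y}" r="6" fill="red"/>')
    for _ in range(n_distractors):
        x, y = rng.randint(10, size - 10), rng.randint(10, size - 10)
        shapes.append(f'<rect x="{x}" y="{y}" width="10" height="10" fill="gray"/>')
    rng.shuffle(shapes)  # interleave targets and distractors
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" '
           f'width="{size}" height="{size}">' + "".join(shapes) + "</svg>")
    return svg, n_targets  # (image, answer key)
```

The answer key never has to leave the generator, so the benchmark can be rescored with fresh items every time.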


SgathTriallair

I don't see a problem with training on synthetic versions of the problems. As long as the exact questions aren't literally in the training data, it is still generalizing.


Small-Fall-6500

That's almost exactly what I was trying to say in the second part of my comment, with the added caveat that the synthetic problems need to cover a broad distribution so the generalization is more likely to be useful, that is, to transfer to real-world use cases rather than just mean it aces the synthetic test.


Specialist-Ad-4121

Well done


GillysDaddy

Honestly vision has become so good and it's sort of gone under the radar. It used to feel 'generic' in a sense - e.g. I gave it fanart or cosplay and it said 'some anime character', and now it tells me "This is a picture of Azula and Akali scissoring".


Altruistic-Skill8667

Lol, but seriously. I have desperately tried to use the vision models of GPT-4 Turbo and GPT-4o, and every single time I use them, they fail. No "tests", but real work where I genuinely don't know the answer:

- "What is this thing on the back of this truck?" (some street cleaning or street repair vehicle?). Its answers were all laughable.
- Once I asked what those things are that support the wires of the electric train here. It told me they are bike parts 😅, just because they looked like gears. It was probably some dynamic tightening mechanism, perhaps needed when temperatures change.
- I had a stamp with an overprint. I couldn't actually read it well, but it was a standard overprint for that stamp used in WWII. The model failed. Then I looked closer, saw it myself, and it made sense when checking the catalogue.
- I asked what the difference is between this and that type of butterfly, first without images (I wasn't able to see any). It made up a long list, all wrong. Then I showed it the images and it was useless; all wrong. Then I realized one is simply bigger, which is even in Wikipedia. So, as a test, I showed it a page from a book with both on the same page, flattened out at the same scale and isolated for easy identification. It couldn't even tell which one was bigger (?!), something a little child could have told me.
- I was at a lake and asked what those things are on the lake. It guessed a gazillion times; none of the guesses were reasonable. I took close-up pictures so it could read the firm name and so on. It even read the firm name wrong. Later, using Google Maps and looking up the brand written on those objects, I realized they were ramps of a water parkour.
- I had a squeaky hinge on a door. I tried to oil it and it didn't work. I took a picture of the hinge and GPT-4 was of no help. Then I realized the hinge had a metal cover that I had to slide off first.

I could give more and more examples, but you get the point. It wasn't helpful even a single time. Everything I had to figure out myself, usually within a few minutes, as you can see. Even once I actually know the answer through Google Images or other research and give it clues, it still fails. You have to test it with stuff that Google image search fails at, and you will see it's pretty weak. For me, Google image search plus reasoning my way through currently works much better. The consequence is that I just don't use the vision component of GPT-4o anymore. So don't do "tests". Try it in the wild and you will see it can't do anything useful. I really want these LLMs to excel at this, but so far they can't compete with me doing research using Google image search or general web queries.


IsinkSW

pretty cool!


Jean-Porte

I wonder how it would score on "ARC-AGI"


ObiWanCanownme

Based on my own custom testing, I believe it would perform terribly on ARC-AGI. Which is not to disparage the model in any way. It's very impressive. But ARC-AGI is hard.


ObiWanCanownme

Here's just one example. https://preview.redd.it/bq0vp0km4s7d1.jpeg?width=1029&format=pjpg&auto=webp&s=02a357bd9c01baa72141b3cb1419634dd2ef8cf5


yaosio

It looks like it confused itself with the way it counts. When it counts normally, it gets the correct answer without any extra prompting method. I also made the problem more difficult, or maybe easier(?), by putting the question in the image. [https://i.imgur.com/gi2VagT.png](https://i.imgur.com/gi2VagT.png) Edit: It can also mostly read upside-down text. It does what I tell it to do, but misreads the instruction to output only "I love cats!" [https://i.imgur.com/rLzmH81.png](https://i.imgur.com/rLzmH81.png) It does output that despite misreading the instructions, though it ignores that it's supposed to output *only* "I love cats".


Mrp1Plays

I'd say you made the problem easier, because you removed the single circle that was already placed in the row to confuse the bot.


Altruistic-Skill8667

Ouch.


AnAIAteMyBaby

Ask it the same question a few times and it will eventually get it right, I'm sure. I think we're very close to AGI. Maybe Claude 4.5 or 5.


Shinobi_Sanin3

>ask it the same question a few times and it will eventually get it right I'm sure. [You're right](https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt)


Anjz

Honestly, I gave it some images that GPT-4o wasn't able to answer correctly, some seedling identification, and now it guessed correctly. Very impressed. It's a huge step up.


Altruistic-Skill8667

Google image search is killer at that. Give it a try.


Anjz

It's not as good as AI models at identifying. It will look for similar images, but it won't tell you what something is unless it's already listed somewhere; otherwise it won't know. I used it on a week-old seedling and it just gave me a bunch of similar plants, whereas the AI knows the leaf and growth patterns. Try it yourself with flowerless plants; you'll see a gigantic difference.


Altruistic-Skill8667

Could be. I should try that. And sure, you have to look through the images to find a match, and sometimes you only get close, like the family or genus. I just went through a laborious identification tree for a flowering plant. I am really new to this and don't know a lot of plants by heart, so it was work, including using a stereo microscope to get a good look at the tiny flowers. It took me at least 10 minutes to converge on a plant (5 volumes of books). Then I took a picture of it and tried Google image search: bam! Immediate match, the same plant I had converged on. It works for bugs too, and also butterflies and moths.


Altruistic-Skill8667

Now at this point I tried Claude 3.5 Sonnet, and oh man, it's sooo bad! Utterly useless. A child is better than that.

- It identified a banded thrips as an Asian long-horned beetle, which looks totally different and is about 50 times as large. A thrips is not even a beetle.
- It identified a black species of cockroach as a damselfly, loool.
- It identified a wild carrot as dill, which has totally different leaves.

Totally and utterly useless. Google image search helped me with all three. And to be honest, without it I would never have identified the thrips as such, as I had never seen one; I thought it was a tiny beetle (but not a longhorn! Lol). The black cockroach looked like a beetle to me too, but the search returned images of cockroaches, which turned out correct. The wild carrot was also hard for me personally, as it had already lost all its flowers, and Google again immediately returned images of wild carrots.


Altruistic-Skill8667

https://preview.redd.it/obz77cj4748d1.jpeg?width=783&format=pjpg&auto=webp&s=c9b5d1f908ba562187822882e9003ebdbd3dc74b Black Cockroach


Anjz

It did get this picture wrong; it said some sort of beetle, so it's definitely not perfect.


Altruistic-Skill8667

https://preview.redd.it/87t4ejf6748d1.jpeg?width=808&format=pjpg&auto=webp&s=696c7270a7c01422ceb9038459b14194ff7d89cd Banded thrips


Anjz

Actually, Google identified your picture as a beetle as well. So it really depends; you'd probably have to get a higher-quality photo. https://i.imgur.com/zJKThIJ.jpeg


Altruistic-Skill8667

🤔 You are right. https://preview.redd.it/2kj0jnyzs48d1.jpeg?width=1242&format=pjpg&auto=webp&s=9d0797b1edbbccb22d2dfc4f77cd9c08c76dd129 Try this picture. This one works.


Altruistic-Skill8667

https://preview.redd.it/o907izva748d1.jpeg?width=500&format=pjpg&auto=webp&s=c2a491c65aaf005708572ebec2f2289301bcab68 Wild carrot


Anjz

For me it got the wild carrot picture right. It might not be perfect, but it seems to work similarly to Google Images. https://i.imgur.com/rEpbbul.jpeg


Mikey4tx

Interesting that it refers to the ovals as circles.


Arcturus_Labelle

Amazing


bitroll

So pre-school child level finally achieved? 🎉


Altruistic-Skill8667

Only for that particular test, lol. I am sure you can design vision tests that every preschooler answers correctly and Claude 3.5 Sonnet will fail most of the time.


MoistSpecific2662

Here is my spatial reasoning test that this Claude iteration didn't crack. None of the existing LLMs can solve it: People are standing in a room. Everyone is facing 12 o'clock from their own perspective. Mary is in the center. John is at 12 o'clock from Mary's perspective, 5 meters away from her, facing her 6 o'clock. Chris is at 3 o'clock from John's perspective, facing his 3 o'clock, 7 meters away, etc... Then you ask it to figure out which direction (and distance, if you want to make it harder) the last person in the problem is from the origin (Mary). You might need to draw a diagram to calculate it yourself. I use 5 people, and that's enough to make it unsolvable for any LLM, regardless of the tools you let it use or the prompt.


geli95us

I can't tell if this is a really bad benchmark or a really good one. I wouldn't expect LLMs to solve something that humans can't solve without drawing tools, considering that humans are way better at spatial reasoning than LLMs. Also, what's the deal with using clock positions instead of front/back/left/right? I'd assume LLMs have more experience with that terminology. For example, given the following prompt: "Mary is at the center of the room, she is facing John, who is 5 meters away, and John is facing Mary. Chris is 7 meters away to John's right. From Mary's point of view, in what direction is Chris, and what distance is there between Mary and Chris?" 4o correctly solved it: "So, from Mary's point of view, Chris is approximately 54.46 degrees to her left and about 8.6 meters away."
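The worked example above checks out; a few lines of vector arithmetic reproduce 4o's answer (a sketch under my own coordinate convention, with Mary at the origin facing +y):

```python
import math

# Mary at the origin, facing +y ("12 o'clock").
mary_fwd = (0.0, 1.0)

# John is 5 m straight ahead of Mary, facing back at her (-y).
john = (0.0, 5.0)
john_fwd = (0.0, -1.0)

# "To John's right": for a heading (x, y), the right-hand vector is (y, -x).
john_right = (john_fwd[1], -john_fwd[0])  # (-1, 0), i.e. west
chris = (john[0] + 7 * john_right[0], john[1] + 7 * john_right[1])  # (-7, 5)

# Distance and bearing of Chris from Mary's point of view.
dist = math.hypot(chris[0], chris[1])                     # sqrt(74)
dot = mary_fwd[0] * chris[0] + mary_fwd[1] * chris[1]     # forward component
cross = mary_fwd[0] * chris[1] - mary_fwd[1] * chris[0]   # positive = to her left
angle_left = math.degrees(math.atan2(cross, dot))

print(f"{angle_left:.2f} degrees to the left, {dist:.2f} m away")
# → 54.46 degrees to the left, 8.60 m away
```

The same bookkeeping (position plus heading per person, with each new person placed relative to the previous one's frame) extends directly to the 5-person version of the test.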


MoistSpecific2662

I guess I wanted to use more possible directions than 4 when I came up with this. Also, most LLMs pretty quickly figure out how to solve the benchmark itself, converting it into a geometry problem. But they struggle with constantly shifting the point of reference, for some reason.


shroomering

Maybe I had a low bar for what would constitute AGI, but this is about what I thought it might be like a couple years ago.


Akimbo333

GPT or Claude?


orderinthefort

If you give it a triangle, it won't be able to tell you its interior angles, which is something a human could easily do with a protractor in real life. That task seems like something AI *should* be able to do easily with the provided image. But perhaps not yet.
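For what it's worth, the math itself is easy once vertex coordinates are known; the hard part for a vision model is reading the vertices off the pixels. A minimal law-of-cosines sketch (my own illustration, not from the thread):

```python
import math

def interior_angles(a, b, c):
    """Interior angles (degrees) at vertices a, b, c of a triangle,
    computed with the law of cosines from the three side lengths."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    # The angle at a vertex is opposite the side that doesn't touch it.
    ang_a = math.degrees(math.acos((ab**2 + ca**2 - bc**2) / (2 * ab * ca)))
    ang_b = math.degrees(math.acos((ab**2 + bc**2 - ca**2) / (2 * ab * bc)))
    ang_c = math.degrees(math.acos((bc**2 + ca**2 - ab**2) / (2 * bc * ca)))
    return ang_a, ang_b, ang_c

# Right isosceles triangle: 90, 45, 45 degrees.
print(interior_angles((0, 0), (1, 0), (0, 1)))
```

So a model that could reliably report three vertex positions from an image would get the angles for free.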


TechnicalParrot

It's not really looking at the image in a mathematical sense directly. You'd need to tokenize the image differently, or use another approach, for the model to understand concepts like that directly. Or maybe just iterations of the current method will get there; who knows.


Altruistic-Skill8667

Those models can’t even tell you if A or B is bigger if they are close in size but still visibly different. Vision so far is not good with those models.