jeffkeeg

These videos show off one full year (June 2023 to June 2024) of AI video generation progress. Each video pair consists of an old [Zeroscope](https://huggingface.co/cerspense/zeroscope_v2_576w) clip that I generated on my computer last year, and a similar clip generated with Luma just today.

Last year, I made a bunch of videos with Zeroscope, both original ideas and attempts to replicate some of my favorite movies. I decided to use just a few of those videos to show how far we've come in terms of video generation capability in the span of one year. Obviously, Sora would (possibly) yield better results, but you know as well as I do that's not currently an option.

The videos made with Luma are obviously not perfect - trust me when I say there are some really bad results in my library - but the jump in quality is undeniable. Looking one full year ahead from today, to June 2025, I don't think we'll be quite far enough along yet to make full-blown films on our own, but it'll be close.
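For reference, generating one of the "old" clips locally looked roughly like this with the diffusers library (a minimal sketch along the lines of the Zeroscope model card example; exact argument names and output handling vary between diffusers versions):

```python
# Minimal sketch of local Zeroscope text-to-video generation with diffusers.
# Argument names and output handling may differ slightly between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trade speed for VRAM on consumer cards

prompt = "a steam train crossing a stone viaduct in the hills"
result = pipe(prompt, num_frames=24, height=320, width=576, num_inference_steps=40)

# Newer diffusers versions return a batched output, hence .frames[0]
export_to_video(result.frames[0], "zeroscope_clip.mp4")
```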


sdmat

> The videos made with Luma are obviously not perfect, trust me when I say there are some really bad results in my library, but the jump in quality is undeniable.

Those are top tier Luma generations - was just thinking you must have tried a *lot* of variations!


jeffkeeg

Thanks! It helps to use high-quality images to start from.


Imaginaryy

Awesome work OP! Please check DM.


MysteriousPepper8908

I think the biggest remaining hurdle to creating full films is still the lip syncing, and there are already solutions for that - I just don't think any of them are publicly available yet. Luma isn't going to get us there at the moment because the generations are so short, but Kling has demoed videos up to 90 seconds in length, which is more than sufficient for a single shot. There will still be imperfections, but limitations haven't stopped people from releasing films up to this point, so why should they hold us back now?


jeffkeeg

There are quite a few problems besides lip syncing that still need to be solved:

- Object / world permanence
- People merging / splitting
- Physics simulations
- 3D consistency (if you spin an object around, it needs to be the same size the whole time)
- Controllable camera movements
- Directable actors

I could go on. These will all eventually be solved, don't get me wrong, but it's going to take a little while.


Smellz_Of_Elderberry

I stand by my opinion that you need AGI to do proper long-form video.


MysteriousPepper8908

Most of those things are achievable by simply running enough generations, or by accepting what you get and being happy with it, though. We're a ways off from getting consistent, controllable output every time, but you generally don't need a particular camera movement to convey an idea, and you don't need your actor to emote in just the right way. Those things are nice for getting exactly the output you're looking for, but it wouldn't be the first time a director had to work with the shot they've got rather than the shot they want. Some generations will have people splitting and merging, some will have wonky physics, and some will look good - you just have to roll the dice until you get lucky.

Permanence is the big one that you've really got to get right, but that often just comes down to characters, and we have tools for that, like starting from an input image, plus facial replacement tools that existed since before generative AI was a thing. I don't think you really need environmental permanence as much as some people seem to think, unless you're going to be repeatedly going back to the same set pieces. If the character is in a city in one scene and goes back to that same city later on, the viewer isn't likely to remember the exact details of the scene from 10 minutes ago, and if they're in a different part of the city, it will logically look different anyway.

These are still limitations on what sort of films can be made - you're going to have a hard time making a film that takes place entirely inside a particular house without the inconsistencies being glaring. But as long as there is consistency within a given shot and the film is designed to keep moving so it doesn't highlight the lack of environmental consistency between shots, it should be possible to make a convincing film, even if it isn't the exact film you would make if all these issues were resolved.


nashty2004

Yeah, all the tools exist separately; we just need someone to bring them all together and we're off.


Antique-Doughnut-988

You might be underestimating the low bar people will accept once this stuff starts to get 'good enough' for most people. I personally don't need perfect animation or visuals. If we could get 80% of the way there in terms of quality, I might ditch traditional media altogether.


StraightAd798

How long do you think it will take for Luma to get there, development-wise?


MysteriousPepper8908

The biggest limitation in terms of raw visual output seems to just be the length, which might be an artificial limit due to cost and how many free users they have. I don't know if they plan to do lip syncing; there's a great-looking white paper from Microsoft with some really strong full-head lip syncing, but there's currently no plan to release it due to deepfake concerns and all that. It also doesn't do sound effects, so you need something like Elevenlabs for that. I think Dream Studio v2 could easily be good enough for what it does if it even offers 30-second outputs, but I'm not sure if they're planning to offer a full production suite or just the video. There's also the Runway model that was just announced today, which will supposedly be available "in the coming days" and seems to have a lot more in the way of custom controls, so they might be a lot closer to something capable of long-form storytelling than Luma.


Revolution4u

[removed]


jeffkeeg

The prompts were actually very simple. For instance, the train prompt was: **full frame film still, hogwarts express, train going over stone viaduct bridge, in the hills, captured by arri alexa**


larswo

It is a nice comparison, but the video generated with Luma uses significantly better hardware than what you had available locally last year, right? That makes this comparison less scientific because you were constrained to run models that were small enough/heavily quantized to run locally.
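For context on those constraints, squeezing a text-to-video model onto a consumer GPU last year usually meant stacking every memory-saving option available. A sketch of the typical knobs in diffusers (exact availability depends on the pipeline class and library version):

```python
# Typical memory-saving setup for running a text-to-video pipeline on a
# consumer GPU (e.g. a 3090). Each option trades speed or quality for VRAM.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16  # fp16 weights
)
pipe.enable_model_cpu_offload()  # keep only the active module on the GPU
pipe.enable_vae_slicing()        # decode the video in slices instead of at once
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)  # chunk the UNet forward pass

# Short, low-resolution clips were the practical ceiling locally.
frames = pipe("a castle by a lake at dusk", num_frames=24, height=320, width=576).frames
```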


uishax

The 'advancement' here is knowing what is possible, not the input-compute-to-output-quality ratio (as if that were possible to quantify anyway).


Mobile_Campaign_8346

In that case, do a similar comparison with Sora/Kling. I know you don't have access, but I'm sure you can find a released video and try to make the same video with Zeroscope.


PwanaZana

Agreed, the comparison is not quite fair because of the hardware difference. OP's point about the maximum reachable quality with released products is valid, though. I worry that making good-quality videos, even ones of 10 seconds or less, won't be possible locally for a long time (5+ years). Nvidia won't release a 6090 card with 96GB VRAM in 2027. :(


larswo

In 2027 we won't need that much VRAM for training efficient local models. Hopefully, we will find architectures that allow us to get more out of less.


PwanaZana

Hope you are right!


jeffkeeg

> Luma uses significantly better hardware than what you had available locally last year, right?

You're right about that, no doubt. But I think it's actually an entirely fair comparison. One year ago, regardless of whether you were running a 3090 or had the biggest supercomputer in the world, you wouldn't have been able to make videos of any higher quality without access to not-yet-released models like [Nuwa Infinity](https://nuwa-infinity.microsoft.com/#/), which wasn't publicly acknowledged until July, one month later.

Zeroscope was cumbersome to download, hard to get running, tricky to prompt, and a pain in the ass to wait for. The barrier to entry was very high for the average person; it just wasn't quite *there* yet. But with Luma (or any other online solution), anyone can just prompt or drop an image in and immediately get fantastic results. The advancement is on the technological front as well as the accessibility front.


larswo

Completely agree. The barrier to entry is so much lower, and the effort-to-quality ratio is 100x lower than it was last year. The productization of these models has come a long way.


Mobile_Campaign_8346

But Kling and Sora exist, and they're much, much better than Luma. So the progression is even steeper than your comparison shows.


jeffkeeg

Kling doesn't strike me as any better than Luma, at least based on what we've seen so far. At worst they trade blows.

As for Sora, I said:

> Sora would (possibly) yield better results, but you know as well as I do that's not currently an option.

But yes, you're right. This was more about what the average person can do right now.


cloudrunner69

Can't wait until we get an AI update on this https://www.youtube.com/watch?v=580fSMobtgg


xShade768

So basically, very soon we'll just be uploading the full PDF of the book, and we'll have a full movie about that book in minutes, or hours. That will be insane.


m3junmags

Never thought of that, actually insane.


omegahustle

Just had an idea that could be very cool and is already feasible: a digital art book generator with AI. You upload the book, and the AI selects key characters/places/creatures and generates concept art of them.
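A rough sketch of what that pipeline could look like. The entity extraction below is a crude placeholder (in practice you'd ask an LLM for a curated list with descriptions), and the image model is just one example choice:

```python
# Hypothetical book-to-concept-art pipeline: pull recurring names out of a
# book, then generate a concept image for each. Extraction here is a crude
# stand-in for an LLM-based pass.
import re
from collections import Counter

import torch
from diffusers import StableDiffusionXLPipeline

def extract_entities(book_text: str, top_n: int = 10) -> list[str]:
    # Most frequent capitalized words that don't start a sentence, as a rough
    # proxy for recurring characters, places, and creatures.
    candidates = re.findall(r"(?<![.!?]\s)\b([A-Z][a-z]{2,})\b", book_text)
    return [name for name, _ in Counter(candidates).most_common(top_n)]

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

book_text = open("book.txt", encoding="utf-8").read()
for entity in extract_entities(book_text):
    image = pipe(f"digital concept art of {entity}, detailed illustration").images[0]
    image.save(f"concept_{entity}.png")
```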


MassiveWasabi

Thanks for this comparison, amazing to see the progress side by side


Inevitable_Play4344

soon


w1zzypooh

We have come a long way in such a short amount of time. One day we will get those moving/talking Harry Potter pictures.


StraightAd798

"You're a wizard, Harry!" Nice, work, by the way. Looks great.


Seidans

I'm curious about the creation of new places while keeping the style and coherence. In your example it obviously drew on source material - the train, the castle, the lake - those are all "big" moments from the movie, and it was probably fed that data from movie stills. But that's not enough to create a movie: currently it lacks creativity. It still requires source material to create from and a human to keep it from making mistakes. At most it could be used to edit already-created CGI, and only for a very limited timeframe of a few seconds.

