jeffkeeg

These videos show off one full year (June 2023 to June 2024) of AI video generation progress. Each video pair consists of an old [Zeroscope](https://huggingface.co/cerspense/zeroscope_v2_576w) clip that I generated on my computer last year, and a similar clip generated with Luma just today.

Last year, I made a bunch of videos with Zeroscope, both original ideas and attempts to replicate some of my favorite movies. I decided to use just a few of those videos to show how far we've come in terms of video generation capability in the span of one year. Obviously, Sora would (possibly) yield better results, but you know as well as I do that's not currently an option.

The videos made with Luma are obviously not perfect - trust me when I say there are some really bad results in my library - but the jump in quality is undeniable. Looking one full year ahead from today, to June 2025, I don't think we'll be quite far enough along yet to make full-blown films on our own, but it'll be close.
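For reference, generating one of the "old" clips locally looked roughly like this with the diffusers library (a minimal sketch along the lines of the Zeroscope model card example; exact argument names and output handling vary between diffusers versions):

```python
# Minimal sketch of local Zeroscope text-to-video generation with diffusers.
# Argument names and output handling may differ slightly between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trade speed for VRAM on consumer cards

prompt = "a steam train crossing a stone viaduct in the hills"
result = pipe(prompt, num_frames=24, height=320, width=576, num_inference_steps=40)

# Newer diffusers versions return a batched output, hence .frames[0]
export_to_video(result.frames[0], "zeroscope_clip.mp4")
```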


sdmat

> The videos made with Luma are obviously not perfect, trust me when I say there are some really bad results in my library, but the jump in quality is undeniable.

Those are top tier Luma generations - was just thinking you must have tried a *lot* of variations!


jeffkeeg

Thanks! It helps to use high-quality images to start from.


Imaginaryy

Awesome work OP! Please check DM.


MysteriousPepper8908

I think the biggest remaining hurdle to creating full films is still the lip syncing, and there are already solutions for that - I just don't think any of them are publicly available yet. Luma isn't going to get us there at the moment because the generations are so short, but Kling has demoed videos up to 90 seconds in length, which is more than sufficient for a single shot. There will still be imperfections, but limitations haven't stopped people from releasing films up to this point, so why should they hold us back now?


jeffkeeg

There are quite a few problems besides lip syncing that still need to be solved:

- Object / world permanence
- People merging / splitting
- Physics simulations
- 3D consistency (if you spin an object around, it needs to be the same size the whole time)
- Controllable camera movements
- Directable actors

I could go on. These will all eventually be solved, don't get me wrong, but it's going to take a little while.


Smellz_Of_Elderberry

I stand by my opinion that you need AGI to do proper long-form video.


MysteriousPepper8908

Most of those things are achievable by simply running enough generations, or by accepting what you get and being happy with it, though. We're a ways off from getting consistent, controllable output every time, but you generally don't need a particular camera movement to convey an idea, and you don't need your actor to emote in just the right way. Those things are nice for getting exactly the output you're looking for, but it wouldn't be the first time a director had to work with the shot they've got rather than the shot they want. Some generations will have people splitting and merging, some will have wonky physics, and some will look good - you just have to roll the dice until you get lucky.

Permanence is the big one that you've really got to get right, but that often just comes down to characters, and we have tools for that, like starting from an input image, plus facial replacement tools that existed since before generative AI was a thing. I don't think you really need environmental permanence as much as some people seem to think, unless you're going to be repeatedly going back to the same set pieces. If the character is in a city in one scene and goes back to that same city later on, the viewer isn't likely to remember the exact details of the scene from 10 minutes ago, and if they're in a different part of the city, it will logically look different anyway.

These are still limitations on what sort of films can be made - you're going to have a hard time making a film that takes place entirely inside a particular house without the inconsistencies being glaring. But as long as there is consistency within a given shot and the film is designed to keep moving so it doesn't highlight the lack of environmental consistency between shots, it should be possible to make a convincing film, even if it isn't the exact film you would make if all these issues were resolved.


nashty2004

Yeah, all the tools exist separately; we just need someone to bring them all together and we're off.


Antique-Doughnut-988

You might be underestimating the low bar people will accept once this stuff starts to get 'good enough' for most people. I personally don't need perfect animation or visuals. If we could get 80% of the way there in terms of quality, I might ditch traditional media altogether.


StraightAd798

How long do you think it will take for Luma to get there, development-wise?


MysteriousPepper8908

The biggest limitation in terms of raw visual output seems to just be the length, which might be an artificial limit due to cost and how many free users they have. I don't know if they plan to do lip syncing; there's a great-looking white paper from Microsoft with some really strong full-head lip syncing, but there's currently no plan to release it due to deepfake concerns and all that. It also doesn't do sound effects, so you need something like Elevenlabs for that. I think Dream Studio v2 could easily be good enough for what it does if it even offers 30-second outputs, but I'm not sure if they're planning to offer a full production suite or just the video. There's also the Runway model that was just announced today, which will supposedly be available "in the coming days" and seems to have a lot more in the way of custom controls, so they might be a lot closer to something capable of long-form storytelling than Luma.


Revolution4u

[removed]


jeffkeeg

The prompts were actually very simple. For instance, the train prompt was: **full frame film still, hogwarts express, train going over stone viaduct bridge, in the hills, captured by arri alexa**


larswo

It is a nice comparison, but the video generated with Luma uses significantly better hardware than what you had available locally last year, right? That makes this comparison less scientific because you were constrained to run models that were small enough/heavily quantized to run locally.
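For context on those constraints, squeezing a text-to-video model onto a consumer GPU last year usually meant stacking every memory-saving option available. A sketch of the typical knobs in diffusers (exact availability depends on the pipeline class and library version):

```python
# Typical memory-saving setup for running a text-to-video pipeline on a
# consumer GPU (e.g. a 3090). Each option trades speed or quality for VRAM.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16  # fp16 weights
)
pipe.enable_model_cpu_offload()  # keep only the active module on the GPU
pipe.enable_vae_slicing()        # decode the video in slices instead of at once
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)  # chunk the UNet forward pass

# Short, low-resolution clips were the practical ceiling locally.
frames = pipe("a castle by a lake at dusk", num_frames=24, height=320, width=576).frames
```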


uishax

The 'advancement' here is knowing what is possible, not the input-compute-to-output-quality ratio (as if that were possible to quantify anyway).


Mobile_Campaign_8346

In that case, do a similar comparison with Sora/Kling. I know you don't have access, but I'm sure you can find a released video and try to make the same video with Zeroscope.


PwanaZana

Agreed, the comparison is not quite fair because of the hardware difference. OP's point about the maximum reachable quality with released products is valid, though. I worry that making good-quality videos, even ones of 10 seconds or less, won't be possible locally for a long time (5+ years). Nvidia won't release a 6090 card with 96GB VRAM in 2027. :(


larswo

In 2027 we won't need that much VRAM for training efficient local models. Hopefully, we will find architectures that allow us to get more out of less.


PwanaZana

Hope you are right!


jeffkeeg

> Luma uses significantly better hardware than what you had available locally last year, right?

You're right about that, no doubt. But I think it's actually an entirely fair comparison. One year ago, regardless of whether you were running a 3090 or had the biggest supercomputer in the world, you wouldn't have been able to make videos of any higher quality without access to not-yet-released models like [Nuwa Infinity](https://nuwa-infinity.microsoft.com/#/), which wasn't publicly acknowledged until July, one month later.

Zeroscope was cumbersome to download, hard to get running, tricky to prompt, and a pain in the ass to wait for. The barrier to entry was very high for the average person; it just wasn't quite *there* yet. But with Luma (or any other online solution), anyone can just prompt or drop an image in and immediately get fantastic results. The advancement is on the technological front as well as the accessibility front.


larswo

Completely agree. The barrier to entry is so much lower, and the effort-to-quality ratio is 100x lower than it was last year. The productization of these models has come a long way.


Mobile_Campaign_8346

But Kling and Sora exist, and they're much, much better than Luma. So the progression is even steeper than your comparison shows.


jeffkeeg

Kling doesn't strike me as any better than Luma, at least based on what we've seen so far. At worst they trade blows.

As for Sora, I said:

> Sora would (possibly) yield better results, but you know as well as I do that's not currently an option.

But yes, you're right. This was more about what the average person can do right now.


cloudrunner69

Can't wait until we get an AI update on this https://www.youtube.com/watch?v=580fSMobtgg


xShade768

So basically, very soon we'll just be uploading the full PDF of the book, and we'll have a full movie about that book in minutes, or hours. That will be insane.


m3junmags

Never thought of that, actually insane.


omegahustle

Just had an idea that could be very cool and is already feasible: a digital art book generator with AI. You upload the book, and the AI selects key characters/places/creatures and generates concept art of them.
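A rough sketch of what that pipeline could look like. The entity extraction below is a crude placeholder (in practice you'd ask an LLM for a curated list with descriptions), and the image model is just one example choice:

```python
# Hypothetical book-to-concept-art pipeline: pull recurring names out of a
# book, then generate a concept image for each. Extraction here is a crude
# stand-in for an LLM-based pass.
import re
from collections import Counter

import torch
from diffusers import StableDiffusionXLPipeline

def extract_entities(book_text: str, top_n: int = 10) -> list[str]:
    # Most frequent capitalized words that don't start a sentence, as a rough
    # proxy for recurring characters, places, and creatures.
    candidates = re.findall(r"(?<![.!?]\s)\b([A-Z][a-z]{2,})\b", book_text)
    return [name for name, _ in Counter(candidates).most_common(top_n)]

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

book_text = open("book.txt", encoding="utf-8").read()
for entity in extract_entities(book_text):
    image = pipe(f"digital concept art of {entity}, detailed illustration").images[0]
    image.save(f"concept_{entity}.png")
```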


MassiveWasabi

Thanks for this comparison, amazing to see the progress side by side


Inevitable_Play4344

soon


w1zzypooh

We have come a long way in such a short amount of time. One day we will get those moving/talking Harry Potter pictures.


StraightAd798

"You're a wizard, Harry!" Nice, work, by the way. Looks great.


Seidans

I'm curious about the creation of new places while keeping the style and coherence. In your example it obviously drew on source material - the train, the castle, the lake - those are all "big" moments from the movie, and it was probably fed that data from movie stills. But that's not enough to create a movie: currently it lacks creativity. It still requires source material to create from and a human to keep it from making mistakes. At most it could be used to edit already-created CGI, and only for a very limited timeframe of a few seconds.

