> Well, they limit us to 11-second clips. And there's no doubt that the technology is capable of much, much more than what they allow us to know or use ourselves
No.
Each added second DOUBLES the needed VRAM!!!
I think 11 seconds needs OVER 48 gigabytes for temporal coherency!
15 seconds is approaching half a terabyte.
= = = = =
Workaround: if nothing is in motion, the last animation frame can be used as the start frame of a new 8-to-11-second clip (see the sketch below).
= = = = =
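A rough sketch of that splice trick in code. The `generate_clip` callable is a stand-in for whatever image-to-video pipeline you actually run locally; it is hypothetical, not a real API, and only the chaining logic is the point:

```python
# Sketch of the "last frame becomes the next start frame" workaround.
# generate_clip is hypothetical: any callable that turns (prompt, start
# frame, length) into a list of frames.

def generate_long_video(generate_clip, prompt, first_frame,
                        clip_seconds=8, n_clips=5):
    """Chain short clips by seeding each one with the previous last frame."""
    all_frames = []
    seed = first_frame
    for _ in range(n_clips):
        # Each clip only ever costs the VRAM of one short generation.
        frames = generate_clip(prompt=prompt, start_frame=seed,
                               seconds=clip_seconds)
        all_frames.extend(frames)
        # Works best when the final frame is (nearly) static; otherwise
        # motion visibly resets at every splice point.
        seed = frames[-1]
    return all_frames
```

The obvious cost: anything mid-motion gets frozen at every splice, which is exactly why the trick only works "if nothing is in motion."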
A new invention might allow unlimited durations one day
It's a totally wrong assumption. The 11-second limitation exists so that people don't make anything too wild or meaningful; as I mentioned, they're saving the full potential of this tech for high-dollar clients like ad agencies and film studios. Imagine if you discovered fire. Now imagine if you could limit who gets to use fire and how much of it. You want people to know fire exists, and its benefits, but you also want to make it scarce so that you can demand big bucks for it too.
You are not correct, as of 2 weeks ago.
For coherence, all prior data is used for each frame... and worse, it's not linear.
Making long, good-looking videos with video diffusion, especially using next-frame prediction models, is tricky due to two core challenges: forgetting and drifting.
Forgetting occurs when the model fails to maintain long-range temporal consistency, losing details from earlier in the video.
Drifting, also known as exposure bias, is the gradual degradation of visual quality as initial errors in one frame propagate and accumulate across subsequent frames.
11 sec uses a shitload of RAM
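To put rough numbers on "not linear": if every frame attends to every other frame, the attention footprint grows with the square of the token count. The figures below are illustrative assumptions (tokens per frame, fp16, naive attention that materializes the full score matrix; real kernels avoid doing that), not measurements of any specific model:

```python
# Back-of-the-envelope sketch of quadratic attention growth over a full clip.
# All constants are assumptions for illustration, not real model numbers.

def naive_attention_gib(seconds, fps=24, tokens_per_frame=1536,
                        bytes_per_value=2):
    """Memory to materialize one full self-attention score matrix (fp16)."""
    tokens = seconds * fps * tokens_per_frame
    return tokens ** 2 * bytes_per_value / 1024 ** 3

for s in (2, 5, 11, 15):
    print(f"{s:>2} s -> {naive_attention_gib(s):8,.0f} GiB")
# Growth is quadratic in clip length: doubling the duration roughly
# quadruples this term, which is why long clips blow past consumer VRAM.
```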
One new hack is "FramePack" (May 2025):
https://arxiv.org/pdf/2504.12626
======
FramePack is a Neural Network structure that introduces a novel anti-forgetting memory structure alongside sophisticated anti-drifting sampling methods to address the persistent challenges of forgetting and drifting in video synthesis. This combination provides a more robust and computationally tractable path towards high-quality, long-form video generation.
The central idea of FramePack’s approach to the forgetting problem is progressive compression of input frames based on their relative importance. The architecture ensures that the total transformer context length converges to a fixed upper bound, irrespective of the video’s duration. This pivotal feature allows the model to encode substantially more historical context without an escalating computational bottleneck, facilitating anti-forgetting directly.
The FramePack system is built around Diffusion Transformers (DiTs) that generate a section of S unknown video frames, conditioned on T preceding input frames. It does not allow camera movement yet.
https://github.com/lllyasviel/FramePack
https://lllyasviel.github.io/frame_pack_gitpage/
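The "fixed upper bound" claim boils down to a geometric series: the newest frame keeps its full token budget and every older frame is compressed harder. A minimal sketch of that idea, assuming a simple halving rule (the paper's actual schedule uses progressively larger patchify kernels, so treat the numbers as illustrative):

```python
# Sketch of FramePack-style progressive compression: older frames get fewer
# context tokens, so total context stays bounded however long the video is.
# The halving rule and token counts are illustrative assumptions only.

def bounded_context(num_past_frames, full_tokens=1536, ratio=2):
    """Newest frame keeps full_tokens; each older frame gets 1/ratio as many."""
    total = 0
    for age in range(num_past_frames):
        budget = full_tokens // ratio ** age
        if budget == 0:          # very old frames are compressed away entirely
            break
        total += budget
    return total

for n in (8, 64, 512, 4096):
    print(f"{n:>4} past frames -> {bounded_context(n):>5} context tokens")
# Because 1 + 1/2 + 1/4 + ... converges, the total never exceeds roughly
# 2 * full_tokens, no matter how many frames of history exist.
```

That bounded context is what lets the history grow without the memory bill growing with it.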
No, each added second does NOT double the VRAM required. Don't assume that the video needs to be made concurrently. At the end of the day, a digital video is just a slide-show of still images. The 11-second limitation exists so that people don't make anything too wild or meaningful; as I mentioned, they're saving the full potential of this tech for high-dollar clients like ad agencies and film studios.
That's what's happening with AI.
TL;DR: XBX_X, YOU ARE WRONG.
RAM limits clip size.