posted 15 days ago by XBX_X on scored.co (+0 / -0 / +6 score on mirror)
XBX_X on scored.co
15 days ago 0 points (+0 / -0 ) 1 child
No, each added second does NOT double the required VRAM. Don't assume that the video needs to be made concurrently. At the end of the day, a digital video is just a slide-show of still images. The 11-second limitation exists so that people don't make anything too wild or meaningful; as I mentioned, they're saving the full potential of this tech for high-dollar clients like ad agencies and film studios.
part on scored.co
15 days ago -2 points (+0 / -0 / -2 score on mirror)
You are not correct, as of 2 weeks ago.

RAM limits clip size.

For coherence, all prior data is used for each frame... worse, it's not linear.
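
To make that scaling concrete, here is a minimal Python sketch (not FramePack code) of what happens when a naive next-frame model attends to every previous frame at full resolution: context grows linearly with clip length, and self-attention cost grows roughly quadratically. The tokens-per-frame and fps numbers are assumed purely for illustration.

# Illustrative sketch (not FramePack code): how context and attention cost grow
# when a naive next-frame model keeps every prior frame in full.
# Tokens per frame and fps are assumed numbers, for illustration only.

TOKENS_PER_FRAME = 1560   # assumed tokens per encoded latent frame
FPS = 30                  # assumed frame rate

def naive_cost(seconds):
    """Context length and relative attention cost when all prior frames are kept."""
    frames = int(seconds * FPS)
    context = frames * TOKENS_PER_FRAME    # context grows linearly with clip length
    attention_cost = context ** 2          # self-attention is quadratic in context
    return context, attention_cost

for sec in (1, 5, 11):
    ctx, cost = naive_cost(sec)
    print(f"{sec:>2}s clip: context = {ctx:,} tokens, relative cost = {cost:.2e}")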

Making long, good-looking videos with video diffusion, especially using next-frame prediction models, is tricky due to two core challenges: forgetting and drifting.

Forgetting occurs when the model fails to maintain long-range temporal consistency, losing details from earlier in the video.

Drifting, also known as exposure bias, is the gradual degradation of visual quality as initial errors in one frame propagate and accumulate across subsequent frames.
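
A toy numeric sketch of drifting, under an assumed 1% quality loss per generated frame: because each frame conditions on already-imperfect predecessors, the error compounds multiplicatively over the clip.

# Toy illustration of drifting (exposure bias): a small per-frame error compounds
# because each new frame is generated from already-degraded frames.
# The 1% per-frame degradation figure is assumed purely for illustration.

PER_FRAME_ERROR = 0.01   # assumed fractional quality loss per generated frame
FPS = 30

quality = 1.0
for frame in range(1, 11 * FPS + 1):     # an 11-second clip at 30 fps
    quality *= (1.0 - PER_FRAME_ERROR)
    if frame in (FPS, 5 * FPS, 11 * FPS):
        print(f"frame {frame:3d}: relative quality ~ {quality:.2f}")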

11 sec uses a shitload of RAM

one new hack is "FramePack"
  
https://arxiv.org/pdf/2504.12626
   
April 2025 FramePack:
======
   
FramePack is a Neural Network structure that introduces a novel anti-forgetting memory structure alongside sophisticated anti-drifting sampling methods to address the persistent challenges of forgetting and drifting in video synthesis. This combination provides a more robust and computationally tractable path towards high-quality, long-form video generation.

The central idea of FramePack’s approach to the forgetting problem is progressive compression of input frames based on their relative importance. The architecture ensures that the total transformer context length converges to a fixed upper bound, irrespective of the video’s duration. This pivotal feature allows the model to encode substantially more historical context without an escalating computational bottleneck, facilitating anti-forgetting directly.
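
A minimal sketch of that idea, assuming a simple geometric compression schedule (the real model's per-frame compression kernels differ): the frame i steps back keeps roughly 1/2^i of its tokens, so the packed context converges to a fixed bound no matter how long the history gets.

# Sketch of progressive compression with an assumed geometric schedule
# (illustrative only; FramePack's actual per-frame compression kernels differ).

TOKENS_PER_FRAME = 1560   # assumed full-resolution token count per frame

def packed_context(num_past_frames):
    """Total context when the frame i steps back keeps TOKENS_PER_FRAME // 2**i tokens."""
    total = 0
    for i in range(num_past_frames):
        tokens = TOKENS_PER_FRAME // (2 ** i)
        if tokens == 0:
            break          # frames older than this contribute nothing further
        total += tokens
    return total

for frames in (10, 100, 1000):
    print(f"{frames:4d} past frames -> packed context of {packed_context(frames):,} tokens")
# The total stays below 2 * TOKENS_PER_FRAME, so the transformer context (and the
# compute per generated frame) is bounded regardless of video length.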

The FramePack system is built around Diffusion Transformers (DiTs) that generate a section of S unknown video frames, conditioned on T preceding input frames. It does not allow camera movement yet.
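
A hypothetical outline of that setup, with trivial stand-in helpers rather than real FramePack APIs; the S and T values below are assumptions, not the paper's numbers.

# Hypothetical outline of section-by-section generation as described above.
# pack_history and sample_section are trivial stand-ins, NOT real FramePack APIs;
# S and T are assumed values.

S = 9    # frames generated per section (assumed)
T = 16   # preceding frames used as conditioning (assumed)

def pack_history(recent_frames):
    """Stand-in for progressive compression of the conditioning frames."""
    return recent_frames

def sample_section(prompt, history, num_frames):
    """Stand-in for one DiT sampling pass returning num_frames new frames."""
    return [f"frame conditioned on {len(history)} packed frames" for _ in range(num_frames)]

def generate_video(prompt, total_frames):
    frames = []
    while len(frames) < total_frames:
        history = pack_history(frames[-T:])   # only the last T frames are packed
        frames.extend(sample_section(prompt, history, S))
    return frames[:total_frames]

print(len(generate_video("a cat on a skateboard", 60)))   # -> 60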

https://github.com/lllyasviel/FramePack

https://lllyasviel.github.io/frame_pack_gitpage/

TL;DR: XBX_X, YOU ARE WRONG
  

