> Well, they limit us to 11-second clips. And there's no doubt that the technology is capable of much, much more than what they allow us to know or use ourselves
No.
Each added second DOUBLES the needed VRAM!!!
I think 11 seconds needs OVER 48 gigabytes for temporal coherency!
15 seconds is approaching half a terabyte.
= = = = =
Workaround: if nothing is in motion, the last animation frame can be used as the start frame of a new 8-to-11-second clip (see the sketch below).
= = = = =
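A rough sketch of that splice trick in code. The `generate_clip` callable is a stand-in for whatever image-to-video pipeline you actually run locally; it is hypothetical, not a real API, and only the chaining logic is the point:

```python
# Sketch of the "last frame becomes the next start frame" workaround.
# generate_clip is hypothetical: any callable that turns (prompt, start
# frame, length) into a list of frames.

def generate_long_video(generate_clip, prompt, first_frame,
                        clip_seconds=8, n_clips=5):
    """Chain short clips by seeding each one with the previous last frame."""
    all_frames = []
    seed = first_frame
    for _ in range(n_clips):
        # Each clip only ever costs the VRAM of one short generation.
        frames = generate_clip(prompt=prompt, start_frame=seed,
                               seconds=clip_seconds)
        all_frames.extend(frames)
        # Works best when the final frame is (nearly) static; otherwise
        # motion visibly resets at every splice point.
        seed = frames[-1]
    return all_frames
```

The obvious cost: anything mid-motion gets frozen at every splice, which is exactly why the trick only works "if nothing is in motion."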
A new invention might allow unlimited durations one day
It's a totally wrong assumption. The 11-second limitation exists so that people don't make anything too wild or meaningful; as I mentioned, they're saving the full potential of this tech for high-dollar clients like ad agencies and film studios. Imagine if you discovered fire. Now imagine if you could limit who gets to use fire and how much of it. You want people to know fire exists, and its benefits, but you also want to make it scarce so that you can demand big bucks for it too.
You are not correct, as of 2 weeks ago.
For coherence, all prior data is used for each frame... and worse, it's not linear.
Making long, good-looking videos with video diffusion, especially using next-frame prediction models, is tricky due to two core challenges: forgetting and drifting.
Forgetting occurs when the model fails to maintain long-range temporal consistency, losing details from earlier in the video.
Drifting, also known as exposure bias, is the gradual degradation of visual quality as initial errors in one frame propagate and accumulate across subsequent frames.
11 sec uses a shitload of RAM
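To put rough numbers on "not linear": if every frame attends to every other frame, the attention footprint grows with the square of the token count. The figures below are illustrative assumptions (tokens per frame, fp16, naive attention that materializes the full score matrix; real kernels avoid doing that), not measurements of any specific model:

```python
# Back-of-the-envelope sketch of quadratic attention growth over a full clip.
# All constants are assumptions for illustration, not real model numbers.

def naive_attention_gib(seconds, fps=24, tokens_per_frame=1536,
                        bytes_per_value=2):
    """Memory to materialize one full self-attention score matrix (fp16)."""
    tokens = seconds * fps * tokens_per_frame
    return tokens ** 2 * bytes_per_value / 1024 ** 3

for s in (2, 5, 11, 15):
    print(f"{s:>2} s -> {naive_attention_gib(s):8,.0f} GiB")
# Growth is quadratic in clip length: doubling the duration roughly
# quadruples this term, which is why long clips blow past consumer VRAM.
```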
One new hack is "FramePack" (May 2025):
https://arxiv.org/pdf/2504.12626
======
FramePack is a Neural Network structure that introduces a novel anti-forgetting memory structure alongside sophisticated anti-drifting sampling methods to address the persistent challenges of forgetting and drifting in video synthesis. This combination provides a more robust and computationally tractable path towards high-quality, long-form video generation.
The central idea of FramePack’s approach to the forgetting problem is progressive compression of input frames based on their relative importance. The architecture ensures that the total transformer context length converges to a fixed upper bound, irrespective of the video’s duration. This pivotal feature allows the model to encode substantially more historical context without an escalating computational bottleneck, facilitating anti-forgetting directly.
The FramePack system is built around Diffusion Transformers (DiTs) that generate a section of S unknown video frames, conditioned on T preceding input frames. It does not allow camera movement yet.
https://github.com/lllyasviel/FramePack
https://lllyasviel.github.io/frame_pack_gitpage/
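The "fixed upper bound" claim boils down to a geometric series: the newest frame keeps its full token budget and every older frame is compressed harder. A minimal sketch of that idea, assuming a simple halving rule (the paper's actual schedule uses progressively larger patchify kernels, so treat the numbers as illustrative):

```python
# Sketch of FramePack-style progressive compression: older frames get fewer
# context tokens, so total context stays bounded however long the video is.
# The halving rule and token counts are illustrative assumptions only.

def bounded_context(num_past_frames, full_tokens=1536, ratio=2):
    """Newest frame keeps full_tokens; each older frame gets 1/ratio as many."""
    total = 0
    for age in range(num_past_frames):
        budget = full_tokens // ratio ** age
        if budget == 0:          # very old frames are compressed away entirely
            break
        total += budget
    return total

for n in (8, 64, 512, 4096):
    print(f"{n:>4} past frames -> {bounded_context(n):>5} context tokens")
# Because 1 + 1/2 + 1/4 + ... converges, the total never exceeds roughly
# 2 * full_tokens, no matter how many frames of history exist.
```

That bounded context is what lets the history grow without the memory bill growing with it.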
No, each added second does NOT double the VRAM required. Don't assume that the video needs to be made concurrently. At the end of the day, a digital video is just a slide-show of still images. The 11-second limitation exists so that people don't make anything too wild or meaningful; as I mentioned, they're saving the full potential of this tech for high-dollar clients like ad agencies and film studios.
That's what's happening with AI.
TL;DR: XBX_X, YOU ARE WRONG.
RAM limits clip size.