The simplest form of visual cryptography uses transparencies that individually convey no recognizable information, but reveal a meaningful image when precisely aligned.
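
To make that concrete, here is a minimal sketch of one common 2-out-of-2 construction, in which each secret pixel becomes two subpixels on each transparency. The subpixel encoding and the names `PATTERNS`, `make_shares` and `stack` are my own choices for illustration, not something fixed by the question.

```python
import secrets

# Assumed encoding: each secret pixel becomes two subpixels; 1 = opaque, 0 = clear.
PATTERNS = ((0, 1), (1, 0))

def make_shares(secret_rows):
    """Split a binary image (1 = black, 0 = white) into two shares."""
    share1, share2 = [], []
    for row in secret_rows:
        r1, r2 = [], []
        for pixel in row:
            p = secrets.choice(PATTERNS)      # uniformly random pattern for share 1
            q = (p[1], p[0]) if pixel else p  # complement if black, same if white
            r1.extend(p)
            r2.extend(q)
        share1.append(r1)
        share2.append(r2)
    return share1, share2

def stack(share1, share2):
    """Superimpose transparencies: a subpixel is opaque if either layer is."""
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(share1, share2)]

secret = [[1, 0], [0, 1]]
s1, s2 = make_shares(secret)
stacked = stack(s1, s2)
```

Each share alone is a uniformly random pattern, whatever the secret. Once stacked, a black pixel shows two opaque subpixels and a white pixel shows only one, so the image appears with half the original contrast.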

The basic approach for doing the same with video is to encipher each frame with such visual cryptography. That works, but (as stated in the question) it requires generating two images for each frame and playing them in perfect synchronization. Also, decoding won't be easy: the eye is good at averaging luminous intensity, and projection on an ordinary screen, even if perfectly superimposed, will lack contrast. Further, lossy video compression (as most video compression schemes are) will tend to prevent decoding.
It is tempting to make the image in one of the two video streams still, which in particular allows decoding with fair contrast simply by projecting on a screen that serves as key. But that's very insecure! What's moving in the original video will be distinguishable to the naked eye, and XORing each frame with the previous one makes that even more apparent: because the still key cancels out, the XOR of two consecutive frames in the ciphertext is also the XOR of the corresponding two consecutive frames in the plaintext, and is very revealing. In particular, it shows the left and right contours of something moving horizontally.
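
A toy illustration of that leak, modelling encipherment as a pixel-wise XOR with a still key image (an idealization of the transparencies). The frame size, the moving block, and the helper names `xor_frames`, `random_frame` and `frame_with_block` are assumptions made up for the demonstration.

```python
import secrets

def xor_frames(a, b):
    """Pixel-wise XOR of two binary frames of equal size."""
    return [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_frame(height, width):
    """A uniformly random binary frame, used here as the still key."""
    return [[secrets.randbits(1) for _ in range(width)] for _ in range(height)]

def frame_with_block(height, width, left, block_width=3):
    """Toy plaintext frame: a black block (1s) starting at column `left`."""
    return [[1 if left <= x < left + block_width else 0 for x in range(width)]
            for _ in range(height)]

H, W = 2, 10
key = random_frame(H, W)  # the same key image is reused for every frame
plain = [frame_with_block(H, W, left) for left in (2, 3)]  # block moves right by one
cipher = [xor_frames(p, key) for p in plain]

# The attacker only needs the two ciphertext frames; the key cancels out.
leak = xor_frames(cipher[0], cipher[1])
assert leak == xor_frames(plain[0], plain[1])
print(leak[0])  # [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```

The two 1s in the output are exactly the left and right edges of the moving block, recovered without any knowledge of the key.
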
I see no secure fix.