USE A BUFFERING FIFO QUEUE TO OUTPUT YOUR GFX -------------------------------------------------------------------------- This little text was originally placed in the Imphobia MAG 12. I've added some little things and I want it to be available for every programmers and not only sceners. I hope you'll like it and I encourage you to D/L the Imphobia MAGs (great MAG, Darky !!!) available on ftp.cdrom.com /pub/demos and ftp.arosnet.se to read other amazing programming tricks and more... -------------------------------------------------------------------------- Maybe this article has no interest for you, but i think it's important to be clear about how to display frames on your screen in a clean manner. There are many demos done by respectable groups, full of nice 3D algorithms, nice distos, ... which suffer of a bad display of the frames (ugly horizontal cuts in the frames). The problem is similar with many games (except DOOM which is perfect in this domain ;-) ). In the good old time of Mode-X, there weren't such problems because everyone was using the multiple pages offered by this scheme of adressing, in order to get a perfect double (or triple) buffered display. I know that many of you will argue that ModeX is slow, that now we have PCI/VLB adapters wich run damn fast in Mode 13h, and so why to support Mode-X again ... The reply is simple: i don't say that we must support Mode-X... no ! We have a better tool now, called UNIVBE 5.1, which offer many extended modes with multiple pages (unlike Mode 13h). Others will argue: pffff... i don't want to get my 3D engine idle loosing time to synchronise with the VGA display and to support double-buffering. The reply is : It is possible to get your engine 100% efficient and to get a perfect synchronised display, without "ugly cuts" ,... Even without using UniVBE, if you still want to use your favourite Mode 13h and you have at least a VLB/PCI adapter, there is a way to get a single-buffered display without cuts (no, Wizard, it's not with a HBL handler ;-)), and without idle graphics. ============= 1: THE BASICS ============= I just put here some lines grabbed from my Open GL user guide, and some personnal comments: In a movie, motion's achieved by taking a sequence of pictures (24 per sec), and then projecting them at 24/s on the screen. In Computer Graphics, screens typically refresh (redraw the picture) approximately 60 to 76/s, and some even run at about 120 refreshes/sec. The usual video modes used in demos and games have 70,60, and eventually 50 Hz refresh. The 'key' idea that makes motion picture projection work is that when it is displayed, EACH FRAME IS COMPLETED. This is NOT the case if you fill the video page during its display (single-buffering), because you alter the screen while the video processor decodes it !!!. So the video processor decodes a mix of your old and your new frame, and you get those ugly cuts. You can say that a "rep movsd" or "rep stosd" is faster than the decoding of video memory on your PCI/VLB adapter in Mode 13h, and it is impossible to "see" the update of the screen. Right, if you work with a small video mode (like 64 Kb video mem update) and if you are SYNCHRONISED with the screen DURING the modification. Now, suppose that you want to display your million-frame movie with a program like this (we suppose the screen refreshes at 70Hz, ie. Mode 13h): init_gfxmode(); for (i = 0; i < 1000000; i++) { clear_screen(); draw_frame(i); SYNC / wait_until_a_70th_of_a_second_is_over(); i = i + number of frames to skip to get a constant speed on every machine; } If you add the time taken by your system to clear the screen and to draw a typical frame, this program gives more and more disturbing results depending on how close to 1/70 second it takes to clear and draw. Suppose the drawing takes nearly a full 1/70 second. Items drawn first are visible for the full 1/70 second and present a solid image on the screen; items drawn toward the end are instantly cleared as the programs starts on the next frame, so they present at best a ghostlike image, since for most of the 1/70 second your eye is viewing the cleared background instead of the items that were enough unlucky to be drawn last. The problem is that this program doesn't display completely drawn frames; instead, you watch the drawing as it happens. ********************* 0 solid scanlines ********************* . ********************* . --------------------- . ghost scanlines !!! --------------------- 199 There are many solutions to this problem: A) WORK IN A BUFFER IN CENTRAL MEMORY, AND THEN COPY THIS BUFFER TO THE SCREEN (REP MOVSD) The code becomes: init_gfxmode(); for (i = 0; i < 1000000; i++) { copy my buff to video; clear my buff draw_frame(i) in my buff; SYNC / wait_until_a_70th_of_a_second_is_over(); i = i + number of frames to skip to get a constant speed on every machine; } This is better ... But if you work in a Hi-res GFX mode (ex. 640x400x16M => 768k or 1M), may be the copy will take more than 1/70, and you will see a piece of the previous frame at the bottom of the screen (instead of the ghost frame described before). This is not esthetic. Moreover you will need to work in central mem and then copy to video mem, this is far to be optimal when we think that the recent Video cards have a flat linear display in which we can work directly. ********************* 0 portion of frame i ********************* . ********************* . frontier --------------------- . portion of frame i-1 --------------------- 199 The frontier is +- constant thanks to the SYNC (assuming the calculation of a frame take a constant time... this can be true for plasmas, but not for 3D). But concretely, this is worse because the SYNC line is often removed because it takes time we could use for the calculation of effects. In this case the code becomes: init_gfxmode(); for (i = 0; i < 1000000; i++) { copy my buff to video; clear my buff draw_frame(i) in my buff; i = i + number of frames to skip to get a constant speed on every machine; } This means that there is no more synchronisation with the retrace of the screen. In this case the frontier between the old (at the bottom) and the new frame will move on the screen, and this is REALLY ugly and visible. i ****************** i+1 ******************* i+2 ****************** ****************** ******************* ****************** ****************** i ------------------- ****************** i-1 ------------------ ------------------- ****************** ------------------ ------------------- i+1 ------------------ AND EVEN : (!) i+2 ------------------ If you look carefully many 3D phong, mapped, ------------------ bumped demos, you will often see such things. i+3 ****************** (Look them on a 486 DX2-66, Fast Pentiums can ****************** false the results on demos designed for 486). ****************** So, this is NOT the right thing to do if you want a quality animation. B) USE DOUBLE BUFFERING ... =================== 2: DOUBLE BUFFERING =================== Double buffering is a radical way to remove the problems described before. The idea is to have 2 video pages, one is displayed while the other is being drawn. When the drawing of a frame is complete, the two buffers are swapped. So the one that was being viewed is now used for drawing, and vice versa. It's like a movie projector with only two frames in a loop; while one is being projected on the screen, an artist is desperately erasing and redrawing the frame that is not visible. As long as the artist is quick enough, the viewer notices no difference between this setup and one where all frames are already drawn and the projector is simply displaying them one after the other. With double-bufering, every frame is shown only when the drawing is complete ; the viewer never sees a partially drawn frame. This is a sample of code: init_gfxmode(); j = 0; k = 1; SYNC SetVisualPage(j) for (i = 0; i < 1000000; i++) { clear page(k) draw_frame(i) in page(k); SYNC / wait_until_a_70th_of_a_second_is_over(); k <=> j SetVisualPage(j) // must wait the vertical retrace to set new values to // the video registers i = i + number of frames to skip to get a constant speed on every machine; } The benefits are: - You can work directly in video mem and use the possibility of FLAT linear adressing. - It is impossible to have interferences between the new and the old frames. - Because you are working directly in video memory, you can even use the BitBLT accelerator of your card to "clear page(k)" or to set a nice background, or to draw lines, sprites, ... (There are very few cards which have a BitBLT able to work in central memory, even if you just want to specify a source in central mem... so in the single buffering scheme, the copy buff to screen must be done by hand :-( ). With the new UniVBE 5.2 and the VBE/AI, BitBLT will be a reality !!! Think about that !!! I'll speak about BitBLT in a next article, both customs routines and VBE/AI support... The prob: - You MUST be synchronised with the screen !!! So your graphics engine is idle until the vertical retrace is done, and that is time lost for calculation. With the SYNC line, you wait until the current screen refresh period is over so that the previous buffer is completely displayed. Assuming that your system refreshes the display 70 times per second, this means that the fastest frame rate you can achieve is 70 frames per second, and if all your frames can be cleared and drawn in under 1/70 second, your animation will run smoothly at that rate. What often happens on such a system is that the frame is too complicated to draw in 1/70 second, so each frame is displayed more than once. If, for example, it takes 1/45 second to draw a frame, you get 35 frames per second, and the graphics are idle for 1/35-1/45=1/157 second per frame. Altough 1/157 second of wasted time might not sound bad, it's wasted each 1/35 second, so actually more than 1/5 of the time is wasted. That means that if you're writing an application and gradually adding features, at first each feature you add has no effect on the overall performance - you still get 70 frames per second. Then, all of a sudden, you add one new feature, and your performance is cut in half because the system can't quite draw the whole thing in 1/70 of a second. A similar thing happens when the drawing time per frame is more than 1/35 second - the performance drops to 35 to 23 frames per second, and so on (70/1, 70/2, 70/3, 70/4, 70/5, ...). ==================================== 3: N-BUFFERING / THE BUFFERING QUEUE ==================================== How to get cuts-free animation without idle graphics ? The idea is to think in a different manner the couple CPU/Video. We can see this as the classical problem of producer/consummer: here the CPU produces frames and the Video consummes them in parallel. The CPU produces the frames as fast as it cans, and the Video consumes the frames at its own independant rate (ex. 70 frames/s). The frames produced by the CPU are placed in a FIFO Queue which feeds the Video. FIFO Queue (N entries max) --------------------------------- CPU -> * * * * -> Video --------------------------------- If the FIFO queue is full (the N entries are filled), then the CPU enters in a idle loop until there is some place free to put the new frame it has calculated. If the FIFO queue is empty, the Video will keep the old frame displayed, and look in the FIFO at the next refresh. For N=2, we have the double-buffering described before. N=3, we have triple-buffering which is often satisfactory, because it breaks yet the rigid synchronism we had with double-buffering, without using many buffers (3). ID Software have used triple- buffering in their game DOOM, which work in Mode-X (which gives 3 pages 320x240x256 or 4 pages 320x200x256). N=4, ... . . The more the buffers, the more the CPU can anticipates frames and avoid to enter in a idle loop. Concretely, we can bufferize the start-adresses of the video pages we are working on. In this case, we have a code like that: init_gfxmode(); install_interrupt_handler(); // CPU (Producer) // Interrupt handler (Consummer) (Handler called at each j = 1; Vertical retrace) InQ(0); for (i = 0; i < 1000000; i++) { if (EmptyQ() == false) clear page(j); // use BitBLT { draw_frame(i) in page(j); new_start = OutQ(); while (FullQ() == true) {}; // idle loop SetVisualPage(new_start); InQ(j); } j = (j + 1) MOD N; iret i = i + number of frames to skip to get a constant speed on every machine; } Yep, that's quite cool, uh ??? Note: to do an interrupt handler synchronized with a refresh of 70Hz, you just have to reprogram the PC timer to a clock a bit faster like 75 Hz, wait for the VR bit in 3DAh (resynchronisation) and restart the timer...(this is called a semi-active wait). There are many VR-Handler available on FTP sites or BBSes (look for example at the Starport BBS intro source code, ... ). Well, this code work fine if you have a multipage display ... This is not a problem for SVGA modes: if we consider a 1M board, which is the actual standard, we have 8 pages in 320x200x65K, 16 pages 320x200x256, 4 pages 640x400x256, 3 pages 640x480x256, ... (at least if you use UniVBE). But 320x200x256 16 pages doesn't work on all cards, and so the good old Mode 13h has still a reason to exist. No problem, remember what i told before in "The basics", don't use (physical) synchronised single-buffering with an Hi- Resolution mode... Ok, but Mode 13h is a small mode which can be updated very fast on PCI/VLB cards (64k to fill). The idea is to use a Logical N-buffering combined with a synchronised Physical single buffering. In a synchronised Physical single buffering, we work in buffers in central memory, and then copy them into the video memory. So, we can imagine to bufferize the addresses of those buffers (we place the addresses in the FIFO), and then to have an interrupt handler (synchronised with the VR) which get the address of the new buffer to display (= to copy) and invoke a copy routine for this buffer. Warning !!! This invoquation can not be a simple call, because if the copy routine takes too much time, this can result in a total misfunctionning of the interrupt handler (Remember that such periodic interrupt handler is a critical code which has some real time constraints and which cannot miss an event !!!). In order to avoid problems, the copy rout must be interruptible by the handler. This is obtained if we invoke the copy rout using a context-switch: we pop the stack layers until we reach the return address of the code interrupted by the handler, and we insert the address of the copy-rout, and then we re-push the layers, and when we'll do an "iret" at the end of the handler, we'll jump to the copy-rout (which is interruptible by the handler, because it's seen as a normal user application), at the end of the copy rout, we do an iret to restore the code interrupted previously by the handler. STACK | Var 1 | | Var 1 | | Var 2 | insert Adr 2 | Var 2 | | Adr 1 | ---> | Adr 2 | | ..... | | Adr 1 | You don't have to forget that just before the return address, there is also the status flag, and you have to consider it when you push/pop if you don't want to obtain awesome crashes. Just refer you to your 8x86/80x86 manual to see how the instructions iret, iretd, ret, ... work. In particular, when you push the address (which is in the form segment:offset/selector:offset) of the Copy_rout, you must push a dummy flag, because it will be invoked by an iret/iretd (you just have to do a pushf/pushfd). The code becomes: init_gfxmode(13h); install_interrupt_handler(); // CPU (Producer) // Interrupt handler (Consummer) (Handler called at each j = 1; Vertical retrace) InQ(0); for (i = 0; i < 1000000; i++) { if (EmptyQ() == false) clear buffer(j); // use CPU { draw_frame(i) in buffer(j); Adr_Buffer = OutQ(); while (FullQ() == true) {}; // idle loop Pop all local variables; InQ(j); pushfd; j = (j + 1) MOD N; Push Adr of Copy_Rout; i = i + number of frames to RePush all local variables; skip to get a constant } speed on every machine; iretd; // this handler MUST // be short !!!! } Adr_Buffer: Integer; Copy_Rout: (assume ds/es -> 0) mov esi, Adr_Buffer mov ecx,16000 mov edi,0a0000h rep movsd iretd This is the idea... With this scheme of work, we get 100% efficient code and 100% synchronisation with the display. Moreover, there are many interesting properties of the buffering queue, but i let you imagine that ;-). Good luck with your implementation. ___________________________________ Greets to all my friends, all TFL-TDV members, all kewl guyz of the scene i got a nice chat with, and all guyz who will greet me (us) in the future ;-) I specially thanx Karma and Bismarck/TFL-TDV for inspirating me, and Karma for playing with the bugs during the implementation of a N-Bufferized Mode 13h for his Descent-like part in Hurtless ;-) (C) 1996 Type One / TFL-TDV Contact me at the following addresses: llardin@is1.ulb.ac.be Laurent Lardinois (until october 1996, after 271 chauss�e de Saint Job jusk ask jcardin@is1.ulb.ac.be 1180 Bruxelles, Belgium my new email) ������������������������������������������������������������������������������� The N-Buffering (up to 8 buffers used !) feature was implemented in the demo "HURTLESS" we presented at Wired 95. It featured 320x200/640x200 Hi-Color, 320x200x256 chained multipages, BitBLT, FLAT LINEAR, and Video RAM booster support, WITH or WITHOUT UniVBE. Have a look if you want to see the thing working (however the demo might be unstable because of the intensive use of Mikmod 2.03 virtual timers... but maybe we'll do a special release with a new version of Mikmod. The SB support is really random). It is available on ftp.cdrom.com, ftp.arosnet.se, and hagar.arts.kuleuven.ac.be . �������������������������������������������������������������������������������