Date: Fri, 30 Apr 1993 19:43:34 -0400 From: tyetiser@umbc.edu (Mr. Tarkan Yetiser) Subject: Polymorphic Viruses Polymorphic Viruses: Implementation, Detection, and Protection Copyright (c) 1993 by VDS Advanced Research Group P.O. Box 9393 Baltimore, MD 21228, U.S.A. prepared by Tarkan Yetiser e-mail: tyetiser@umbc5.umbc.edu Jan 24, 1993 PA, U.S.A. Summary This paper discusses the subject of polymorphic engines and viruses. It looks at general characteristics of polymorphism as currently implemented. It tries to maintain a practical presentation of the subject matter rather than an academic and abstract approach that would confuse many people. Basic knowledge of the Intel 80x86 instruction set will be highly useful in understanding the material presented. A very detailed discussion is avoided not to have the side effect of "teaching" how to create polymorphic engines or viruses. The purpose is to help computer professionals understand this trend of virus development and the threats it poses. It should serve as a starting point for individuals who would like to get an idea about the polymorphic viruses and how they are implemented. Long gone are the days of innocence, when any schoolboy could write a virus scanner using a few signatures extracted from captured virus samples. The subject of polymorphism can be extended to other areas such as anti-reverse-engineering or anti-direct-attacks, and it can be argued to be useful in that context. This paper only looks at the use of polymorphism in PC viruses to avoid simple detection techniques. 1. Introduction In the Spring of 1992, we analyzed a polymorphic engine called MtE, and provided a report to satisfy the curiosity of the public. We also provided a freeware program called CatchMtE. At the end of 1992, we received reports of a new polymorphic engine, called TridenT Polymorphic Engine, or TPE for short. It was released in Europe by someone who calls himself Masud Khafir. Both MtE and TPE are distributed as object modules that can be linked to programs allowing them to create different-looking decryptors. One notorious use of such a module is for writing so-called "polymorphic" viruses. Prior to the appearance of TPE, several viruses using MtE (Mutation Engine) have been seen. The claimed author of TPE pays tribute to the author of MtE in the documentation that comes with TPE. MtE is about 2.4 kilobytes in size. It can generate decyptors using based or indexed addressing modes with word-size displacement. The decryptor steps through the code a word at a time. It uses 4 variations of the based or indexed addressing modes. The structure of the decryptors is constant. TPE is about 1.5 kilobytes in size. It can generate decryptors using based or indexed addressing modes with or without displacement. Unlike MtE, TPE can also create byte-at-a-time decryptors, as well as word-at-a-time. It also uses more addressing modes available on the 80x86; 6 variations of the based or indexed addressing modes are used. Its more general nature makes TPE less predictable, and complicates the task of recognizing TPE-based viruses. Many encryptive viruses can be considered a subset of TPE-based decryptors, and may be flagged as such. To overcome this problem, one has to check for other viruses before performing check for the presence of a TPE decryptor. 2. Polymorphism and Its Common Use We will now present some preliminary information on polymorphism in general, and discuss certain features of the Intel 80x86 instruction set. Although polymorphism is independent of encryption, it is easier to use encryption to hide the main body of the virus and implement a polymorphic decryptor. Viruses aim to keep their size as small as possible and it is impractical to make the main virus body polymorphic. One could attempt to rearrange the instructions in the main virus body or even use different instructions (to defy recognition techniques based on checksumming). Such an effort would not be as helpful or as easy as the common approach of using encryption in combination with polymorphism. In summary, current polymorphic viruses keep the main virus body encrypted, and implement a polymorphic decryption routine in plaintext. Since the decryptor is comparatively small, and performs one specific task, the amount of time and effort needed to craft a polymorphic virus is significantly reduced. Pictorially, a generic polymorphic virus would be structured as follows: virus entry point: Polymorphic Decryptor ***************************** ** encrypted **************** ** main virus body ********** ***************************** ***************************** 3. Implementation of Polymorphism on the Intel 80x86 In almost every case we have examined, the polymorphic engine exploits the fact that certain computations can be performed using different registers and instructions. To step through encrypted portion of code, for example, one can use DI, SI, or BX registers. To increment or to decrement the index value, one can ADD to the index register, INC it, or use an implicit instruction that increments it (CMPSB is used in TPE for example). Polymorphic engines can also rely on the availability of instructions that are coded using the same opcodes. On the 80x86, there are 11 opcodes used for several different instructions: 80, 81, 83, D0, D1, D2, D3, F6, F7, FE, and FF. When it is necessary to encode information about one operand, the middle three bits of the ModRM byte are used to distinguish operations. The ModRM byte follows some opcodes and can either extend the opcode or contain information about the operand and addressing modes of an instruction. It contains three fields: Mod (2 bits), Reg (3 bits), and R/M (3 bits). For example, for the opcode 80, the middle three bits of the ModRM byte, which make up the Reg field, have the following meanings: ADD 000 OR 001 ADC 010 SBB 011 AND 100 SUB 101 XOR 110 CMP 111 The second operand of each instruction with an opcode of 80 is an immediate byte, so the ModRM fields in the second byte of machine code encode the first register or memory operand. By flipping a few bits, it is possible to generate code achieving many different operations easily. Combined with a rich set of addressing modes, and a good random number generator, one can create very different-looking decryptors. 4. Common Characteristics of Polymorphic Viruses Polymorphic viruses of varying degrees of complexity have appeared in the past. In the anti-virus community, polymorphism is still an active area of discussion and much disagreement. Most researchers would agree that certain viruses are simply encryptive with a variable key. We do not consider such beasts polymorphic since they can be recognized using simple wild card scan strings. The number of such viruses significantly increased over the past few years as a reaction to the worldwide availability of good signature-based virus scanners. Similarly, scanners also evolved to handle such beasts using more flexible signatures that can include wild card or don't-care values to accommodate the variable parts of the decryptors. There are other viruses that exhibit some polymorphic behavior, though they remain unsophisticated. Classification of such beasts is a gray area. For example, some viruses can generate a limited number of different-looking decryptors. These do not implement a polymorphic code generation engine, but rather pick from a pre-computed set of code fragments that they carry along. Writing detection routines for such viruses is not too difficult. To aid in implementing polymorphism, a random-number generator is often used. A random-number generator provides selection of a part of the decryptor. This selection could be for "noise" instructions that are inserted in the decryptor to render signature-based scanning ineffective. It could also be used to select a certain addressing mode and appropriate registers. The obvious use of random number generators is to get a seed value for encryption. Another aspect of a polymorphic engine is the choice of instructions that actually perform the decryption. XOR is a favorite choice. The chosen operation modifies the code to be decrypted using an addressing mode that allows external memory access. By external, we mean outside the processor registers. By using several instructions and many different addressing modes, a polymorphic engine can achieve a large number of combinations of decryptors. Usually, a loop construct is set up to step through the encrypted code. On the 80x86, many instructions can be used to accomplish looping. The loop counter can be modified implicitly using LOOP instruction. Or it could be more explicit using a DEC CX, JNZ ?? combination. Several such possibilities exist. Again, availability of different approaches increases the number of appearances a decryptor can have. In the MtE, the loop construct was very predictable not only because it used a specific instruction but also because it had a characteristic that made it very simple to recognize MtE-based decryptors. Although "appearance-based" analysis on the MtE left many anti-virus developers in despair, a structural analysis proved to be very effective. We doubt even the developer of MtE noticed how predictable his polymorphic engine is. The goal of a polymorphic engine is not to leave any predictable sequence of instructions that a virus scanner can use to simplify or optimize an appropriate detection algorithm. For example, using the same instruction to increment the index register is too predictable. A mixture of instructions that achieve the same result explicitly or implicitly (as in TPE) would make detection a lot harder. 5. Detection of Polymorphic Viruses There are several problems that must be addressed to develop a reliable detection routine for a given polymorphic engine. The most challenging one appears to be avoiding false positives. Many programs include a run-time decompressor to reduce space requirements. When loaded in memory, the decompressor takes charge and produces an expanded copy of the code that can be run on the CPU. Furthermore, some programs are actually encrypted on disk, and they are decrypted on the fly when they are loaded. The purpose of using such a scheme is to make reverse-engineering efforts difficult. Such programs are more likely to trigger a false positive. To reduce false positive rate, one can limit the search to a small area of the suspect file. This would be helpful in many cases; however, the compressed and encrypted files carry their decompressor/decryptor at the entry point of the program, just where many viruses plant their code. Another trick a few viruses employ is to hide the virus entry point where the decryptor is located. Of course, scattering the virus code while keeping the host functional complicates the virus. A structural analysis of TPE reveals some information that can be used to recognize a TPE-based decryptor, although it is not as useful as it is in the case of MtE. Note that structural analysis does not rely on presence of specific byte sequences, and therefore offers a powerful tool that can be used in developing recognition routine for many polymorphic engines. 6. Protection Mechanisms and Solutions Anti-virus field is full of solutions to almost every existing virus problem, and sometimes solutions to non-existing problems as well. Among other things, one solution prevailed as the most popular: scanning. The inherent flaw in this solution is that it cannot cope with viruses that it does not have signatures for. Therefore ugrades are issued frequently. Deployment of such upgrades in large installations requires tremendous effort, and quickly translates into having a very old version of a scanner installed; even then not necessarily used. Aside from the false sense of security scanning gives, some companies make less-than-honest claims about the capabilities of their scanners to promote their products. Testing such claims is beyond the resources of most organizations. Those who could conduct such tests are not always impartial. Scanning has its place, and it is very useful when used properly. The best protection against computer viruses is user awareness and education. Unfortunately, a well-written virus can be so transparent that even the most observant user may not notice a thing. Even worse, many users lack the technical knowledge to understand how their computers work. Most people simply do not have the time or desire to learn more than how to use a mouse. The proof of this is the proliferation of products that make excessive hardware demands without offering improved functionality. The bulk of the code takes care of the user interface. This trend is unlikely to change. More powerful techniques such as integrity checking can deal with viruses of different kinds, including polymorphic viruses. Even a simple integrity checker should be able to find if a polymorphic virus is spreading all over your disk. In other words, solution to polymorphic viral spread has been available for quite some time. Note that there are some viruses that target integrity-based products and they must be dealt with accordingly. A detailed discussion of such viral methods can be found in a paper written by anti-virus researcher Mr. Vesselin Bontchev of Virus Test Center at the University of Hamburg. His paper is titled Attacks Against Integrity Checkers, and it is available via anonymous FTP. Interested individuals are encouraged to read his excellent paper. The point is that an integrity-based anti-viral solution should be made part of your arsenal against viruses. Such a solution could easily provide you with early detection of viruses before they spread. Once the viral spread is controlled, viruses become nothing more than another "computer glitch" that can haunt only those without timely backups. We believe that neither harsh legislation nor emphasis on responsible computing can stop virus development, although they may slow it down. It is necessary to take matters into your own hands and protect your computers adequately.