💾 Archived View for clemat.is › saccophore › library › ezines › textfiles › ezines › APJ › apj_3.txt captured on 2021-12-03 at 14:04:38.
-=-=-=-=-=-=-
::/ \::::::. :/___\:::::::. /| \::::::::. :| _/\:::::::::. :| _|\ \::::::::::. Feb/March 98 :::\_____\::::::::::. Issue 3 ::::::::::::::::::::::......................................................... A S S E M B L Y P R O G R A M M I N G J O U R N A L http://asmjournal.freeservers.com asmjournal@mailcity.com T A B L E O F C O N T E N T S ---------------------------------------------------------------------- Introduction...................................................mammon_ "An Introduction to SPARC assembly"............................+Spath. "Extending NASM"...............................................mammon_ Column: Win32 Assembly Programming "NASM specific Win32 coding".......................Tamas Kaproncai "More about Text".........................................Iczelion "Keyboard Input"..........................................Iczelion Column: The C standard library in Assembly "C string functions: introduction, _strlen".................Xbios2 "C string functions: _strcpy"...............................Xbios2 Column: The Unix World "X-Windows in Assembly Language: Part II"..................mammon_ Column: Virtual Machines "An Intro to the Java Virtual Machine"............Cynical Pinnacle Column: Assembly Language Snippets "NumFactors"..........................................Troy Benoist Column: Issue Solution "6-byte Solution"..........................................mammon_ ---------------------------------------------------------------------- ++++++++++++++++++++++++Issue Challenge+++++++++++++++++++++ Write a routine for converting ASCII hex to binary in 6 bytes ---------------------------------------------------------------------- ____________________________________________________________________________ ___ .___ __) (__ _____ ______ ``` ._____| \____\ ___/__._) /._) _ (_. \\ | | _\ |_ \ | \/ | |CE , .=|_____|___)\___|(_______|______| |===============[ Introduction ]===. '================================| :==================================' : . by mammon_ The first thing that you will notice about this issue --well, that it is late-- will probably be the section headers designed by iCE. I had to add a top/upper left border to them [the horizontal and slanted lines] in order to make them standout when scrolling though a 100K file such as this one, but other than they are all his: comments, etc welcome. I don't have much to say about this issue: I went overboard with the NASM stuff this month as I have been doing a lot of 'research' work in that area recently; my articles have been supplemented with Tamas Kaproncai's Win32 NASM pointers. Iczelion and XBios2 have both produced --as usual-- 2 quality articles this month, Iczelion's based on his win32 asm tutorial 'the MASM way', and XBios2 once again continuing to replace C with assembler. +Spath. has produced an excellent article on SPARC assembly language; I was hoping to debut the 'other CPU' scene with a MIPS article I had planned but it looks like +Spath has beat me to it. On a similar note, I mentioned on the Message Board wanting to start a Virtual Machines column. Cynical Pinnacle has started the column off this month with an article on programming the Java VM in its native 'assembly language'; in subsequent issues I and perhaps others will be adding articles here as well. A final note, I have not come up with a challenge for the next issue; anyone with good ideas is welcome to post one to the Message Board or to the APJ email address. Enjoy the mag! _m ::/ \::::::. :/___\:::::::. /| \::::::::. :| _/\:::::::::. :| _|\ \::::::::::. :::\_____\:::::::::::...........................................FEATURE.ARTICLE An introduction to SPARC assembly by +Spath. The goal of this article is to introduce SPARC v8 architecture and SPARC assembly ; I hope it can also constitute a good introduction to RISC philosophy. What is SPARC ? ----------------- The principles of RISC (Reduced Instructions Set Computer) are born in the early 80's in two universities (Berkeley and Stanford) ; its philosophy is the quest for simplicity and CPU speed. SPARC (Scalable Processor ARChitecture) is a 32 bits RISC architecture created by Sun in 1987. It's an open architecture, so that any manufaturer can make SPARC processors (like Philips, VLSI, T.I., Fujitsu... already did). Its key features are : - a load/store architecture : this means that only registers can be used in data manipulation operations, and not memory locations. Memory is organised in a linear address space of 2^32 bits which use "big-endian" organisation (the MSB is stored first) ; a word is 32 bits wide (a 16 bits data is a halfword). - a large number of registers : from 2 to 32 sets of 24 general purpose registers are available ; these 24 registers are local registers %l[0-7], in registers %i[0-7] and out registers %o[0-7], all working in an overlapping windows mechanism that will be explained later. The SPARC architecture also provides 8 global registers %g[0-7], 32 registers for floating-point operations (%f[0-31]) and some specific registers (%pc, %sp, %psr, %y,...). - a small set of simple instructions : to avoid translation from machine code to microcode, SPARC instructions are directly implemented in hardware, and therefore are very basic (mainly load/store, logical, arithmetic, branching). All instructions are 4 bytes long, and most of them use 3 registers (source1, source2, destination in that order). Assemblers also provide a set of synthetic instructions, which are more "coder friendly", but does not really exist for the processor (and therefore must be carefully used). These synthetic instructions have most of the time less operands, so that the corresponding real instructions often use %g0, a read-only register stuck at 0 ; here are some aliases : synthetic instruction | real opcode nop <=> sethi 0, %g0 ret <=> jmpl %i7+8, %g0 mov reg_or_imm, reg <=> or %g0, reg_or_imm, reg cmp reg, reg_or_imm <=> subcc reg, reg_or_imm, %g0 Enough with theory, let's see some code. SPARC assembly basics ----------------------- Let's start with an in-season "hello world" style program : '!' is used for single line comments, /* .. */ is used for multiple lines comments). !8<------------------------------------------------------------------------ /* FILE : hello.s */ .section ".rodata" ! read-only initialised datas .MyText: ! define our string label .asciz "Happy new year %i \n"! define a null-terminated string .Year: .word 1999 ! define a word constant .section ".text" ! read-only object code (instructions) .global main ! Make function name globally visible main: save %sp, -112, %sp ! allocate space for stack sethi %hi(.MyText), %o1 ! load higher part of string offset or %o1, %lo(.MyText), %o0 ! add lower part of offset set (.Year), %l1 ! get year address ld [%l1], %o1 ! load year into %o1 call printf ! print the string nop ! do nothing (BDS) ret ! Return to caller restore ! Restore register windows (BDS) Endmain: ! Tell the linker how big the .size main,(.-main) ! procedure is ("." is current address). !8<------------------------------------------------------------------------- Every procedure must save some memory space for itself ; this stack space will be used to store the out and local registers and all the datas needed by the procedure (the minimal space is 64 bytes for %o and %l registers). The stack grows from higher to lower addresses, so that allocating a stack space is implemented by substracting a value from the current stack pointer ; the previous stack pointer is called the frame pointer (%fp). Registers %o0 - %o5 are used to pass the first six parameters to a procedure, because the current stack pointer (%sp) is stored in %o6 and the calling program counter (%pc, used to calculate the return address) is stored in %o7. If a procedure has more than six parameters, the remaining parameters are passed using the stack space (eg for a caller's stack space of 92 bytes, the child procedure can get the seventh parameter at [%fp+92]). As I said before, all instructions are 32 bits long, so that you must use two steps (with sethi and or) to load a 32 bits data. Note that %hi refers to the most significant 22 bits and %lo refers to the least significant 10 bits of a register. Like most RISC machines, a SPARC processor uses a branch delay slot (BDS) to optimize pipeline efficiency : this means that by default, the instruction following a branching is executed regardless of whether or not the branch is taken. So the coder must move (when possible) an instruction from before the branch to after the branch. Another possibility is to use the 'nop' instruction or to add the ',a' suffix to the branch instruction, which annul the next operation. Calling and branching ----------------------- Let's take another example to better illustrate the calling process : this is a recursive implementation of the Fibonacci numbers, which are defined as : fib(N) = fib(N-1) + fib(N-2) fib(0) = fib(1) = 1 !8<-------------------------------------------------------------------------- /* FILE: fib.s */ .section ".rodata" ! read-only initialised datas (constants) .align 8 ! datas must be double-words aligned .MyText: ! define our string label .asciz "Fib(%i) = %i \n" ! define a null-terminated string ! ------- FIB : handles F(0) and F(1) -------- .section ".text" ! read-only object code (instructions) .align 4 ! code must be word-aligned (4 bytes) .global fib ! Make function name globally visible fib: save %sp, -112, %sp ! save stack space mov %i0, %o0 ! 1st parameter may be needed for calling cmp %o0, 1 ! asked for F(0) of F(1) ? ble F1orF0 ! yes : take the branch mov 1, %i0 ! return value = 1 (BDS) call fibcall ! nop ! do nothing (BDS) mov %o0, %i0 ! return value = fibcall return value F1orF0: ret ! Return to caller restore ! Restore register windows (BDS) Endfib: .size fib,(.-fib) ! ----- FIBCALL : calls F(N-1) and F(N-2) ----- .global fibcall ! Make function name globally visible fibcall: save %sp,-112,%sp ! save stack space mov %i0,%l0 ! save N in %l0 call fib ! call F(N-1) sub %l0,1,%o0 ! compute N-1 (BDS) mov %o0,%i0 ! save result in %i0 call fib ! call F(N-2) sub %l0,2,%o0 ! compute N-2 (BDS) ret ! Return to caller restore %i0,%o0,%o0 ! return F(N-1) + F(N-2) (BDS) Endfibcall: .size fibcall,(.-fibcall) !-------- MAIN -------------------------------- .global main ! Make function name globally visible main: save %sp,-112,%sp ! save stack space call fibcall ! calculate fib number 7 mov 7,%o0 ! (BDS) mov %o0,%o2 ! result is second parameter sethi %hi(.MyText),%o0 ! load higher part of string offset or %o0,%lo(.MyText),%o0 ! add lower part of offset call printf mov 7,%o1 ! number is the first parameter (BDS) ret ! Return to caller restore ! Restore register windows (BDS) Endmain: .size main,(.-main) !8<-------------------------------------------------------------------------- All procedures share the global registers (%g[0-7]), the remaining registers %l[0-7], %i[0-7], %o[0-7] constitute the register window. When a procedure starts execution, it allocates 16 registers (input and local), the output registers are overlapped with the subroutine's input registers. Here's what happens if procedure A calls procedure B which calls procedure C : proc A in | local | out | | | proc B | in | local | out | | | proc C | in | local | out As you see, for each procedure, the parameters passed to and received from a subroutine are stored in %o registers. The same way, the parameters taken from and passed to the calling procedure are stored in %i registers. The current window pointer (CWP) identifies the current register window : it is stored in the least significant 5 bits of %psr, and is modified by the 'save' and 'restore' commands. The condition code register (another part of %psr) contains four flags : Z (zero), N (negative), C (carry), and V (overflow) ; contrary to x86 assembly, these bits are not updated by standard arithmetic operations, but by special instructions (with 'cc' suffix, like 'cmp' which is in fact a 'subcc'). For instance, you have these equivalences : SPARC x86 subcc r1, r2, r1 <=> sub r1, r2 ; keep result and flags subcc r1, r2, %g0 <=> cmp r1, r2 ; discard result, keep flags sub r1, 0, r2 <=> mov r2, r1 ; keep result, discard flags As last example, here's a simple anti-cracking method : a checksum on our own code : !8<----------------------------------------------------------------------- /* FILE: cksum.s */ .section ".data" ! read-only initialised datas (constants) .align 8 ! datas must be double-words aligned .CRCError: .asciz "Wrong CRC !! \n" ! .section ".text" ! read-only object code (instructions) .align 4 ! code must be word-aligned (4 bytes) cksum: save %sp,-64,%sp ! save minimal stack space mov %i0, %l0 ! %l0 is the base address sub %i1, %i0, %l1 ! %l1 is the decreasing index mov %g0, %l2 ! %l2 is the running sum loop: ld [%l0+%l1], %o0 ! fetch the next element add %l2, %o0, %l2 ! add it to the running sum subcc %l1, 4, %l1 ! one fewer element bge,a loop ! if %o0 >= 0 get next element ! (delay slot result is annulled) mov %l2, %o0 ! store the result in sum (BDS) ret ! Return to caller restore ! Restore register windows (BDS) endcksum: ! Tell the linker how big the .size cksum,(.-cksum) ! procedure is. !-------------------- MAIN ------------------------- .global main main: save %sp,-64,%sp ! allocate space for stack set main, %o0 ! start address for cksum sethi %hi(EndOfCRCZone),%o1 ! high part of end address call cksum ! calculate cksum or %o1,%lo(EndOfCRCZone),%o1 ! low part of end address (BDS) EndOfCRCZone: set 0x10954, %o1 ! load precalculated checksum cmp %o0, %o1 ! is checksum correct ? be End ! yes : exit nop ! do nothing (BDS) Error: ! no : display message sethi %hi(.CRCError ),%o0 ! load higher part of string offset call printf ! print the string or %o0,%lo(.CRCError),%o0 ! add lower part of string offset (BDS) End: ret ! Return to caller restore ! Restore register windows (BDS) EndMain: ! Tell the linker how big the .size main,(.-main) ! procedure is. !8<----------------------------------------------------------------------- Tools and references ---------------------- Here are the tools I use when I play with SPARC assembly ; some are SunOS specific tools, some are multi-platforms ones. The code you read here has been tested on various Sun workstations (using SPARC and UltraSPARC processors). With very little modifications, it also worked on ISEM, a SPARC emulator for Linux (see below). - assembling : I use gcc for that job, which itself uses as and ld (assembler and linker) to create ELF executables. CC and cc also work well ; these 3 compilers can also be used to generate ASM source code from C source code with the "-S" option, which is IMHO a great method to learn assembly on a new platform. - debugging : I use adb, which is very basic but also very powerful, but gdb and dbx may also work. - reversing : all the previous tools are useful ; I also use a disassembler (SunOS dis), but some exist for other platforms (see Bruce Ediger's homepage). If you plan to give a try to SPARC assembly, here are some links : http://www.cs.unm.edu/~maccabe/classes/341/labman/labman.html ISEM (Instructional Sparc EMulator) homepage. http://www.csn.net/~bediger/ Bruce Ediger Homepage http://www.cs.earlham.edu/~mutioke/cs63/ a good introduction to SPARC with plenty of links. http://www.sics.se/~psm/sparcstack.html a very good overview of SPARC stack and registers. Final Words ------------ If you use a SPARC based machine, give a try to assembly, it's quite fun. If not, remember that the best you know your processor, the best you can code ASM. ::/ \::::::. :/___\:::::::. /| \::::::::. :| _/\:::::::::. :| _|\ \::::::::::. :::\_____\:::::::::::...........................................FEATURE.ARTICLE Extending NASM by mammon_ Programmers transitioning to NASM from a commercial assembler such as MASM or TASM immediately notice the lack of any high-level language structures -- the assembly syntax accepted by NASM is only slightly more sophisticated than what you would find in a debugger. While this has its good side --smaller code size, nothing hidden from the programmer-- it does make coding a bit more tedious. For this reason NASM comes with a preprocessor that is both simple and powerful; by writing NASM macros, the high-level functionality of other assemblers can be emulated rather easily. As thw following macros will demonstrate, most of the high-level asm features in commercial assemblers really do not do anything very elaborate; they simply are more convenient for the programmer. The macros that I will detail below provide some basic C and ASM constructs for use in NASM. I have made the complete file available at http://www.eccentrica.org/Mammon/macros.asm The macro file can be included in a .asm file with the NASM directive %INCLUDE "macros.asm" Comments on the usage of each macro are included in the file. Macro Basics ------------ The fundamenal structure of a NASM macro is %macro {macroname} {# parameters} %endmacro The actual code resides on the line between the %macro and %endmacro tags; this code will be inserted into your program wherever NASM finds {macroname}. Thus you could create a macro to push the contents of each register such as: %macro SAVE_REGS 0 push eax push ebx push ecx push edx %endmacro Once you have defined this macro, you can use it in your code like: SAVE_REGS call ReadFile ...which the preprocessor will expand to push eax push ebx push ecx push edx call ReadFile before assembling. It should be noted that all preprocessing takes place in a single stage immediately before compiling starts; to preview what the pre- processor will send to the assembler, you can invoke nasm with the -e option. The %macro tag requires that you declare the number of paramters that will be passed to the macro. This can be a single number or a range, with a few quirks: %macro LilMac 0 ; takes 0 arguments %macro LilMac 5 ; takes 5 arguments %macro LilMac 0-3 ; takes 0-3 arguments %macro LilMac 1-* ; takes 1 to unlimited arguments %macro LilMac 1-2+ ; takes 1-2 arguments %macro LilMac 1-3 0, "OK" ; takes 1-3 arguments, 2-3 default to 0 & "OK" The last three examples bear some explanation. The "-*" operator in the %macro tag specifies that the macro can handle any number of parameters; in other words, there is no maximum number, and the minimum is whatever number is to the left of the "-*" operator. The "+" operator means that any additional arguments will be appended to the last argument instead of causing an error, so that: LilMac 0, OK, This argument is one too many will result in argument 1 being 0 and argument 2 being "OK, This argument is one too many." Note that this is a good way to pass commas as part of an argu- ment (normally they are only separators). Providing defualt arguments after the number of arguments allows a macro to be called with fewer arguments than it expects. %macro SAVE_VARS 1-4 ecx, ebx, eax will fill a missing 4th argument with eax, 3rd with ebx, and 2nd with ecx. Note that you have to provide defaults starting with the last argument and working backwards. The parameters to the macro are available as %1 for the first argument, %2 for the second, and so on, with %0 containing a count of all the arguments. There is an equivalent to the DOS "SHIFT" command called %rotate which will rotate the parameters to either the left or to the right depending on whether a positive or negative value was supplied: Before: %1 %2 %3 %4 Before: %1 %2 %3 %4 Before: %1 %2 %3 %4 %rotate 1 %rotate -1 %rotate 2 After: %4 %1 %2 %3 After: %2 %3 %4 %1 After: %3 %4 %1 %2 So that rotating by 1 will put the value at %1 into %4, and rotating by -1 will put the value of %1 into %2. High-Level Calls ---------------- Perhaps the buggest complaint about NASM is its primitive call syntax. In MASM and TASM, the parameters to a call may be appended to the call itself: call MessageBox, hOwner, lpszText, lpszTitle, fuStyle where in NASM the parameters must be pushed onto the stack prior to the call: push fuStyle push lpszTitle push lpszText push hOwner call MessageBox Using NASM's "-*" macro feature along with the %rep directive make a high-level call easy to replicate: %macro call 2-* %define _func %1 %rep &0-1 %rotate 1 push %1 %endrep call _func %endmacro The %define directive simply defines the variable _func [underscores should prefix variable names in macros so you do not mistakenly use the same name later in the program] as %1, the name of the function to call. The %rep and %endrep directives enclose the instructions to be repeated, and %rep takes as a parameter the number of repetitions [in this case set to the number of macro parameters minus 1]. Thus, the above macro cycles through the arguments to call and pushes them last-argument first [C syntax] before making the call. Overloading an existing instruction such as call will cause warnings at compile time [remember, the preprocessor thinks you are doing a recursive macro invoke] so usually you will want to name the macro "c_call" or something similar. The following macros provide facilities for C, Pascal, fastcall, and stdcall call syntaxes. ;==============================================================-High-Level Call ; ccall FuncName, param1, param2, param 3... ;Pascal: 1st-1st, no clean ; pcall FuncName, param1, param2, param 3... ;C: Last-1st, stack cleanup ; stdcall FuncName, param1, param2, param 3... ;StdCall: last-1st, no clean ; fastcall FuncName, param1, param2, param 3... ;FastCall: registers/stack %macro pcall 2-* %define _j %1 %rep %0-1 %rotate -1 push %1 %endrep call _j %endmacro %macro ccall 2-* %define _j %1 %assign __params %0-1 %rep %0-1 %rotate -1 push %1 %endrep call _j %assign __params __params * 4 add esp, __params %endmacro %macro stdcall 2-* %define _j %1 %rep %0-1 %rotate -1 push %1 %endrep call _j %endmacro %macro fastcall 2-* %define _j %1 %assign __pnum 1 %rep %0-4 %rotate -1 %if __pnum = 1 mov eax, %1 %elif __pnum = 2 mov edx, %1 %elif __pnum = 3 mov ebx, %1 %else push %1 %endif %assign __pnum __pnum+1 %endrep call _j %endmacro ;==========================================================================-END Switch-Case Blocks ------------------ One of the most awkward C constructs to code in assembly is the SWITCH-CASE block. It is also rather difficult to re-create as a macro due to variable number and length of CASE statements. NASM's preprocessor has a context stack which allows you to create a set of local variables and addresses which is specific to a particular invocation of a macro. Thus it becomes possible to refer to labels which will be created in a future macro by giving them context-dependent names: %macro MacPart1 0 %push mac ;create a context called "mac" jmp %$loc ;jump to context-specific label "loc" %endmacro %macro MacPart2 0 %ifctx mac ;if we are in context 'mac' %$loc: ;define label 'loc' xor eax, eax ;code at this label... ret %endif ;end the if block %pop ;destroy the 'mac' context %endmacro As you can see, the context is created and named with a %push directive, and destroyed with a $pop directive. NASM has a number of preprocessor conditional IF/ELSE statements; in the above example, the %ifctx [if current context equals] directive is used to determine if a 'mac' context has been created [Note that the 'base' NASM conditionals include %if, %elif, %else, and %endif; these carry over to the %ifctx directive, such that there is available %ifctx, %ifnctx, %elifctx, %elifnctx, %else, and %endif; all %if directives must be closed with an %endif directive]. Finally, %$ is used to prefix the name of a context- specific variable or label. Non-context-specific local labels use the %% prefix: %macro LOOP_XOR %%loop: pop eax xor eax, ebx test eax, eax jnz %%loop %endmacro The SWITCH-CASE macro that follows uses the syntax: SWITCH Variable CASE Int BREAK CASE Int BREAK DEFAULT ENDSWITCH Which could be implemented as follows: card db 0 ;card_variable Jack EQU 11 Queen EQU 12 King EQU 13 ... SWITCH card CASE Jack add edx, Jack BREAK CASE Queen add edx, Queen BREAK CASE King add edx, King BREAK DEFAULT add d, [card] ENDSWITCH Note that SWITCH moves the variable into eax and CASE moves the value into ebx. ;===========================================================-SWITCH-CASE Blocks %macro SWITCH 1 %push switch %assign __curr 1 mov eax, %1 jmp %$loc(__curr) %endmacro %macro CASE 1 %ifctx switch %$loc(__curr): %assign __curr __curr+1 mov ebx, %1 cmp eax, ebx jne %$loc(__curr) %endif %endmacro %macro DEFAULT 0 %ifctx switch %$loc(__curr): %endif %endmacro %macro BREAK 0 jmp %$endswitch %endmacro %macro ENDSWITCH 0 %ifctx switch %$endswitch: %pop %endif %endmacro ;==========================================================================-END If-Then Blocks -------------- While the preprocessor provides support for if-then directives, it is a slight bit of work to cause that to generate the equivalent assembly language 'if' code [ the preprocessor 'if' is resolved before compile time, not at run time]. Using macros, you can create if-then blocks with the following structure: IF Value, Cond, Value ;if code here ELSIF Value, Cond, Value ;else-if code here ELSE ;else code here ENDIF An example being: IF [Passwd], e, [GoodVal] ;e == equals or je jmp Registered ELSE jmp FormatHardDrive ENDIF The trickiest part about this macro sequence is the 'Cond' parameter. NASM allows condition codes [the 'cc' in 'jcc' that you findin opcode refs] to be passed to macros; these condition codes are simply the 'jcc' with the 'j' cut off -- 'jnz' becomes 'nz', 'jne' becomes 'ne', 'je' becomes 'e', and so on. The reason for this is that the condition code is appended to a 'j' later in the macro: %macro Jumper %1 %2 %3 ;JUMPER Reg1, cc, Reg2 cmp %1, %3 j%+2 Gotcha jmp error %endmacro The above code appends %2 to the 'j' with the directive j%+2. Note that if you use j%- instead of j%+, NASM will insert the *inverse* condition code, so that jz becomes jnz, etc. For example, calling the macro %macro Jumper2 %1 j%-1 JmpHandler %endmacro with the invocation 'Jumper2 nz' would assemble the code 'jz JmpHandler'. The condition codes can be a bit tricky to work with; it is advisable to add a sequence such as the following to the macro file: %define EQUAL e %define NOTEQUAL ne %define G-THAN g %define L-THAN l %define G-THAN-EQ ge %define L-THAN-EQ le %define ZERO z %DEFINE NOTZERO nz so that you could call the IF macro as follows: IF PassWd, EQUAL, GoodVal ;if code here ...etc etc. Note also that the IF-THEN-ELSE macros put the passed values into eax and ebx for compatison, so these registers will need to be preserved. ;===========================================================-IF-THEN-ELSE Loops %macro IF 3 %push if %assign __curr 1 mov eax, %1 mov ebx, %3 cmp eax, ebx j%+2 %%if_code jmp %$loc(__curr) %%if_code: %endmacro %macro ELSIF 3 %ifctx if jmp %$end_if %$loc(__curr): %assign __curr __curr+1 mov eax, %1 mov ebx, %3 cmp eax, ebx j%+2 %%elsif_code jmp %$loc(__curr) %%elsif_code: %else %error "'ELSIF' can only be used following 'IF'" %endif %endmacro %macro ELSE 0 %ifctx if jmp %$end_if %$loc(__curr): %assign __curr __curr+1 %else %error "'ELSE' can only be used following an 'IF'" %endif %endmacro %macro ENDIF 0 %$loc(__curr): %$end_if: %pop %endmacro ;==========================================================================-END For/While Loops --------------- The DO...FOR and DO...WHILE do nothing differnet from the previous macros, but are simply a different application of the same principles. The syntax for calling these macros is: DO ;code to do here FOR min, Cond, max, step DO ;code to do here WHILE variable, Cond, value It is perhaps easiest to illustrate this by comparing the macros with C code. for( x = 0; x <= 100; x++) { SomeFunc() } Equates to: DO call SomeFunc FOR 0, l, 100, 1 Likewise, for( x = 0; x != 100; x--) { SomeFunc() } Equates to: DO call SomeFunc FOR 0, e, 100, -1 The WHILE macro is similar: while( CurrByte != BadAddr) {SomeFunc() } Equates to: DO call SomeFunc WHILE CurrByte, ne, BadAddr Once again, eax and ebx are used in the FOR and WHILE macros. ;====================================================-DO-FOR and DO-WHILE Loops %macro DO 0 %push do jmp %$init_loop %$start_loop: push eax %endmacro %macro FOR 4 %ifctx do pop eax add eax, %4 cmp eax, %3 j%-2 %%end_loop jmp %$start_loop %$init_loop: mov eax, %1 jmp %$start_loop %%end_loop: %pop %endif %endmacro %macro WHILE 3 %ifctx do pop eax mov ebx, %3 cmp eax, ebx j%+2 %%end_loop jmp %$start_loop %$init_loop: mov eax, %1 jmp %$start_loop %%end_loop: %pop %endif %endmacro ;==========================================================================-END Data Declarations ----------------- Declaring data is relatively simple in assembly, but sometimes it helps to make code more clear if you create macros that assign meaningful data types to variables, even if those macros simply resolve to a DB or a DD. The following macros demonstrate this concept. They are invoked as follows: CHAR Name, String ;e.g. CHAR UserName, "Joe User" INT Name, Byte ;e.g. INT Timeout, 30 WORD Name, Word ;e.g. WORD Logins DWORD Name, Dword ;e.g. DWORD Password Note that when invoked with a name but not a value, these macros create empty [DB 0] variables. ;============================================================-Data Declarations %macro CHAR 1-2 0 %1: DB %2,0 %endmacro %macro INT 1-2 0 %1: DB %2 %endmacro %macro WORD 1-2 0 %1: DW %2 %endmacro %macro DWORD 1-2 0 %1: DD %2 %endmacro ;==========================================================================-END Procedure Declarations ---------------------- Procedure declarations are another matter of convenience. It is often useful in your code to clearly delineate the start and end of a procedure; each of the PROC macros below does that, as well as creating a stack fram for the procedure. The ENTRYPROC macro creates a procedure named 'main' and declares main as a global symbol; the standard PROC declares the provided name as global. These macros can be used as follows: PROC ProcName Parameter1, Parameter2, Parameter3 ;procedure code here ENDP ENTRYPROC ;entry-procedure code here ENDP Note that the Parameters to PROC are set up to EQU to offsets from ebp, e.g. ebp-4, ebp-8, etc. I have also included support for local variables, which will EQU to positive offsets from ebp' these may be used as follows: PROC ProcName Parameter1, Parameter2, Parameter3... LOCALDD Dword_Variable LOCALDW Word_Variable LOCALDB Byte_Variable ;procedure code here ENDP ;=======================================================-Procedure Declarations %macro PROC 1-9 GLOBAL %1 %1: %assign _i 4 %rep %0-1 %2 equ [ebp-_i] %assign _i _i+4 %rotate 1 %endrep push ebp mov ebp, esp %push local %assign __ll 0 %endmacro %macro ENDP 0 %ifctx local %pop %endif pop ebp %endmacro %macro ENTRYPROC 0 PROC main %endmacro %macro LOCALVAR 1 sub esp, 4 %1 equ [ebp + __ll] %endmacro %macro LOCALDB 1 %assign __ll __ll+1 LOCALVAR %1 %endmacro %macro LOCALDW 1 %assign __ll __ll+2 LOCALVAR %1 %endmacro %macro LOCALDD 1 %assign __ll __ll+4 LOCALVAR %1 %endmacro ;==========================================================================-END Further Extension ----------------- Continued experimentation will of course prove fruitful. It is recommended that you read/print out chapter 4 of the NASM manual for reference. In addition, it is very helpful to test your macros by cpmpiling the source with "nasm -e", which will output the preprocessed source code to stdout and will not compile the program. ____________________________________________________________________________ ______ _____. ____ ``` ._____/\______.________._________. ._\___ |__\_ /. \\ | | | _ | | (_ | __/ |CE , .=|_____/\______| |----)____| |______|______|======[ Win 32 ASM ]===. '===============| :===================================================' NASM specific Win32 coding by Tamas Kaproncai Contents ======== 0. Preface 1. Compiling 2. Include files 3. Library files 4. Importing API functions 5. Calling API functions 6. WinMain 7. Window procedure 8. Sections 9. Self modification 0. Preface ========== I will introduce the win32 coding and I will focus on the NASM specific part. Downloadable working examples: ftp://ftp.szif.hu/pub/demos/tool/w32nasm.zip http://rs1.szif.hu/~tomcat/win32 There is another tutorial on this topic, called: "The Win32 NASM Coding Toolkit v0.02 by Gij" that uses the LCC linker and the resource compiler which comes with LCC. 1. Compiling ============ I'm working with the following free programs in connection with NASM: - linker: ALINK v1.5 by Anthony A.J. Williams. - resource compiler: GoRC v0.50b by Jeremy Gordon. The process of compiling a win32 program involves a number of steps which can be divided into three main processes: preparing the include files, preparing the library files, and writing the actual program. The compiling flow chart ------------------------ .h -> ? -> .inc \ .asm -> NASM -> .obj \ .rc -> GORC -> .res -> ALINK -> .exe / .dll -> IMPLIB -> .lib (? means handwork) 2. Include files ================ The include files (*.inc) must be generated from existing header files (*.h) that come with win32-compatible C or Pascal compilers. Files needed: WIN32N.INC (Thanks for the inital MASM version to S.L.Hutchesson). The compiler will be NASM version 0.97 http://www.cryogen.com/nasm Usage: nasmw -fobj -w+orphan-labels -pwin32n.inc %1.asm 3. Library files ================ Files Needed: WIN32.LIB The linker will be ALINK http://www.geocities.com/SiliconValley/Network/4311/#alink Usage: alink -oPE %1 win32.lib %1.res %2 %3 More lib files can be created with IMPLIB. Example: IMPLIB DDRAW.DLL 4. Importing API functions ========================== EXTERN MessageBoxA IMPORT MessageBoxA use32.dll 5. Calling API functions ======================== PUSH UINT MB_OK PUSH LPCTSTR title1 PUSH LPCTSTR string1 PUSH HWND NULL CALL [MessageBoxA] 6. WinMain ========== You don't need to use the name, WinMain: You must start the program with the label, ..start: At the begening there is nothing special in the stack, so you should call GetModuleHandleA for hInstance and GetCommandLineA for the command line. (Command line consists the full path, the file name and the parameters). You can exit the program with: RETN or you should call the ExitProcess function: PUSH UINT 0 ; the error code CALL [ExitProcess] 7. Window procedure =================== There are four parameters on the top of the stack: PUSH EBP MOV EBP,ESP %DEFINE hwnd EBP+8 ;handle of window %DEFINE message EBP+12 ;message %DEFINE wParam EBP+16 ;first message parameter %DEFINE lParam EBP+20 ;second message parameter You can handle the messages depends on WPARAM [wParam] and the rest you can pass to DefWindowProcA: PUSH LPARAM [lParam] PUSH WPARAM [wParam] PUSH UINT [message] PUSH HWND [hwnd] CALL [DefWindowProcA] POP EBP RETN 16 8. Sections =========== You need a code section: SECTION CODE USE32 CLASS=CODE and a data section: SECTION DATA USE32 CLASS=DATA You don't need bss section, instead of you should append every RESB, RESW, RESD, RESQ to the end of the source code. This zero data not will be included to the exe file. 9. Self modification ==================== You can include your code and data together in one section: SECTION CODE USE32 CLASS=CODE In that case you need another object file, with only one line source: SECTION CODE USE32 CLASS=DATA ALINK will combine the properties of these two sections. EXTERN MessageBoxA EXTERN ExitProcess SECTION CODE USE32 CLASS=CODE ..start: PUSH UINT MB_OK PUSH LPCTSTR title1 PUSH LPCTSTR string1 PUSH HWND NULL CALL MessageBoxA PUSH UINT NULL CALL ExitProcess SECTION DATA USE32 CLASS=DATA string1: db 'Hello world!',13,10,0 title1: db 'Hello',0 ____________________________________________________________________________ ______ _____. ____ ``` ._____/\______.________._________. ._\___ |__\_ /. \\ | | | _ | | (_ | __/ |CE , .=|_____/\______| |----)____| |______|______|======[ Win 32 ASM ]===. '===============| :===================================================' More about Text by Iczelion We will experiment more with text attributes, ie. font and color. Preliminary: ------------ Windows color system is based on RGB values, R=red, G=Green, B=Blue. If you want to specify a color in Windows, you must state your desired color in terms of these three major colors. Each color value has a range from 0 to 255 (a byte value). For example, if you want pure red color, you should use 255,0,0. Or if you want pure white color, you must use 255,255,255. You can see from the examples that getting the color you need is very difficult with this system since you have to have a good grasp of how to mix and match colors. For text color and background, you use SetTextColor and SetBkColor, both of them require a handle to device context and a 32-bit RGB value. The 32-bit RGB value's structure is defined as: RGB_value struct unused db 0 blue db ? green db ? red db ? RGB_value ends Note that the first byte is not used and should be zero. The order of the remaining three bytes is reversed,ie. blue, green, red. However, we will not use this structure since it's cumbersome to initialize and use. We will create a macro instead. The macro will receive three parameters: red, green and blue values. It'll produce the desired 32-bit RGB value and store it in eax. The macro is as follows: RGB macro red,green,blue xor eax,eax mov ah,blue shl eax,8 mov ah,green mov al,red endm You can put this macro in the include file for future use. You can "create" a font by calling CreateFont or CreateFontIndirect. The difference between the two functions is that CreateFontIndirect receives only one parameter: a pointer to a logical font structure, LOGFONT. CreateFontIndirect is the more flexible of the two especially if your programs need to change fonts frequently. However, in our example, we will "create" only one font for demonstration, we can get away with CreateFont. After the call to CreateFont, it will return a handle to a font which you must select into the device context. After that, every text API function will use the font we have selected into the device context. Content: -------- Below is our source code: ;======================================================================TEXT.ASM include windows.inc includelib user32.lib includelib kernel32.lib includelib gdi32.lib RGB macro red,green,blue xor eax,eax mov ah,blue shl eax,8 mov ah,green mov al,red endm .data ClassName db "SimpleWinClass",0 AppName db "Our First Window",0 TestString db "Win32 assembly is great and easy!",0 FontName db "script",0 .data? hInstance HINSTANCE ? CommandLine LPSTR ? .code start: invoke GetModuleHandle, NULL mov hInstance,eax invoke GetCommandLine invoke WinMain, hInstance,NULL,CommandLine, SW_SHOWDEFAULT invoke ExitProcess,eax WinMain proc hInst:HINSTANCE,hPrevInst:HINSTANCE,CmdLine:LPSTR,CmdShow:SDWORD LOCAL wc:WNDCLASSEX LOCAL msg:MSG LOCAL hwnd:HWND mov wc.cbSize,SIZEOF WNDCLASSEX mov wc.style, CS_HREDRAW or CS_VREDRAW mov wc.lpfnWndProc, OFFSET WndProc mov wc.cbClsExtra,NULL mov wc.cbWndExtra,NULL push hInstance pop wc.hInstance mov wc.hbrBackground,COLOR_WINDOW+1 mov wc.lpszMenuName,NULL mov wc.lpszClassName,OFFSET ClassName invoke LoadIcon,NULL,IDI_APPLICATION mov wc.hIcon,eax mov wc.hIconSm,0 invoke LoadCursor,NULL,IDC_ARROW mov wc.hCursor,eax invoke RegisterClassEx, addr wc invoke CreateWindowEx,NULL,ADDR ClassName,ADDR AppName,\ WS_OVERLAPPEDWINDOW,CW_USEDEFAULT,\ CW_USEDEFAULT,CW_USEDEFAULT,CW_USEDEFAULT,NULL,NULL,\ hInst,NULL mov hwnd,eax invoke ShowWindow, hwnd,SW_SHOWNORMAL invoke UpdateWindow, hwnd .WHILE TRUE invoke GetMessage, ADDR msg,NULL,0,0 .BREAK .IF (!eax) invoke TranslateMessage, ADDR msg invoke DispatchMessage, ADDR msg .ENDW mov eax,msg.wParam ret WinMain endp WndProc proc hWnd:HWND, uMsg:UINT, wParam:WPARAM, lParam:LPARAM LOCAL hdc:HDC LOCAL ps:PAINTSTRUCT LOCAL hfont:HFONT mov eax,uMsg .IF eax==WM_DESTROY invoke PostQuitMessage,NULL .ELSEIF eax==WM_PAINT invoke BeginPaint,hWnd, ADDR ps mov hdc,eax invoke CreateFont,24,16,0,0,400,0,0,0,OEM_CHARSET,\ OUT_DEFAULT_PRECIS,CLIP_DEFAULT_PRECIS,\ DEFAULT_QUALITY,DEFAULT_PITCH or FF_SCRIPT,\ ADDR FontName invoke SelectObject, hdc, eax mov hfont,eax RGB 200,200,50 invoke SetTextColor,hdc,eax RGB 0,0,255 invoke SetBkColor,hdc,eax invoke TextOut,hdc,0,0,ADDR TestString,SIZEOF TestString invoke SelectObject,hdc, hfont invoke EndPaint,hWnd, ADDR ps .ELSE invoke DefWindowProc,hWnd,uMsg,wParam,lParam ret .ENDIF xor eax,eax ret WndProc endp end start ;===========================================================================EOF Let's begin our analysis : ) invoke CreateFont,24,16,0,0,400,0,0,0,OEM_CHARSET,\ OUT_DEFAULT_PRECIS,CLIP_DEFAULT_PRECIS,\ DEFAULT_QUALITY,DEFAULT_PITCH or FF_SCRIPT,\ ADDR FontName CreateFont creates a logical font that is the closest match to the given parameters and the font data available. This function has more parameters than any other function in Windows. It returns a handle to logical font to be used by SelectObject function. We will examine its parameters in detail. HFONT CreateFont(int nHeight, int nWidth, int nEscapement, int nOrientation, int nWeight, BYTE cItalic, BYTE cUnderline, BYTE cStrikeOut, BYTE cCharSet, BYTE cOutputPrecision, BYTE cClipPrecision, BYTE cQuality, BYTE cPitchAndFamily, LPSTR lpFacename); nHeight --> The desired height of the characters . 0 means use default size. nWidth --> The desired width of the characters. Normally this value should be 0 which allows Windows to match the width to the height. However, in our example, the default width makes the characters hard to read, so I use the width of 16 instead. nEscapement --> Specifies the orientation of the next character output relative to the previous one in tenths of a degree. Normally, set to 0. Set to 900 to have all the characters go upward from the first character, 1800 to write backwards, or 2700 to write each character from the top down. nOrientation --> Specifies how much the character should be rotated when output in tenths of a degree. Set to 900 to have all the characters lying on their backs, 1800 for upside-down writing, etc. nWeight --> Sets the line thickness of each character. Windows defines the following sizes: FW_DONTCARE equ 0 FW_THIN equ 100 FW_EXTRALIGHT equ 200 FW_ULTRALIGHT equ 200 FW_LIGHT equ 300 FW_NORMAL equ 400 FW_REGULAR equ 400 FW_MEDIUM equ 500 FW_SEMIBOLD equ 600 FW_DEMIBOLD equ 600 FW_BOLD equ 700 FW_EXTRABOLD equ 800 FW_ULTRABOLD equ 800 FW_HEAVY equ 900 FW_BLACK equ 900 cItalic --> 0 for normal, any other value for italic characters. cUnderline --> 0 for normal, any other value for underlined characters. cStrikeOut --> 0 for normal, any other value for characters with a line through the center. cCharSet --> The character set of the font. Normally should be OEM_CHARSET which allows Windows to select font which is operating system-dependent. cOutputPrecision --> Specifies how much the selected font must be closely matched to the characteristics we want. Normally should be OUT_DEFAULT_PRECIS which defines default font mapping behavior. cClipPrecision --> Specifies the clipping precision. The clipping precision defines how to clip characters that are partially outside the clipping region. You should be able to get by with CLIP_DEFAULT_PRECIS which defines the default clipping behavior. cQuality -->Specifies the output quality. The output quality defines how carefully GDI must attempt to match the logical-font attributes to those of an actual physical font. There are three choices: DEFAULT_QUALITY, PROOF_QUALITY and DRAFT_QUALITY. cPitchAndFamily --> Specifies pitch and family of the font. You must combine the pitch value and the family value with "or" operator. lpFacename A pointer to a null-terminated string that specifies the typeface of the font. The description above is by no means comprehensive. You should refer to your Win32 API reference for more details. invoke SelectObject, hdc, eax mov hfont,eax After we get the handle to the logical font, we must use it to select the font into the device context by calling SelectObject. SelectObject puts the new GDI objects such as pens, brushs, and fonts into the device context to be used by GDI functions. It returns the handle to the replaced object which we should save for future SelectObject call. After SelectObject call, any text output function will use the font we just selected into the device context. RGB 200,200,50 invoke SetTextColor,hdc,eax RGB 0,0,255 invoke SetBkColor,hdc,eax Use RGB macro to create a 32-bit RGB value to be used by SetColorText and SetBkColor. invoke TextOut,hdc,0,0,ADDR TestString,SIZEOF TestString Call TextOut function to draw the text on the client area. The text will be in the font and color we specified previously. The syntax of TextOut is as follows: BOOL TextOut( HDC hdc, // handle of device context int nXStart, // x-coordinate of starting position int nYStart, // y-coordinate of starting position LPCTSTR lpString, // address of string int cbString // number of characters in string ); invoke SelectObject,hdc, hfont When we are through with the font, we should restore the old font back into the device context. You should always restore the object that you replaced in the device context. ____________________________________________________________________________ ______ _____. ____ ``` ._____/\______.________._________. ._\___ |__\_ /. \\ | | | _ | | (_ | __/ |CE , .=|_____/\______| |----)____| |______|______|======[ Win 32 ASM ]===. '===============| :===================================================' Keyboard Input by Iczelion We will learn how a Windows program receives keyboard input. Preliminiary: ------------ Since there's only one keyboard in each PC, all running Windows programs must share it between them. Windows is responsible for sending the key strokes to the window which has the input focus. Although there may be several windows on the screen, only one of them has the input focus. The window which has input focus is the only one which can receive key strokes. You can differentiate the window which has input focus from other windows by looking at the title bar which is highlighted. Actually, there are two main types of keyboard message. You can view a keyboard as a group of keys. For example, if you press the "a" key, Windows sends a WM_KEYDOWN message to the window which has input focus, notifying that a key is pressed. When you release the key, Windows sends a WM_KEYUP message. In this case, you treat a key as a button. Another way to look at the keyboard is that it's a character input device. When you press "a" key, Windows sends a WM_CHAR message to the window which has input focus, telling it that the user sends "a" character to it. In fact, Windows sends WM_KEYDOWN, WM_CHAR, and WM_KEYUP messages to the window which has input focus. The window procedure may decide to process all three messages or only the messages it's interested in. Most of the time, you can ignore WM_KEYDOWN and WM_KEYUP since TranslateMessage function call in the message loop translate WM_KEYDOWN and WM_KEYUP messages to a WM_CHAR message. We will focus on WM_CHAR in this tutorial. Content: ------- ;=======================================================================KEY.ASM include windows.inc includelib user32.lib includelib kernel32.lib includelib gdi32.lib .data ClassName db "SimpleWinClass",0 AppName db "Our First Window",0 char WPARAM 20h ; the character the program receives from keyboard .data? hInstance HINSTANCE ? CommandLine LPSTR ? .code start: invoke GetModuleHandle, NULL mov hInstance,eax invoke GetCommandLine invoke WinMain, hInstance,NULL,CommandLine, SW_SHOWDEFAULT invoke ExitProcess,eax WinMain proc hInst:HINSTANCE,hPrevInst:HINSTANCE,CmdLine:LPSTR,CmdShow:SDWORD LOCAL wc:WNDCLASSEX LOCAL msg:MSG LOCAL hwnd:HWND mov wc.cbSize,SIZEOF WNDCLASSEX mov wc.style, CS_HREDRAW or CS_VREDRAW mov wc.lpfnWndProc, OFFSET WndProc mov wc.cbClsExtra,NULL mov wc.cbWndExtra,NULL push hInstance pop wc.hInstance mov wc.hbrBackground,COLOR_WINDOW+1 mov wc.lpszMenuName,NULL mov wc.lpszClassName,OFFSET ClassName invoke LoadIcon,NULL,IDI_APPLICATION mov wc.hIcon,eax mov wc.hIconSm,0 invoke LoadCursor,NULL,IDC_ARROW mov wc.hCursor,eax invoke RegisterClassEx, addr wc invoke CreateWindowEx,NULL,ADDR ClassName,ADDR AppName,\ WS_OVERLAPPEDWINDOW,CW_USEDEFAULT,\ CW_USEDEFAULT,CW_USEDEFAULT,CW_USEDEFAULT,NULL,NULL,\ hInst,NULL mov hwnd,eax invoke ShowWindow, hwnd,SW_SHOWNORMAL invoke UpdateWindow, hwnd .WHILE TRUE invoke GetMessage, ADDR msg,NULL,0,0 .BREAK .IF (!eax) invoke TranslateMessage, ADDR msg invoke DispatchMessage, ADDR msg .ENDW mov eax,msg.wParam ret WinMain endp WndProc proc hWnd:HWND, uMsg:UINT, wParam:WPARAM, lParam:LPARAM LOCAL hdc:HDC LOCAL ps:PAINTSTRUCT mov eax,uMsg .IF eax==WM_DESTROY invoke PostQuitMessage,NULL .ELSEIF eax==WM_CHAR push wParam pop char invoke InvalidateRect, hWnd,NULL,TRUE .ELSEIF eax==WM_PAINT invoke BeginPaint,hWnd, ADDR ps mov hdc,eax invoke TextOut,hdc,0,0,ADDR char,1 invoke EndPaint,hWnd, ADDR ps .ELSE invoke DefWindowProc,hWnd,uMsg,wParam,lParam ret .ENDIF xor eax,eax ret WndProc endp end start ;===========================================================================EOF Let's analyze it: char WPARAM 20h ; the character the program receives from keyboard This is the variable that stores the character received from the keyboard. Since the character is sent in WPARAM of the window procedure, we define the variable as type WPARAM for simplicity. The initial value is 20h or the space since when our window refreshes its client area the first time, there is no character input. So we want to display space instead. .ELSEIF eax==WM_CHAR push wParam pop char invoke InvalidateRect, hWnd,NULL,TRUE This is added in the window procedure to handle the WM_CHAR message. It just puts the character into the variable named "char" and then calls InvalidateRect. InvalidateRect makes a specified rectangle in the client area invalid which forces Windows to send WM_PAINT message to the window procedure. Its syntax is as follows: BOOL InvalidateRect( HWND hWnd, // handle of window with changed update region CONST RECT * lpRect, // address of rectangle coordinates BOOL bErase // erase-background flag ); lpRect is a pointer to the rectagle in the client area that we want to declare invalid. If this parameter is null, the entire client area will be marked as invalid. bErase is a flag telling Windows if it needs to erase the background. If this flag is TRUE, then Windows will erase the backgroud of the invalid rectangle when BeginPaint is called. So the strategy we used here is that: we store all necessary information about how to paint the client area and generate WM_PAINT message to paint the client area. Of course, the codes in WM_PAINT section must know beforehand what's expected of them. This seems a roundabout way of doing things but it's the way of Windows. Actually we can paint the client area during processing WM_CHAR message by calling GetDC and ReleaseDC pair. There is no problem there. But the fun begins when our window needs to repaint its client area. Since the codes that paint the character are in WM_CHAR section, the window procedure will not be able to repaint our character in the client area. So the bottom line is: put all necessary data and codes that do painting in WM_PAINT. You can send WM_PAINT message from anywhere in your code anytime you want to repaint the client area. invoke TextOut,hdc,0,0,ADDR char,1 When InvalidateRect is called, it sends a WM_PAINT message back to the window procedure. So the codes in WM_PAINT section is called. It calls BeginPaint as usual to get the handle to device context and then call TextOut which draws our character in the client area at x=0, y=0. When you run the program and press any key, you will see that character echo in the upper left corner of the client window. And when the window is minimized and maximized again, the character is still there since all the codes and data essential to repaint are all gathered in WM_PAINT section. ____________________________________________________________________________ ::::::::::.___ . ``` ::::::::::| _/__. |__ ____ . __. ____ ____ __. \\ :::::: |____ | __/_ _\_ (.___| .___) |__\_ (._) /___) | , ::::::::::/ / | \ | - | \ | - | - | \/| - | .=:::::::::/______|_____|_____| (___|_____|______|____|_____|===============. '=::::::::::==================| . ____ | (____====[ The C Standard lib ]===' :::::::::: | |------| - | :::::::::: | |______|______|CE . : C string functions: introduction, _strlen by Xbios2 I. INTRODUCTION --------------- Beware: this is going to be long... String handling in assembly is - anyway - a difficult subject. There are few string-oriented x86 opcodes, and most of them are slow. There is not a standard library providing even basic functions. There is no string specific syntax in assembly, like C's printf('hello world') or, even worse, BASIC's a$=b$+'hello'. In a few words, if easy string-related programming is your goal, maybe you should consider PERL, or another text-manipulation language. Yet, string functions are really needed, since almost any program in assembly uses text for I/O. (An alternative to this would be using animated paper-clips to communicate with the user :)). Furthermore, coding those functions in assembly allows for smaller and faster functions. Actually many of the string functions in C _were_ written in assembly (e.g. strlen, strcat, strcpy, etc). Those can be divided in two categories: -'Traditional' functions, using the x86 string instructions -'Modern' functions, which run faster by being Pentium-optimized Borland C++ 4.02 and KERNEL32.DLL only have traditional functions. Borland's C++ Builder v1.0 (once given free as a demo) includes both types. MSVCRT.DLL (version 5) contains 'modern' versions. The three main aspects considered in these articles (and generally when compar- ing different versions of the same function) are speed, size and common sense. 'Common sense' indicates how easy it is to understand the way a function operates by reading the source code, how 'elegant' the code is. In a library module distributed as a binary (in a 'static' reuse of code), common sense is not important. It becomes important when the source code is distributed too, because it allows 'dynamic' reuse. 'Elegant' code can be easily optimized for specific needs or expanded to become a more general function. 'Size' is, obviously, the size of the resulting code. Besides creating smaller files, small size has two interesting 'side-effects'. It (usually) creates more elegant code and faster code (it decreases k, but it usually increases l (for an explanation of k and l see 'speed'). For very small functions like strlen it has the added advantage of allowing the code to be inlined without wasting too much space, thus decreasing k even more. 'Speed' indicates the number of cycles needed to execute the function. For simple string functions the number of cycles needed can be expressed as c=k+l*n where c is the total number of cycles, k is the number of cycles needed to 'prepare' the function, l is the number of cycles needed to process each chara- cter and n is the number of characters in the string. It is obvious that small values of c mean faster execution. In order to compare two versions of a function that run at speeds of c1=k1+l1*n and c2=k2+l2*n the ratio of c1/c2 is calculated: c1 k1+l1*n r=----=--------- c2 k2+l2*n if r=1 then both versions run at the same speed. if r>1 then version 2 is faster. if r<1 then version 1 is faster. Simple maths prove that: 1. When n becomes infinite, r becomes equal to l1/l2. Especially if l1=l2, then r=1 2. If k1<k2 but l1>l2, c1<c2 (version 1 is faster) if n<(k2-k1)/(l1-l2). Point 1 means that for long strings speed is (almost) independent of the value of k. Especially if l1=l2 both versions will run at (almost) the same speed. Point 2 means that for small strings k strongly affects the value of c. For those of you that are fed up with maths, here is a simple example that demonstrates what I've been trying to say all this time :) If version 1 runs at c1=10+3*n and version 2 at c2=30+1*n then: -For strings up to 9 chars version 1 is faster -For strings of 10 chars both versions run at the same speed -For strings of 11 or more version 2 is faster -For strings of 50 chars, version 2 is 2x faster than version 1 -For strings of 770 chars version 2 is 2.9x faster than version 1 The problem is that none of the above versions can be classified as better than the other. Think of the parser of a compiler. It receives as input lines from a text file, which are strings longer than 10 characters, but also has to deal with tokens, which are short strings (in an assembler, three-char tokens are very common). Keep in mind that, while l depends only on the method used to implement the function, k also depends on the 'push arg/call/prepare stack/resore stack/ ret/get arg' times. So if n is low, overall speed can be increased by inlining the code, thus subtracting from k the time needed to call the function. Well, I think you've had enough. Let's see all this stuff in practice. II. THE _STRLEN FUNCTION ------------------------ Attention: especially for _strlen, ALL versions I have either written or found in libraries will be explained. This means you'll get source code for 8 functions... size_t strlen(const char *s); Calculates the length of a string. strlen returns the number of characters in s, not counting the null-terminating character. _strlen is the simplest of the string functions. The 'traditional' way to implement it is through 'repne scasb'. BC 4.02 implements it as: ; ------------ version 1 ------------ ; Borland C++ 4.02 ; 25 bytes ; c=27+4*n _strlen proc near push ebp mov ebp, esp push edi mov edi, [ebp+8] mov ecx, -1 xor al, al cld repne scasb not ecx lea eax, [ecx-1] pop edi pop ebp retn _strlen endp ; ----------------------------------- A shorter, and a bit faster version of this would be: ; ------------ version 2 ------------ ; Improved 'repne scasb' ; 18 bytes ; c=21+4*n _strlen proc near xor eax, eax push edi mov edi, [esp+8] or ecx, -1 repne scasb sub eax, 2 pop edi sub eax, ecx retn _strlen endp ; ----------------------------------- The win32 API also includes a strlen function, called lstrlenA. It is based on 'repne scasb' as well, but you are _strongly_ advised to avoid it. It runs at c=56+4*n cycles. The most 'common sense' function (IMHO) is also the smallest: ; ------------ version 3 ------------ ; Elegant and very small ; 15 bytes ; c=27+4*n ; k is so big because we have a retn immediately after a jump ; if a nop is added between those two, k drops to 13 _strlen proc near or eax, -1 mov ecx, [esp+4] loop1: inc eax cmp byte ptr [ecx+eax], 0 jnz short loop1 ; nop retn _strlen endp ; ----------------------------------- Which gets a little less elegant but faster if tweaked a little: (The trick is that the carry flag is set by the 'cmp' instruction if the byte read is 0, else it is cleared. The 'inc' instruction doesn't affect the carry flag). ; ------------ version 4 ------------ ; Very small and faster than repne scasb ; 15 bytes ; c=12+3*n _strlen proc near mov ecx, [esp+4] xor eax, eax loop1: cmp byte ptr [ecx+eax], 1 inc eax jnc short loop1 dec eax retn _strlen endp ; ----------------------------------- But it gets even better as inlined code, as a macro: ; ------------ version 4.5 ------------ ; Very small, extremely elegant macro ; 10 bytes ; c=8+3*n strlen macro srcreg, cntreg xor cntreg, cntreg cmp byte ptr [srcreg+cntreg], 1 inc cntreg jnc $-5 dec eax endm ; ----------------------------------- This macro returns in cntreg the length of the string at srcreg. It uses no other registers, srcreg is unchanged, it is only 10 bytes long and it runs at a speed of only 8+3*n cycles. It also returns its value in any register, without altering the other registers. Suppose we need in ecx the length of the string at esi. The following code: push esi call _strlen pop ecx ; restore stack mov ecx, eax takes 9 bytes, only one less than the macro version. Plus, of course, the at least 15 bytes of code in _strlen. Another 'elegant' version, which is also small and much faster is the following: ; ------------ version 5 ------------ ; Elegant, small and fast ; 16 bytes ; c=12+2*n _strlen proc near mov ecx, [esp+4] xor eax, eax loop1: mov dl, [ecx+eax] inc eax or dl, dl jnz short loop1 dec eax retn _strlen endp ; ----------------------------------- I believe that version 5 is the best version that could have elegance, speed and small size together. It can also be converted to a macro and inlined to drop to a speed of c=8+2*n (it will use one register more, but this register would anyway be lost if a call to the function was made). It also has what I believe is the smallest value of k. However, it doesn't have the smallest value of l. To reduce the cycles needed, data can be read not byte after byte but dword after dword. Here is a routine given by Agner Fog in his document on Pentium optimization (which you are _strongly_ advised to read): ; ------------ version 6 ------------ ; [by Agner Fog] Very fast ; 61 bytes ; c=18+1*n (not exactly, as data is read in 4 byte blocks) _strlen proc mov eax, [esp+4] ; get pointer mov edx, 7 add edx, eax ; pointer+7 used in the end push ebx mov ebx, [eax] ; read first 4 bytes add eax, 4 ; increment pointer l1: lea ecx, [ebx-01010101h] ; subtract 1 from each byte xor ebx, -1 ; invert all bytes and ecx, ebx ; and these two mov ebx, [eax] ; read next 4 bytes add eax, 4 ; increment pointer and ecx, 80808080h ; test all sign bits jz l1 ; no zero bytes, continue loop test ecx, 00008080h ; test first two bytes jnz short l2 shr ecx, 16 ; not in the first 2 bytes add eax, 2 l2: shl cl, 1 ; use carry flag to avoid a branch pop ebx sbb eax, edx ; compute length ret _strlen endp ; ----------------------------------- The only problem with this routine is that it expects the string to be aligned on a 4 byte boundary. If the string is misaligned, the speed drops to c=24+1.75*n. In the extreme case that the string is misalinged AND it ends on a page boundary, the function will cause an access violation error. The fastest version (I have found) is the one in the Borland C++ builder library: ; ------------ version 7 ------------ ; [C++ Builder, slightly modified] Fastest ; 88 bytes ; c=20+0.75*n (not exactly, see notes) _strlen proc near mov eax, [esp+4] test al, 3 jnz short unalgn loop1: mov edx, [eax] add eax, 4 mov ecx, edx sub edx, 1010101h and edx, 80808080h jz short loop1 not ecx and edx, ecx jz short loop1 test dl, dl jnz short minus4 test dh, dh jnz short minus3 test edx, 0FF0000h jnz short minus2 jmp short minus1 unalgn: add eax, 3 xor cl, cl cmp byte ptr [eax-3], cl jz short minus3 cmp byte ptr [eax-2], cl jz short minus2 cmp byte ptr [eax-1], cl jz short minus1 and al, 0FCh jmp short loop1 minus4: dec eax minus3: dec eax minus2: dec eax minus1: mov ecx, [esp+4] dec eax sub eax, ecx retn _strlen endp ; ----------------------------------- Actually, the original version is 90 bytes long. I have only changed the 'unalgn:' block, to reduce k if the string is unaligned. This function works well even on unaligned strings, as it first check the unali- gned bytes, and the proceeds in the main loop with aligned data (for unaligned strings it runs at c=31+0.75*n cycles). Since all dwords read are aligned, unaligned strings that end on page boundaries don't cause problems. This function is not always the fastest. If the string contains characters in the range 128 to 255 (i.e. signed bytes) the speed drops. If all the characters are signed (actually if at least one byte in every dword read), the speed becomes c=1.25*n. Of course most of the time (especially for english text) this is not the case, but if you have to process strings in another language that has characters in the range 128 to 255, it is a bit slower. Another fast version of strlen can be found in MSVCRT.DLL (the one I checked is version 5.00.7303). It runs at c=20+1*n, and handles unaligned strings almost like the Builder version. Misaligned strings give a value of k ranging from a minimum of 25 to a maximum of 52. What the MSVCRT function lacks completely is common sense and small size. In fact it is 144 bytes long and it is divided in two different pieces of the dll's code, causing most jumps to be in the long form. The main loop MSVCRT uses is good, but the rest of the function isn't. Based on that function, I came up with the following one: ; ------------ version 8 ------------ ; My fast version ; 92 bytes ; c=17+1*n _strlen proc mov eax, [esp+4] xor ecx, ecx loop2: test al, 3 jz loop1 cmp byte ptr [eax], cl jz short ret0 cmp byte ptr [eax+1], cl jz short ret1 cmp byte ptr [eax+2], cl jnz short adjust inc eax ret1: inc eax ret0: sub eax, [esp+4] ret adjust: add eax, 3 and eax, 0FFFFFFFCh loop1: mov edx, [eax] mov ecx, 81010100h sub ecx, edx add eax, 4 xor ecx, edx and ecx, 81010100h jz loop1 sub eax, [esp+4] shr ecx, 9 jc minus4 shr ecx, 8 jc minus3 shr ecx, 8 jc minus2 minus1: dec eax ret minus4: sub eax, 4 ret minus3: sub eax, 3 ret minus2: sub eax, 2 ret _strlen endp ; ----------------------------------- This one has the advantage of having k=17 for aligned strings and k=24 to 25 for misaligned ones. The only question left to be answered is: 'Which version should we prefer?'. If your program does not include string handling in it's time-critical parts, I higly recommend either versions 5 or 4.5 (the inlined macro). As said before, the size overhead of the inlined version is very small (if any), and it has another advantage: it keeps the source code more readable, as it only involves the needed registers (input and output) in one single line. If string handling IS time-critical, I recommend version 8 (of course, it's mine... :)). Even then, the average size of the handled strings is to be consi- dered, as well as the percentage of unaligned strings. For unaligned strings of 16 or less characters, the fastest version would be an inlined version 5, running at c=8+2*n. The choice is yours.... ____________________________________________________________________________ ::::::::::.___ . ``` ::::::::::| _/__. |__ ____ . __. ____ ____ __. \\ :::::: |____ | __/_ _\_ (.___| .___) |__\_ (._) /___) | , ::::::::::/ / | \ | - | \ | - | - | \/| - | .=:::::::::/______|_____|_____| (___|_____|______|____|_____|===============. '=::::::::::==================| . ____ | (____====[ The C Standard lib ]===' :::::::::: | |------| - | :::::::::: | |______|______|CE . : C string functions: _strcpy by Xbios2 I. INTRODUCTION --------------- C syntax: char *strcpy(char *dest, const char *src); _strcpy copies string src to dest, stopping after the terminating null character has been moved, and returns dest. The 'traditional' way to do this is with the 'rep movs' instruction. BC 4.02 and kernel32 use it. The problem is that it is rather slow (BC _strlen takes 53+5.5*n cycles, lstrlenA takes 74+5.5*n cycles, and optimizing their code leads to 46+5.5*n cycles wher n the number of chars, see part I of these articles). This is because even though the 'rep movs' instruction is fast it needs to know the number of bytes to copy in advance. So, _strcpy includes a _strlen function before the actual copying, which is implemented through 'repne scasb', a slow instruction. In this article we will examine two 'modern' _strcpy functions, found in MSVCRT.DLL and Borland C++ Builder library. Those functions are (supposed to be) optimized for Pentium processors. If you're not familiar with optimization for Pentium processors, I suggest you read the document on Pentium optimization by Agner Fog (http://announce.com/agner/assem). II. STRCPY IN MSVCRT -------------------- ; c=39+1.75*n / 146 bytes strcpy proc push edi mov edi, [esp+8] ; dest mov ecx, [esp+0Ch] ; src test ecx, 3 jz short loop1 algn: mov dl, [ecx] inc ecx test dl, dl jz short one mov [edi], dl inc edi test ecx, 3 jnz short algn loop1: mov edx, -81010101h mov eax, [ecx] add edx, eax xor eax, -1 xor eax, edx mov edx, [ecx] add ecx, 4 test eax, 81010100h jz short nozero test dl, dl jz short one test dh, dh jz short two test edx, 0FF0000h jz short three test edx, 0FF000000h jz short four nozero: mov [edi], edx add edi, 4 jmp short loop1 ;... in the DLL, there is code here, not used by strcpy one: mov [edi], dl mov eax, [esp+8] pop edi retn two: mov [edi], dx mov eax, [esp+8] pop edi retn three: mov [edi], dx mov eax, [esp+8] mov byte ptr [edi+2], 0 pop edi retn four: mov [edi], edx mov eax, [esp+8] pop edi retn strcpy endp This procedure does the following: 1. Read arguments (src, dest) from stack 2. Check if src is aligned on a 4 byte boundary If not, copy byte after byte until src gets aligned 3. Loop Read one dword from src Test if there is a zero byte in the dword If no zero, copy dword to dest, loop back 4. Copy the remaining bytes 5. Return with dest in eax Actually the code above compiles to 130 bytes. The extra 16 bytes are added because between the loop and the 'one:' label there is the strcat function. So 4 conditional jumps take the 6-byte form, not the 2-byte one. This function takes 39+1.75*n. This means that the loop takes 7 cycles to execute (since each time the loop runs, it copies 4 bytes). Here is the explan- ation of the loop (U and V refer to the pipe the commands run in). loop1: mov edx, -81010101h ; U 1st mov eax, [ecx] ; V add edx, eax ; U 2nd xor eax, -1 ; V xor eax, edx ; U 3rd mov edx, [ecx] ; V add ecx, 4 ; U 4th test eax, 81010100h ; V jz short nozero ; U 5th ... nozero: mov [edi], edx ; U 6th add edi, 4 ; V jmp short loop1 ; U 7th The problem here is that both jumps run in the U pipe so they will not pair. Generally it's better to have an even number of instructions in each block of code. Just by moving one instruction this code will run in 6 cycles (i.e. 39+1.5*n cycles): loop1: mov edx, -81010101h ; U 1st mov eax, [ecx] ; V add edx, eax ; U 2nd xor eax, -1 ; V xor eax, edx ; U 3rd mov edx, [ecx] ; V test eax, 81010100h ; U jz short nozero ; V 4th ... nozero: mov [edi], edx ; U 5th add ecx, 4 ; V <<< moved instruction add edi, 4 ; U jmp short loop1 ; V 6th Everything pairs perfectly, and so 12 instructions only take 6 cycles. Pay attention to one thing: if 'add ecx, 4' and 'add edi, 4' are swapped, we get back to 7 cycles per loop, even though the pairing is the same. This is because the 'mov eax, [ecx]' instruction uses ecx to access memory, but ecx was changed in the previous clock cycle (add ecx, 4 / jmp short loop1). This causes an AGI stall (Address Generation Interlock), which wastes one cycle. As you 've noticed, _strcpy makes sure that the data read from src is aligned, because reading aligned dwords is faster. If src is aligned, the test only takes one cycle more, so it shouldn't trouble us. Yet, aligning src is not always a good idea. Suppose that you have an unaligned string and want to copy it in a buffer that is aligned. So what happens is that by aligning src we misalign dest. The problem is that misaligned writes are more expensive in cycles than misaligned reads. So _strcpy should either align dest or leave everything untouched. (not aligning src introduces an extremely small possibility of an access violation error, read section V below for details). III. STRCPY IN C++ BUILDER -------------------------- ; c=66+1.75*n / 146 bytes _strcpy proc push ebp mov ebp, esp mov ecx, [ebp+0Ch] ; src mov edx, [ebp+8] ; dest mov eax, ecx and eax, 3 jmp algn[eax*4] ; ------------------------------------ algn dd offset loop1 dd offset algn3 dd offset algn2 dd offset algn1 ; ------------------------------------ algn3: mov al, [ecx] or al, al jz short one mov [edx], al add ecx, 1 add edx, 1 algn2: mov al, [ecx] or al, al jz short one mov [edx], al add ecx, 1 add edx, 1 algn1: mov al, [ecx] or al, al jz short one mov [edx], al add ecx, 1 add edx, 1 loop1: mov eax, [ecx] or al, al jz short one or ah, ah jz short two test eax, 0FF0000h jz short three test eax, 0FF000000h jz short four mov [edx], eax add ecx, 4 add edx, 4 jmp short loop1 four: mov [edx], eax mov eax, [ebp+arg_0] pop ebp retn three: mov [edx], ax mov byte ptr [edx+2], 0 mov eax, [ebp+arg_0] pop ebp retn two: mov [edx], ax mov eax, [ebp+arg_0] pop ebp retn one: mov [edx], al mov eax, [ebp+arg_0] pop ebp retn _strcpy endp This function runs at 66+1.75*n cycles. The aligning is done in an awful way. If the aligning code is removed, we gain 39 cycles. By not using ebp, we save 4 more cycles. The loop takes 7 cycles, as shown below: loop1: mov eax, [ecx] ; U 1st or al, al ; U 2nd jz short one ; V or ah, ah ; U 3dr jz short two ; V test eax, 0FF0000h ; U 4th jz short three ; V test eax, 0FF000000h ; U 5th jz short four ; V mov [edx], eax ; U 6th add ecx, 4 ; V add edx, 4 ; U 7th jmp short loop1 ; V The first two instructions don't pair because 'or al,al' accesses a register changed by the previous instruction. Anyway, there are 13 instructions, not 12 as in the MSVCRT function. So, one instruction has to be removed. This instruct- ion is the unconditional jump (generally unconditional jumps can be avoided). Notice that if we get through to 'jz short four', the 'mov [edx], eax' instruct- ion will be executed anyway. So we rewrite the code as: loop1: mov eax, [ecx] ; U 1st inc ecx ; V or al, al ; U 2nd jz short one ; V or ah, ah ; U 3dr jz short two ; V test eax, 0FF0000h ; U 4th jz short three ; V mov [edx], eax ; U 5th add edx, 4 ; V shr eax, 24 ; U 6th jnz short loop1 ; V Notice that we use 'shr eax, 24' instead of 'test eax, 0FF000000h', because we no longer need the value in eax, and the 'shr' form is two bytes shorter. A modified version of this strcpy is the best I could come up with: IV. A FAST STRCPY ----------------- ; c=25+1.5*n / 80 bytes _strcpy proc mov ecx, [esp+8] ; src mov edx, [esp+4] ; dest push edx ; save return value test edx, 3 ; check if dest is aligned jz short loop1 algn: mov al, [ecx] inc ecx mov [edx], al inc edx test al, al jz short return test edx, 3 jnz short algn loop1: mov eax, [ecx] add ecx, 4 or al, al jz short one or ah, ah jz short two test eax, 0FF0000h jz short three mov [edx], eax add edx, 4 shr eax, 24 jnz short loop1 pop eax retn three: mov byte ptr [edx+2], 0 two: mov [edx], ax return: pop eax ; restore return value retn one: mov [edx], al pop eax ; restore return value retn _strcpy endp This function aligns dest instead of src, which, as discussed above, is faster. It can run at one cycle less, by reading the return value directly from the stack and not push/popping it, but it would take 8 bytes of code more. Slight modifications to this routine give us three other functions: - _stpcpy is exactly the same as _strcpy, only that it returns a pointer to the ending null char copied in dest - _strdup is a combination of _strlen, _malloc and _strcpy - _strcat is a combination of _strlen and _strcpy The MSVCRT _strcat actually counts the chars in src and then jumps into the code of _strcpy to perform the actual copying. Strangely, _strdup is implemented through 'repne scas' and 'rep movs'. It even has two 'repne scasb' instructions, one to calculate the length of the string to pass to the malloc function, and one to calculate the length of the string to copy, even though these two values are the same. So, even if coding in MS C++, using _strlen/_malloc/_strcpy is faster than using _strdup. V. IS IT FOOLPROOF? ------------------- To be honest, no. But programming is hardly ever so... First of all, any version of strcpy (of any function, generally) will fail if it tries to read or write data in a page it doesn't have access. This is hardly the case, but it can happen: -If there is no NULL character between the address of src and the last byte in the valid page. -If src is longer than the distance between dest and the last valid byte. The first case is extremely unlikely, because even if src was corrupt and had no terminating NULL, one is very likely to be encountered somewhere. The second case is also unlikely, and it means that the programmer didn't allocate enough space for dest. Anyway, corrupt data or not enough allocated space even if they don't cause an access violation, they cause problems. But the problem was created by the programmer, not the function. Yet there are also two cases where the strings ARE ok but an access violation occurs. These cases appear only on 'optimized' versions of strcpy, not on the 'rep movsb' version. The first case would appear in a strcpy function that doesn't align src (it either aligns dest or leaves both unchanged), if src is not aligned and it ends on a page boundary. Then the last read operation would try to read one to three bytes on a page it doesn't have access. This doesn't happen on the Builder and MSVCRT functions and happen on the one I give. Yet, it is really unlikely to happen, and aligning dest is faster. The second case would appear if dest points to a character in src (including the terminating NULL). What happens is that the NULL of src (and any other found) are overwritten, so no NULL is found and we finally get an access violation. This doesn't happen in the 'traditional' versions, because we only copy strlen(src) bytes. But even in those versions the last character copied wouldn't be a NULL, so dest wouldn't be a proper string. ____________________________________________________________________________ ____ . : . ``` | |___ : | ---- _____ ______ _____ | \\ | | |___|_ |______\ |---/ ._____/\____) -- (_) / |____) | , | ' | / | \___ __/ | | \/| | - |CE .==|________| (______|_______/ \==|_____/\__________|____| |______|===. '=========| |===========/----|___\==================[ The Unix World ]===' : X-Windows in Assembly Language: Part II by mammon_ OK, let's face it: you've seen the tedium of XLib, one *has* to use widgets in order to get any programming done in XWindows. 'But this is assembly langauge', the masochist might point out. 'Aren't widgets a little Visual-Basicy?' Not in the slightest. A widget is simply a C++ class exported for use --much like the windows API functions, only a little more object oriented...maybe a good comparison would be MFC or VCL. Xt, or 'X toolkit Intrinsics', is the interface that widget sets [such as Athena, Qt or GTK] use to interface with XLib. The Xt include files are in /usr/X11R6/include/X11, its libraries are in /usr/X11R6/lib, and its exported functions are all prefixed with "Xt". For the following examples I will be using the Atehna widget set, which is supplied with XFree86. The include files for Athena are in /usr/X11R6/include/Xaw and the libraries are in /usr/X11R6/lib. A barebones Xt/Athena app in C would run as follows: //====================================================================-xthell.c #include <X11/Intrinsic.h> #include <X11/StringDefs.h> #include <X11/Xaw/Command.h> void Quit(w, client_data, call_data) //CallBack function Widget w; XtPointer client_data, call_data; { exit(0); } main(argc,argv) int argc; char **argv; { XtAppContext app_context; Widget ShellWidge, ButtnWidge; ShellWidge = XtVaAppInitialize( &app_context, "toplevel", NULL, 0, &argc, \ argv, NULL, NULL); ButtnWidge = XtVaCreateManagedWidget("hellbutton", commandWidgetClass, \ ShellWidge, NULL); XtAddCallback(ButtnWidge, XtNcallback, Quit, 0); XtRealizeWidget(ShellWidge); XtAppMainLoop(app_context); } // compile with cc -o xthell xthello.c -lXmu -lXaw -lXt -lX11 -L/usr/X11R6/lib //=========================================================================-EOF Pretty ugly, eh? This boils down to the following steps: 1) Create the top-level 'Canvas' widget [the window] ShellWidge = XtVaAppInitialize( &app_context, "toplevel", ..... ) 2) Create the button widget ButtnWidge = XtVaCreateManagedWidget("hellbutton", ..... ) 3) Register a callback for the button XtAddCallback(ButtnWidge, XtNcallback, Quit, 0); 4) Show the top-level widget XtRealizeWidget(ShellWidge); 5) Transfer control to the Xt message loop XtAppMainLoop(app_context); The most interesting thing about Xt programming is in fact the callbacks. Instead of writing a message processing loop, you register a callback function for each widget and then pass control to Xt, which processes the messages for you and dispatches each message to the appropriate callback function. The call- back receives a pointer to the widget that sent the message [the same argument as passed to XtAddCallback], a client_data pointer [the last argument passed to XtAddCallback, used to pass data from the main routine to the callback], and a call_data pointer, which contains information from the message [such as cursor or scrollbar position]. The calls themselves are pretty straightforward: XtVaAppInitialize initializes [sic] the X app and takes as its arguments a pointer to an XtAppContext structure, the class name of the application, application-specific command line options {args 3 and 4], argc, argv, a default resource-settings file, and a NULL to terminate the arguments list [the XtVaAppInitialize function actually takes a number of different parameters]; it returns a handle to the 'canvas' or 'top-level' widget, on which all other widgets will be painted. XtVaCreateManagedWidget is used to create any of the Xt widgets [Athena, GTK, etc], and takes as its parameters the instance name, the widget class, the parent widget, and a NULL to terminate the arguments list; it returns a pointer to the created widget. XtAddCallback is used to register a callback function with a specific widget; it takes as its parameters a pointer to the Widget, the callback type, the function being registered, and a pointer to client_data which will be passed to the callback. XtRealizeWidget is simply used like ShowWindow in Windows; it takes a single parameter which is the widget to 'show'; it displays that widget and its children. XtAppMainLoop takes the current application context [which was filled with the XtVaAppInitialize call] and turns control over to the Xt message processing loop. Note that the program does not have to return; in this example, the exit call is placed in the callback function. Here is the same application written for NASM: ;===================================================================-xthell.asm BITS 32 GLOBAL main GLOBAL bail EXTERN XtVaAppInitialize EXTERN XtVaCreateManagedWidget EXTERN XtAddCallback EXTERN XtRealizeWidget EXTERN XtAppMainLoop EXTERN commandWidgetClass EXTERN exit SECTION .data AppContext DD 0 ShellWidge DD 0 ButtnWidge DD 0 ARGC times 128 DB 0 ClassName DB "toplevel",0 ButtnName DB "hellbutton",0 XtNcallback DB "callback",0 ;XtNcallback SECTION .text bail: pop eax ; Xt_Pointer call_data pop ebx ; Xt_Pointer client_data pop ecx ; Xt_Pointer widget push dword 0 call exit ;-------------------------- main main: mov eax, esp push dword 0 ;Number of Args push dword 0 ;Args push dword 0 ;Fallback Resources push dword 0 ;argv push dword ARGC ;&argc push dword 0 ;Number of Options push dword 0 ;Options Array push dword ClassName ;Class Name (String) push dword AppContext ;Application Context (Ptr) call XtVaAppInitialize add esp, 36 mov [ShellWidge], eax push dword 0 push eax ;Button parent (ShellWidge) push dword [commandWidgetClass] ;Button widget type push dword ButtnName ;Button class name call XtVaCreateManagedWidget add esp, 16 mov [ButtnWidge], eax push dword 0 ; client_data push dword bail ;CallBack function push dword XtNcallback ; callback type push eax ;CallBack widget (ButtonWidge) call XtAddCallback add esp, 16 push dword [ShellWidge] ;Widget Handle call XtRealizeWidget add esp, 4 push dword [AppContext] call XtAppMainLoop add esp, 4 ret ;==========================================================================-EOF This can be compiled with the following commands: nasm -f elf xthell.asm gcc -o xthell xthell.o -lXaw -lXt -lX11 -L/usr/X11R6/lib Most of the operation is the same as the C file; naturally you must push dword 0's instead of NULLs...and do not forget to push the arguments in reverse order and to clean up the stack afterwards; this is C after all and not stdcall is used in Windows. You will have to study up on Athena to learn what the names of the various widgets are ... I found it helpful to use grep extern /usr/X11R6/include/Xaw/* for a general overview. Note that the class names are strings in assembly; also each of the various 'handles' [widgets, contexts, etc] is simply defined with a DD 0 -- your generic 32-bit variable. The Callback type turned out to be a string defined in the Xt header files; I simply recreated it above. Another interesting gemis the need to call 'exit' rather than simply using a 'ret' as you would in console mode; the latter causes segmentation faults, most likely due to the XtAppMainLoop call. In addition you *must* provide a pointer to ARGC whether you check the command line or not; hence the 'ARGC: DB 128'. In case you didn't notice, the Xt asm example is huge and clunky, with a lot of not-so-obvious variable definitions. Having included a lengthy introduction to NASM macros in this issue, I took the opportunity to create an xt.mac file which will take some of the burden off of experimenting with small Xt apps. The InitXt and RegisterCallback macros probably are not ready for prime-time just yet, but they will do for testing purposes. ;=======================================================================-xt.mac %macro CLASS 2 %1: DB %2,0 %endmacro %macro WDGTPTR 1 %1: DD 0 %endmacro %macro CONTEXT 1 %1: DD 0 %endmacro %macro CHARSTR 2 %1: DB %2,0 %endmacro %define WIDGET EXTERN %define XLibAPI EXTERN %define XtAPI EXTERN %define PUBLIC GLOBAL %define NULL dword 0 %define TERM_VARARGS dword 0 %macro InitXt 2 SECTION .data CONTEXT AppContext CLASS XtShell, "XtShell" SECTION .text EXTERN XtVaAppInitialize push dword 0 ;Number of Args push dword 0 ;Args push dword 0 ;Fallback Resources push dword 0 ;argv push dword %2 ;&argc push dword 0 ;Number of Options push dword 0 ;Options Array push dword XtShell ;Class Name (String) push dword AppContext ;Application Context (Ptr) call XtVaAppInitialize add esp, 36 mov [%1], eax %endmacro %macro XtMsgLoop 0 EXTERN XtAppMainLoop push dword [AppContext] call XtAppMainLoop add esp, 4 %endmacro %macro RegisterCallback 1 SECTION .data CBType: DB "callback",0 SECTION .code push NULL ; push dword %1 ;CallBack function push dword CBType ; push eax ;CallBack parent (ButtonWidge) call XtAddCallback add esp, 16 %endmacro %macro CALLBACK 1 SECTION .data Call_Data_%1: DD 0 Client_Data_%1: DD 0 Widget_%1: DD 0 GLOBAL %1 SECTION .text %1: pop eax mov [Call_Data_%1], eax pop ebx mov [Client_Data_%1], ebx pop ecx mov [Widget_%1], ecx %endmacro %define ENDCALLBACK nop %macro ENTRYPOINT 1 GLOBAL %1 %1: %endmacro ;==========================================================================-EOF Most of the macro file should be readily apparent if you are familiar with the NASM macro facility. I did take the opportunity to clean up the callback function, so that the parameters to the callback are saved in variables, but for the most part it does the same as the equivalent code in the preceding asm example. Now the xthell.asm sample will look as follows: ;===================================================================-xthell.asm BITS 32 %INCLUDE "xt.mac" ;========================================================XTRN===== XtAPI XtVaCreateManagedWidget XtAPI XtAddCallback XtAPI XtRealizeWidget WIDGET commandWidgetClass EXTERN exit ;========================================================DATA===== SECTION .data ;------------ WDGTPTR ptrShell WDGTPTR ptrButton CLASS XHELL, "XHell" CLASS HellButton, "HellButton" CallbackType DB "callback",0 ;XtNcallback ARGC times 128 DB 0 ;========================================================CODE===== SECTION .text ;------------ CALLBACK bail push dword 0 call exit ENDCALLBACK ENTRYPOINT main InitXt ptrShell, ARGC push TERM_VARARGS push eax ;Button parent (ShellWidge) push dword [commandWidgetClass] ;Button widget type push dword HellButton ;Button class name call XtVaCreateManagedWidget add esp, 16 mov [ptrButton], eax RegisterCallback bail push dword [ptrShell] ;Widget Handle call XtRealizeWidget add esp, 4 XtMsgLoop ret ;==========================================================================-EOF Much prettier and hey, only twice as long as the C version! ;) Next issue I will dwell on Xt/Athena a little longer and come up with some more practical methods of automating the coding process. ____________________________________________________________________________ ____ ___ _____ _| |_ ____ . ``` .__\ /__ ______ _) /.\ _/__ ___ ______\_ (_. | \\ | \/ | | \/ | \ | | | - | |CE , .==|________|______|______|_______|_______|_______| |======================. '=================================================| :=[ Virtual Machines ]=' An Intro to the Java Virtual Machine by Cynical Pinnacle For awhile C/C++ reigned supreme and nothing challenged it but then along comes Java, creating a splash, and causing outright corporate warfare to claim right of ownership. Strangely enough the result of this war has not been dead bodies but buckets and buckets of API's all given away for free. Just stop by and take a look at Java's Official Website (http://java.sun.com) and what do you find a good development kit with compiler, symbolic debugger, disassembler, complete toolkit for creating GUI's, built in support for compression, encrypt- ion, http, ftp, SMTP, POP3, IHMP, and more. Wow! But how can we take advantage of all this? First we have to step back a look at what Java really is. Because one of the main goals of Java is platform independence (both from the chip and the OS). The JVM, which supports Java, has to be both a chip and an OS. If any of us (well lets say us programmers) were to design a chip and a OS in one, we would fill it with features like built in security, automatic dependency resolution and linking, network support, video and audio acceleration, along with more common things such as built in data types (ints, floats, arrays, and objects ), support for local variables, exception handling, support for debugging, and on and on. This is really what Java is because it does all the things I mentioned above and more. This is what Sun has tried to do - design an "Ideal" environ- ment for writing and executing code, or write once run anywhere. But this wealth of features comes at the price of speed of execution and further distance further from the native code of your machine (unless you are running on a real Java machine). And if you are like me the latter hurts as much as the former. Still there are a alot of really appealing things about Java. And so the challenge is to use these appealing features on our own terms. We can do this by programming at a lower level to at least touch the native language of the Java Virtual Machine (Java Assembly language!!) The JVM: I am going to take a programmers view of the JVM and say it is simple because from our perspective it is. But for the sake of completeness I will list the other components of the JVM: Memory Manager: This is the unit responsible for the famous garbage collect- ion and heap management. I say clean up your own garbage. Error/Exception Manager: Handles unforseen conditions. Native Method Support: This is to allow you to call WaitForSingleObject from within Java. It is responsible for loading DLLs, resolving entry points and executing them. Note Java only supports dynamic linking. Threads: Java doesn't have to worry so much about memory because it is all allocated on the stack. Each thread gets its own stack frame (chunk of stack for its personal use). Switching threads in a stack based machine is easy. You just make the threads stack top the machine's stack top and go. Class Loader: This is just like the loaders in NT and 95. It brings up class files from the disk initializes them (headers memory etc) and passes execution to the classes entry point. Security Manager: Want to find out whether or not you can do something. Ask him. Execution Engine: This is where the JVM opcodes are translated into native opcodes. This is the part of the JVM which a low level Java programmer will interact with most. The Execution engine has a much simpler structure than a Pentium. At its heart is a stack where instuctions are executed. Note that the JVM has no registers, which is more a trait of Virtual machines than Java. Basically opcodes and operands are popped off of the stack and executed by the VM bases upon a mapping between Java opcodes and Native opcodes. In addition, there is built in support for local variables (more later). And as mentioned before the stack is symbollically divided up by the JVM. Each method (read method to mean function) gets it own stack frame which is allocated when the method starts and deallocated when the method exits (sound familiar eg x86 push bp -- mov bp, sp .. leave instructions). The Execution engine understands the int, long, float, double, byte, char, short, reference (eg pointer) and the instructions it understands are strongly typed (for example: there is an instruction called iload which loads an integer from a local variable onto the stack (like mov eax, myLocalVariable) but there is also a dload, lload, and fload for doubles, longs, and floats). Now with a little background it is time to learn or burn but first we need some tools. First you will need to download the JDK from http://java.sun.com (I recommend 1.1.x and the current I think is 7) you will also need a Java assembler which is called Jasmin also free at http://www.cat.nyu.edu/meyer/jasmin/ and you will need a good editor (I heartily recommend Visual SlickEdit 3.0 or 4.0 at http://www.slickedit.com ). You can get docs about the JVM from http://www.javasoft.com/ And the best book I have found on the JVM (and Jasmin) is "Java Virtual Machine" by Jon Meyer and Troy Denning. There are several useful tools with the JDK comes a program called javap which is a Java .class file disassembler! With the -c switch it will produce JASM code from any .class file. Note that reversing Java programs is not nearly as hard as x86. Try it. Take some Java .class file you have laying around ( Keep it small so you don't get confused ) and disassemble it then you will see what I mean. I can see a very difficult future for Java Shareware programmers. There are also many other 3rd party tools out there for mainpulating .class files. With all of those tools installed we are ready to write the mandatory "Hello from JASM!" program. First comes the Java Assembler code followed by its Java equivilant. ;*************************************************************** ;Export the class name so Java can find it class public Hello ;Simplest class to derive from super java/lang/Object ; General facts ; .method - means this is a function ; ; public - means you are it is visible externally (low ; level Object Orientedness) ; ; Java always uses explicit paths in Unix notation hence you ; end up with a lot of notation like java/lan/Object. ; java/lang/Object is the path to the superclass. In a lot ; of the functions there is a V, this means it returns void ; the syntax for specifying arguments is strange. I ; recommend reading Jonathan Meyer's Jasmin documentation ; http://www.cat.nyu.edu/meyer/jasmin/ ; This method is called init and is in every class ; all that is done here is to push the contents of local ; variable 0 onto the stack and call the superclass's ; (Object) init method. Local variable 0 is always ; the reference (pointer) to the equivalent of C++'s this .method public <init>()V aload_0 ; This just calls the superclasses init method invokenonvirtual java/lang/Object/<init>()V ; Get out return end method ; Here is the main function which is publically visible ; is static and thus shared by all instances of the class, ; it takes one argument of type [ljava/lang/String, which ; is an array of strings eg: argv**, and returns void hence ; the V. .method public static main([Ljava/lang/String;)V ; Delcare your stack memory usage .limit stack 2 .limit locals 1 ; Push 5 onto the stack and store it in local variable 1 bipush 5 istore_1 LoopTop: ; These next two lines put the parameters for the println ; function onto the stack in the right order. Java uses ; the pascal calling convention (push left to right and ; the callee cleans up). First a reference to the stream ; object doing the work is pushed onto the stack. Next a ; reference to the string to be printed is pushed. ;get the pointer to the stream object and push it getstatic java/lang/System/out Ljava/io/PrintStream; ;get the pointer to the string and push it ldc "Hello from JASM!" ;call println invokevirtual java/io/PrintStream/println(Ljava/lang/String;)V ; These next three lines are the loop condition ; iinc adds -1 to local variable 1. iload_1 pushes ; local variable 1 onto the stack and ifne compares it to ; zero (just like jnz in x86). If it is not equal to zero ; we jump to LoopTop iinc 1 -1 iload_1 ifne LoopTop ; Go home return ;Declare the end of the function end method ; ; Java equivalent: ; ; public class Hello ; { ; public static void main ( String args[] ) ; { ; int i; ; for( i = 0;i < 5; i++ ) ; { ; System.out.println( "Hello from JASM!" ); ; } ; ; } ; ; } ;*************************************************************** To run this JASM sample cut out the stuff between the *'s and save it to a file called Hello.j. Next type jasmin Hello.j. This should generate a Hello.class file. Now type java Hello and you should see the string "Hello from JASM!" printed out 5 times. This is enough to get you started poking around in the JVM and looking a little closer at the .class files you find lying around ;) . More to come? ____________________________________________________________________________ ____ . __ ``` ._| __/___ : | ___ _____ __) |_ _____ \\ | |_____ |____|__ |_______ ______ ______ _) _ |_\_ __)___| __/_____ , | | ( | | __ | __ | ----' | | |_______ |CE .=|_________| /_______|_______| _____| _____|________|____________________|=. '===========| |===========|___|==|___|===================[ snippets ]===' NumFactors by Troy Benoist ;Summary: Routine to determine the number of factors for a 16-bit value ;Compatibility: All DOS versions/8088+ instructions ;Notes: 22 BYTES Input: AX = Value to check for number of factors ; * If CX is 2, check value in AX is prime. mov bx,1 ;Test=1 (Test is the testing value of each theoretical ;factor of AX, from 1-AX). xor cx,cx ;Count=0 (Count is running total with # of factors for AX). ChkFctr: xor dx,dx <--- ;Need to divide DX:AX by BX, but DX is not used-- clear it. push ax | ;Dividing by BX puts quotient in AX, but quotient is not | ;needed, and we need to keep the value to check, so save it. div bx | ;Divide DX:AX by BX. Remainder is in DX. pop ax | ;Restore value to check into AX. cmp dx,0 | ;Is remainder=0? (Did Test divide evenly into check value?) jnz NC -----| | ;If not, Count remains unchanged. inc cx | | ;If so, factor found, so Count=Count+1. NC: inc bx <---| | ;Test=Test+1. cmp bx,ax | ;Is Test greater than check value? jbe ChkFctr___| ;If not, go back and check next Test factor. ______________________________________________. __________________________ ``` | . ``` ._____ ___ ____ ___ ___ ____ : | ______ \\ | __/__| |______\_ (_) | (____\_ ( ______ |________ _) __ |___. , | | | - | | | - | _ / | __ | ----' |CE .==|_______|--)___|______|___|___|_______|--(_______|_______ |___________|===. '=============================================[ issue | _____| challange ]===' |___\ Convert ASCII hex to binary in 6 bytes by mammon_ [and help] The Challenge ------------- Write a routine for converting ASCII hex to binary in 6 bytes. The Solution ------------ Well, actually, I cheated: I found the following text on the internet a few months ago and decided to see if I could beat it: ===================================================================-Asc2hex.txt An efficient algorithm for converting ASCII hex to binary Ken Sykes (72177,141) David Ogle (75676,2612) There is a well-known algorithm for converting a binary number between one and fifteen to its equivalent hex form in ASCII that only requires four assembly language instructions. Assuming that the number to convert is in AL, the following sequence performs the conversion: add al,90h daa adc al,40h daa This instruction sequence is, as far as we know, the shortest (16 cycles) self-contained routine to convert a binary number to hex. Inspired by this code and the fact that a similar routine to convert ASCII hex to binary would be useful, we came up with two algorithms that convert an ASCII hex digit to binary in five assembly instructions or less. The first algorithm takes advantage of an undocumented feature of the 8088. The AAM instruction (Ascii Adjust for Multiply) divides AL by 10, placing the remainder in AL and the quotient in AH. The opcode for AAM is: 0D4h,0Ah. It's no coincidence that the second byte is 10 - the 8088 uses the second byte of the opcode as the divisor! The same rule applies to the AAD (Ascii Adjust for Division) instruction. With this in mind, a conversion routine goes as follows (assuming the ASCII digit is in AL and in the range '0..9,A..F'): sub al,'0' aam 16 aad 9 The only problem is the Microsoft Assembler does not accept this form. By placing the opcodes in data statements, however, the following code will assemble properly: sub al,'0' db 0D4h,10h db 0D5h,09h At three instructions and six bytes of code space, We are reasonably sure this is the shortest self-contained sequence to perform the conversion. The only drawbacks are the use of non-standard opcodes and the execution time (147 cycles!). The second algorithm, loosely-based on this one, relieves these restrictions. The second algorithm makes exclusive use of fast instructions to perform the conversion (again, AL holds the digit to convert): add al,40h cbw and ah,09h add al,ah and al,0Fh While two additional instructions are required, the routine executes in 17 cycles. We are reasonably sure this is the fastest-executing self- contained code to perform the conversion. It has the added benefit of handling the ASCII values 'a'..'f'. These algorithms will hopefully trim down the execution times of your programs, and we welcome any suggestions or improvements on our code. Happy Hacking! ===========================================================================-EOF Sadly, I was unable to come up with a smaller version, or a faster one. Tinara posted a similar solution to the APJ Message Board: SUB AL, 30h AAM 10h ; db D4h, 10h AAD 09h ; db D5h, 09h ...so he gets kudos for uncovering by work what I managed by stealth. As for next issue, I haven't had time to prepare a challenge, but I'm sure one will crop up in the next month or so. Next Issue Challenge -------------------- None so far... submissions welcome. ::/ \::::::. :/___\:::::::. /| \::::::::. :| _/\:::::::::. :| _|\ \::::::::::. :::\_____\:::::::::::.......................................................FIN