CHAPTER 10 RELOCATION AND LINKAGE

A86 allows you to produce either .COM files, which can be run immediately as standalone programs, or .OBJ files, to be fed to the MS-DOS LINK program. In this chapter I'll discuss .OBJ mode of A86.

.OBJ Production Made Easy

I'll start by giving you the minimum amount of information you need to know to produce .OBJ files. If you are writing short interface routines, and do not want to concern yourself with the esoterica of .OBJ files (segments, groups, publics, etc.), you can survive quite nicely by reading only this section.

There are two ways you can cause A86 to produce a .OBJ file as its object output. One way is to explicitly give .OBJ as the output file name: for example, you can assemble the source file FOO.8 by giving the command "A86 FOO.8 FOO.OBJ". The other way is to specify the switch +O (letter O, not digit 0). This is illustrated by the invocation "A86 +O FOO.8", which will have the same effect as the first invocation.

My design philosophy for .OBJ production is to accommodate two types of user. The first type of user is writing new code, to link to other (usually high level language) modules. That person should be able to write the module with a minimum of red tape, and have A86 do the right thing.

The second type of user has existing modules written for Intel/IBM assemblers, and wants to port them to A86. A86 should recognize and act upon all the relocation directives (SEGMENT, GROUP, PUBLIC, EXTRN, NAME, END) given. The assembly should work even if several files, assembled separately under the Intel/IBM assembler, are fed to a single A86 assembly. You'll see if you read on through this entire chapter that the multiple-files requirement causes A86 to interpret some of the relocation directives a little differently (while achieving compatible results).

Let's suppose you're writing new code: for example, an interface routine to the "C" language, that multiplies a 16-bit number by 10. "C" pushes the input number onto the stack, before calling your routine. Your code needs to get the number, multiply it by 10, and return the answer in the AX register. You can code it:

    _MUL10:            ; "C" expects all public names to start with "_"
      PUSH BP          ; "C" expects BP to be preserved
      MOV BP,SP        ; we use BP to address the stack
      MOV AX,[BP+4]    ; fetch the number N, beyond BP and the ret addr
      ADD AX,AX        ; 2N
      MOV BX,AX        ; 2N is saved in BX
      ADD AX,AX        ; 4N
      ADD AX,AX        ; 8N
      ADD AX,BX        ; 8N + 2N = 10N
      POP BP           ; BP is restored
      RET              ; go back to caller

These 11 lines can be your entire source file! If you name the file MUL10.8 and assemble it in .OBJ mode (for example, with "A86 +O MUL10.8"), A86 will create an object file MUL10.OBJ, that conforms to the standard SMALL model of computation for high level languages. If you use RETF instead of RET (and, accordingly, fetch the operand from [BP+6] instead of [BP+4], since a far return address occupies four bytes rather than two), the object module will conform to the standard LARGE model of computation. All the red tape information required by the high level language is provided implicitly by A86. I'll go through this information in detail later, but you should need to read about it only if you're curious.

What happens if you need to access symbols outside the module you're assembling? If the type of the symbol is correctly guessed from the instruction that refers to it, then you can simply refer to it, and leave it undefined within the module. For example, if A86 sees the instruction CALL PRINT with PRINT undefined, it will assume that PRINT is a NEAR procedure. If PRINT is never defined within the module, A86 will act as if you declared PRINT via the directive EXTRN PRINT:NEAR. The address of PRINT will be plugged into your instruction by LINK when it combines A86's .OBJ file with the high level language's .OBJ files, to make the final program.

In general, the undefined operand to any CALL or JMP instruction is assumed to be NEAR. An undefined second (source) operand to a MOV or arithmetic instruction is assumed to be ABS (i.e., an immediate constant).
An undefined first (destination) operand is assumed to be a simple memory variable, of the same size (BYTE or WORD) as the register given in the second operand. If your external symbol does not comply with these guidelines, you need to declare it with an EXTRN before you use it. (You can also use EXTRN to declare types of non-complying forward references within your module, as you'll see later.)

If you'd like to link the MUL10 procedure to Turbo Pascal V4.0 or later, you need to add the line

    CODE SEGMENT PUBLIC

at the top of the program, to name the program segment according to Turbo Pascal's expectations. You may dispense with the leading underscore in the name MUL10-- Turbo Pascal does not require or expect it.

At this point, if you're a casual user, I think you've read enough to get going! Read further only if you wish; or if you get stuck, and need to master the esoterica.

Overview of Relocation and Linkage

When you assemble a program directly into a .COM file, the program has just two forms: the source program, that you can understand, and the .COM file, that the computer can "understand" (i.e., execute). A .OBJ file is an intermediate format: neither you nor the (executing) computer can make sense out of a .OBJ file; only programs like LINK interpret .OBJ files.

The purpose of a .OBJ file is to allow you to assemble or compile just a part of a program. The other parts (also in the form of .OBJ files) can be produced at a different time; often by a different assembler or compiler, whose source files are in a different language.

It's easy to see where the word "linkage" comes from: the LINK program puts the pieces of a program together. The "relocation" comes because the assembler or compiler that makes a given program piece doesn't know how many other pieces will come before it, or how big the other pieces will be. Each piece is constructed as if it started at location 0 within the program; then LINK "relocates" the piece to its true location.

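The idea that each piece is built as if it started at location 0, and is then moved to its true location, can be sketched in a few lines of Python. This is only a toy model, not a description of how LINK actually works: it ignores segments, groups, alignment, and the real .OBJ record format. The Module class, the link function, the module name MAINMOD, and the sizes and offsets are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """A toy object module: code assembled as if it began at offset 0."""
    name: str
    size: int                                       # bytes of code in this piece
    publics: dict = field(default_factory=dict)     # symbol -> offset inside this module
    externals: list = field(default_factory=list)   # symbols used here, defined elsewhere

def link(modules):
    """Place modules one after another and resolve symbols (hypothetical model)."""
    bases, symbols = {}, {}
    next_free = 0
    for m in modules:
        bases[m.name] = next_free                   # "relocate": the piece really starts here
        for sym, offset in m.publics.items():
            symbols[sym] = next_free + offset       # true address = base + offset-from-0
        next_free += m.size
    # Anything referenced but never made PUBLIC by some module stays unresolved.
    unresolved = sorted({s for m in modules for s in m.externals if s not in symbols})
    return bases, symbols, unresolved

# Illustrative sizes only; MAINMOD is an invented high level language module.
main_mod = Module("MAINMOD", size=0x120, publics={"MAIN": 0x10},
                  externals=["_MUL10", "PRINT"])
mul10 = Module("MUL10", size=0x12, publics={"_MUL10": 0})

bases, symbols, unresolved = link([main_mod, mul10])
print(bases)        # {'MAINMOD': 0, 'MUL10': 288}
print(symbols)      # {'MAIN': 16, '_MUL10': 288}
print(unresolved)   # ['PRINT']
```

In this model _MUL10 sits at offset 0 of its own module but ends up at 288 (120 hex) in the combined program, because MAINMOD was placed ahead of it. PRINT is reported as unresolved because no module in the list makes it PUBLIC.
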
Many of the relocation features of 86 assembly language are couched in terms of LINK's point of view, so we must look at the way LINK sees things. LINK calls a .OBJ file an "object module", or just "module". Each module has a NAME, that can be referred to when LINK issues diagnostic messages, such as error messages and symbol maps.

If a program symbol is used only within a single module, it does not need to be given to LINK, except possibly to pass along to a symbolic debugger. On the other hand, if a program symbol is defined in one module and referenced in other modules, then LINK needs to know the name of the symbol, so it can resolve the references. Such a symbol is PUBLIC in the module in which it is defined; it is "external" in the other modules, containing references to it.

Finally, exactly one module in a program must contain the starting location for the program; that module is called the "main module", and it must supply the starting address (which is not necessarily at the beginning of the module).

In the 86 family of microprocessors, the LINK system also does much to manage the memory segments that a program will fit into, and get its data from. The (grotesquely ornate) level of support for segmentation was dictated by Intel when it specified (and IBM and the compiler makers accepted) the format that .OBJ files will have. I attended the fateful meeting at Intel, in which the crucial design decisions were made. I regret to say that I sat quietly, while engineers more senior than I applied their fertile imaginations to construct fanciful scenarios which they felt had to be supported by LINK. Let's now review the resulting segmentation model.

The parts of a program, as viewed by LINK, come in three different sizes: they can be (1) pieces of a single segment, (2) an entire single segment, or (3) a sequence of consecutive segments in 86 memory. Size (1) should have been called something like FRAGMENT, but is instead called SEGMENT.
Size (2) should have been called SEGMENT, but is instead called GROUP. Size (3) should have been called "group", but is instead called "class". Let me cling to the sensible terminology for one more paragraph, while I describe the worst scenario Intel wanted to support; then when I discuss individual directives, I'll regretfully revert to the official terminology.

The scenario is as follows: suppose you have a program that occupies about 100K bytes of memory. The program contains a core of 20K bytes of utility routines that every part of the program calls. You'd like every part of the program to be able to call these routines, using the NEAR form to save memory. By gum, you can do it! You simply(!) slice the program into three fragments: the utility routines will go into fragment U, and the rest of the program will be split into equal-sized 40K-byte fragments A and B. Now you arrange the fragments in 8086 memory in the order A,U,B. The fragments A and U form a 60K-byte block, addressed by a segment register value G1, that points to the beginning of A. The fragments U and B form another 60K-byte block addressed by a segment register value G2, that points to the beginning of U. If you set the CS register to G1 when A is executing, and G2 when B is executing, the U fragment is accessible at all times. Since all direct JMPs and CALLs are encoded as relative offsets, the U-code will execute direct jumps correctly whether addressed by G1 with a huge offset, or G2 with a small offset. Of course, if U contains any absolute pointers referring to itself (such as an indirect near JMP or CALL), you're in trouble.

It's now been over a decade since the fateful design meeting took place, and I can report that the above scenario has never taken place in the real world. And I can state with some authority that it never will. The reason is that the only programs that exceed 64K bytes in size are coded in high level language, not assembly language.
High level language compilers follow a very, very restricted segmentation model-- no existing model comes remotely close to supporting the scheme suggested by the scenario. But the 86 assembly language can support it: you give the directives "G1 GROUP A,U" and "G2 GROUP B,U", followed by chunks of code of the appropriate object size, headed by directives "A SEGMENT", "B SEGMENT", and "U SEGMENT". The LINK program is supposed to sort things out according to the scenario; but I can't say (and I have my doubts) if it actually succeeds in doing so.

The concept of "class" was added as an afterthought, to implement the more sensible and usable features that outsiders thought GROUPs were implementing; namely, the ability to specify that different (and disjoint!) segments occur consecutively in memory. This allows programs to be arranged in a consistent manner-- for example, with all program code followed by all static data segments followed by all dynamically allocated memory.

The NAME Directive

Syntax:   NAME module_name

The NAME directive specifies that "module_name" be given to LINK as the name of the module produced by this assembly. The symbol "module_name" can be used elsewhere in your program without conflict: it can even, if you like, be a built-in assembler mnemonic (e.g. "NAME MOV" is acceptable)! If you do not provide a NAME directive, A86 will use the name of the output object file, without the .OBJ extension. If you provide more than one NAME directive, A86 will use the last one given, with no error reported.

The PUBLIC Directive

Syntax:   PUBLIC sym1, sym2, sym3, ...
          PUBLIC

The PUBLIC directive allows you to explicitly list the symbols defined in this assembly, that can be used by other modules. If you do not give any PUBLIC directives in your program, A86 will use every relocatable label and variable name in your program, except local labels (the redefinable labels consisting of a letter followed by digits: L7, M1, Q234, etc.).
Symbols EQUated to constants, and symbols defined within structures and DATA SEGMENTs, are not implicitly declared PUBLIC: you have to explicitly include them in a PUBLIC directive.

A86 maintains an internal flag, telling it whether to figure out for itself which symbols are PUBLIC, or to let the program explicitly declare them. The flag starts out "implicit", and is set to "explicit" only if A86 sees a PUBLIC directive with no names at all, or a PUBLIC directive containing at least one name that would have been implicitly made PUBLIC.

If you are writing new code, you'll probably want to keep the flag "implicit". You use the PUBLIC directive only for those symbols which have the form of local labels, but aren't (e.g., a memory variable I1987 for 1987 income); and for absolute values that are globally accessed-- e.g. specify "PUBLIC OPEN_FILES_LIMIT" for a symbol defined as "OPEN_FILES_LIMIT EQU 20".

If you are porting existing code, that code will already have PUBLIC directives in it, and A86 will go to "explicit" mode, duplicating the functionality of other assemblers.

The PUBLIC directive with no names is used to force "explicit" mode, thus causing (if there are no further PUBLICs with names) the .OBJ file to declare no symbols PUBLIC.

There is another side effect to the PUBLIC directive: if a symbol is declared PUBLIC in a module, it had better be defined in that module. If it isn't, then A86 includes it in the .ERR listing of undefined symbols in the module, and suppresses output of the object file.

The EXTRN Directive

Syntax:   EXTRN sym1:type, sym2:type, ...

  where "type" is one of:  BYTE  WORD  DWORD  QWORD  TBYTE  FAR
  or synonymously:         B     W     D      Q      T      F
  or:                      NEAR  ABS

The EXTRN directive allows you to attach a type to a symbol that may not yet be defined (and may never be defined) within your program. This is often necessary for the assembler to generate the correct instruction form when the symbol is used as an operand.

All the possible types except ABS are defined elsewhere in the A86 language, but I list them again here for convenience:

  B or BYTE:   byte-sized memory variable
  W or WORD:   word (2-byte) sized memory variable
  D or DWORD:  doubleword (4-byte) sized memory variable
  Q or QWORD:  quadword (8-byte) sized memory variable
  T or TBYTE:  10-byte-sized memory variable
  NEAR:        program label accessed within a segment
  F or FAR:    program label accessed from outside this segment
  ABS:         an absolute number (i.e., an immediate constant)

An example of EXTRN usage is as follows: suppose there is a word memory variable IFARK in your program. The variable might be declared at the end of the program; or it might be defined in a module completely outside of this program. Without an EXTRN directive, A86 will assemble an instruction such as "MOV AX,IFARK" as the loading of an immediate constant IFARK into the AX register. If you place the directive "EXTRN IFARK:W" at the top of your program, you'll get the correct instruction form for MOV AX,IFARK-- moving a word memory variable into the AX register.

A86 will allow more than one EXTRN directive for a given symbol, as long as the same type is given every time. A86 will even allow an EXTRN directive for a symbol that has already been defined, as long as the type declared is consistent with the symbol's definition. These allowances exist so that you can assemble multiple files written for another assembler, that had been fed separately to that assembler.

Note that EXTRN is viewed quite differently by A86 than by other assemblers. In fact, if it weren't for those other assemblers, I'd use the mnemonic DECLARE instead of EXTRN. A86 doesn't really use EXTRN to determine which symbols are external-- it uses those symbols that are undefined at the end of assembly. As I stated earlier in the chapter, an undefined symbol can be referenced without being declared via EXTRN.
Conversely, a defined symbol can be declared (and redeclared) via EXTRN; being defined, such a symbol will not be specified "external" in the .OBJ file.

Because EXTRN is useful in forward reference situations, it is now recognized even when A86 is assembling a .COM file.

For those of you who are accustomed to the more traditional use of EXTRN, and who do not like external records to be created "behind your back", A86 offers the "+x" option. If you include "+x" in the program invocation, A86 will require that all undefined symbols be explicitly declared via an EXTRN. Any undefined, undeclared symbols will be included in the .ERR listing of undefined symbols, and object file output will be suppressed.

MAIN: The Starting Location for a Program

I've already stated that exactly one module in a program is the "main" module, containing the starting address of the entire program. In A86 when assembling .OBJ files, the starting address is given by the label MAIN. You simply provide the label "MAIN:" where you want the program to start. The module containing MAIN is the main module.

The END Directive

Syntax:   END
          END start_addr

The END directive is used by other assemblers for two purposes, both of which are now a little silly. The first purpose is to signal the end of assembly. This was necessary back in the days when source files were input on media such as paper tape: you had to tell the assembler explicitly that the content of the tape has ended. Today the operating system can tell you when you've reached the end of the file, so this function is an anachronism.

The second purpose of END is, nonsensically, to allow you to specify the starting location of the program. I suppose the person who wrote the first assembler back in the 1950s was too short on memory to implement a separate START directive, or a MAIN label like A86 has, and decided to let END do double duty.
I've always considered the example "END START" to have an Alice-in-Wonderland quality; it is fuel for the high-level-language snobs who like to attack assembly language. Please defeat the snobs, and use "MAIN:" if you are writing new code.

For compatibility, A86 treats "END start_addr" exactly the same as if you had coded "MAIN EQU start_addr". Note that if you want your program to assemble under both A86 and that other assembler, you can specify "END MAIN"-- A86 treats MAIN EQU MAIN as a legal redefinition of the symbol MAIN.

A86 ignores END when there is no starting-address operand, thus allowing assembly of multiple files written for other assemblers.

The SEGMENT Directive

Syntax:   seg_name SEGMENT [align] [combine] ['class_name']

  where "align" is one of:    BYTE  WORD  PARA  PAGE
        "combine" is one of:  PUBLIC  STACK  COMMON  MEMORY  AT number

The SEGMENT directive says that assembled object code will henceforth go to a block of code whose name is "seg_name". "seg_name" is a symbol that represents a value that can be loaded into a segment register. If "seg_name" is not declared in a GROUP directive, then its value should in fact be loaded into a segment register, in order to address the code. If "seg_name" is declared in a GROUP directive, then the code is a part of the segment addressed by the name of the group.

A program can consist of any number of named segments, to be combined in numerous exotic ways to produce the final program. You can redirect your object output from one segment to another in your assembly, by providing a SEGMENT directive before each piece of code. You can even return to a segment you started earlier, by repeating a SEGMENT with the same name-- the assembler just picks up where it left off, subject to some possible skipping for memory alignment, that I'll describe shortly.

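The "picks up where it left off" behavior amounts to keeping one location counter per segment name. Here is a minimal Python sketch of that bookkeeping, assuming no alignment skipping at all. The SegmentTracker class and the segment names MY_CODE and MY_DATA are hypothetical, and nothing here is taken from A86's actual implementation.

```python
class SegmentTracker:
    """Toy model: one location counter per named segment (alignment ignored)."""

    def __init__(self):
        self.counters = {}     # segment name -> next free offset in that segment
        self.current = None    # name of the segment now receiving output

    def segment(self, name):
        """Model 'name SEGMENT': switch output; a repeated name resumes its counter."""
        self.counters.setdefault(name, 0)
        self.current = name

    def emit(self, nbytes):
        """Place nbytes of object code in the current segment; return (segment, offset)."""
        start = self.counters[self.current]
        self.counters[self.current] = start + nbytes
        return self.current, start

asm = SegmentTracker()
asm.segment("MY_CODE")
first = asm.emit(10)       # ('MY_CODE', 0)
asm.segment("MY_DATA")
data = asm.emit(4)         # ('MY_DATA', 0)
asm.segment("MY_CODE")     # return to the earlier segment
second = asm.emit(6)       # ('MY_CODE', 10): continues after the first 10 bytes
print(first, data, second)
```
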
The specifications following the word SEGMENT help to describe how the code in this module's part of the segment will be combined with code for the same segment name given in other modules; and also how this named segment will be grouped with other named segments. Other assemblers require the specifications to be given in the order indicated. A86 will accept any order, and will accept commas between the specifications if you want to provide them. The only restriction is that "AT number" must be followed by a comma if it is not the last specification on the line.

The "align" specification tells if each piece of code within the segment should be aligned so that its starting address is an exact multiple of some number. BYTE alignment means there is no requirement; WORD alignment requires each piece to start at a multiple of 2; PARA alignment, at a multiple of 16; PAGE alignment, at a multiple of 256.

For example, suppose you have a segment containing memory variables. You can declare the segment with the statement "VAR_DATA SEGMENT WORD", which ensures that the segment is aligned to an even memory address. That way you can ensure that all 16-bit and bigger memory quantities in the segment are aligned to even addresses, for faster access on the 16-bit machines of the 86 family.

There are special rules governing alignment for multiple pieces of the same named segment within the same program module. Other assemblers outlaw conflicting alignment specifications in this situation; A86 accepts them, and uses the strictest specification given. Furthermore, the alignment given for any specification beyond the first will control the alignment for that piece of code within this module's chunk. For example, if a program contains two pieces of code headed by "VAR_DATA SEGMENT WORD", A86 will insert a byte between the pieces if the first piece has an odd number of bytes. This ensures correct assembly for multiple files written for another assembler.

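The alignment arithmetic above can be checked with a short Python sketch. The boundary values come from the text; the function names padding_needed and strictest are made up for illustration, and the code only models the arithmetic, not how A86 itself is written.

```python
# Boundary, in bytes, required by each "align" keyword (values from the text above).
ALIGNMENTS = {"BYTE": 1, "WORD": 2, "PARA": 16, "PAGE": 256}

def padding_needed(offset, align):
    """Bytes to skip so that the next piece starts at a multiple of the boundary."""
    return (-offset) % ALIGNMENTS[align]

def strictest(*aligns):
    """Pick the most demanding of several conflicting alignment specifications."""
    return max(aligns, key=lambda a: ALIGNMENTS[a])

# Two pieces of "VAR_DATA SEGMENT WORD": a first piece of 5 bytes (odd length)
# forces one byte of padding before the second piece.
print(padding_needed(5, "WORD"))           # 1
print(padding_needed(6, "WORD"))           # 0
print(padding_needed(5, "PARA"))           # 11
print(strictest("BYTE", "PARA", "WORD"))   # PARA
```
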
If no "align" type is given for any of the pieces of a named segment, an alignment of PARA is assumed.

The "combine" specification tells how the chunk of code from this module will be combined with the chunks of the same named segment, that come from other modules. Yes, I know, that sounds like what "align" does; but "combine" takes a different, more major point of view: