Network Working Group        Richard Winter, Jeffrey Hill, Warren Greiff
RFC # 610                                                            CCA
NIC # 21352                                            December 15, 1973

Further Datalanguage Design Concepts

Richard Winter
Jeffrey Hill
Warren Greiff

Computer Corporation of America
December 15, 1973

Acknowledgment

During the course of the Datacomputer Project, many people have contributed to the development of datalanguage.

The suggestions and criticisms of Dr. Gordon Everest (University of Minnesota), Dr. Robert Taylor (University of Massachusetts), Professor Thomas Cheatham (Harvard University) and Professor George Mealy (Harvard University) have been particularly useful.

Within CCA, several people in addition to the authors have participated in the language design at various stages of the project. Hal Murray, Bill Bush, David Shipman and Dale Stern have been especially helpful.

1. Introduction

1.1 The Datacomputer System

The datacomputer is a large-scale data utility system, offering data storage and data management services to other computers.

The datacomputer differs from traditional data management systems in several ways.

First, it is implemented on dedicated hardware, and comprises a separate computing system specialized for data management.

Second, the system is implemented on a large scale. Data is intended to be stored on mass storage devices, with capacities in the range of a trillion bits. Files on the order of one hundred billion bits are to be kept online.

Third, it is intended to support sharing of data among processes operating in diverse environments. That is, the programs which share a given data base may be written in different languages, execute on different hardware under different operating systems, and support end users with radically different requirements.
To enable such shared use of a data base, transformations between various hardware representations and data structuring concepts must be achieved.

Finally, the datacomputer is designed to function smoothly as a component of a much larger system: a computer network. In a computer network, the datacomputer is a node specialized for data management, and acting as a data utility for the other nodes. The Arpanet, for which the datacomputer is being developed, is an international network which has over 60 nodes. Of these, some are presently specialized for terminal handling, others are specialized for computation (e.g., the ILLIAC IV), some are general purpose service nodes (e.g., MULTICS) and one (CCA) is specialized for data management.

1.2 Datalanguage

Datalanguage is the language in which all requests to the datacomputer are stated. It includes facilities for data description and creation, for retrieval of or changes to stored data, and for access to a variety of auxiliary facilities and services. In datalanguage it is possible to specify any operation the datacomputer is capable of performing. Datalanguage is the only language accepted by the datacomputer and is the exclusive means of access to data and services.

1.3 Present Design Effort

We are now engaged in developing complete specifications for datalanguage; this is the second iteration in the language design process.

A smaller, initial design effort developed some concepts and principles which are described in the third working paper in this series. These have been used as the basis of software implementations resulting in an initial network service capability. A user manual for this system was published as working paper number 7.

As a result of experience gained in implementation and service, through further study of user requirements and work with potential users, and through investigation of other work in the data management field, quite a few ideas have been developed for the improvement of datalanguage. These are being assimilated into the language design in the iteration now in progress.

When the language design is complete, it will be incorporated into the existing software (requiring changes to the language compiler, but having little impact on the rest of the system).

Datacomputer users will first have access to the new language during 1975.

1.4 Purpose of this Paper

This paper presents concepts and preliminary results, rather than a completed design. There are two reasons for publishing now.

The first is to provide information to those planning to use the datacomputer. They may benefit from knowledge of our intentions for development.

The second is to enable system and language designers to comment on our work before the design is frozen.

1.5 Organization of the Paper

The remainder of the paper is divided into four sections.

Section 2 discusses the most global considerations for language design. This comprises our view of the problem; it has influenced our work to date and will determine most of our actions in completion of the design. This section provides background for section 3, and reviews some material that will be familiar to those who have been following our work closely.

Section 3 discusses some of the specific issues we have worked on. The emphasis is on solutions and options for solution.

In sections 2 and 3 we are presenting our "top-down" work: this is the thinking we have done based on known requirements and our conception of the desirable properties of datalanguage.

We have also been working from the opposite end, developing the primitives from which to construct the language.
Section 4 presents our work in this area: a model datacomputer which will ultimately provide a precise semantic definition of datalanguage. Section 4 explains that part of the model which is complete, and relates this to our other work.

Section 5 discusses work that remains, both on the model and in our top-down analysis.

2. Considerations for Language Design

2.1 Introduction

Data management is the task of managing data as a resource, independent of hardware and applications programs. It can be divided into five major sub-tasks:

(1) _creating_ databases in storage,

(2) making the data _available_ (e.g., satisfying queries),

(3) _maintaining_ the data as information is added, deleted and modified,

(4) assuring the _integrity_ of the data (e.g., through backup and recovery systems, through internal consistency checks),

(5) _regulating_access_, to protect the databases, the system, and the privacy of users.

These are the major data-related functions of the datacomputer; while the system will ultimately provide other services (such as accounting for use, monitoring performance) these are really auxiliary and common to all service facilities.

This section presents global considerations for the design of datalanguage, based on our observations about the problem and the environment in which it is to be solved. The central problem is data management, and the datacomputer shares the same goals as many currently available data management systems. Several aspects of the datacomputer create a unique set of problems to be solved.

2.2 Hardware Considerations

2.2.1 Separate Box

The datacomputer is a complete data management utility in a separate, closed box. That is, the hardware, the data and the data management software are segregated from any general-purpose processing facilities. There is a separate installation dedicated to data management.
Datalanguage is the only means users have for communicating with the datacomputer and the sole activity of the datacomputer is to process datalanguage requests.

Dedicating hardware provides an obvious advantage: one can specialize it for data management. The processor(s) can be modified to have data management "instructions"; common low-level software functions can be built into the hardware.

A less obvious, but possibly more significant, advantage is gained from the separateness itself. The system can be more easily protected. A fully-developed datacomputer on which there is only maintenance activity can provide a very carefully controlled environment. First, it can be made as physically secure as required. Second, it needs to execute only system software developed at CCA; all user programs are in a high-level language (datalanguage) which is effectively interpreted by the system. Hence, only datacomputer system software processes the data, and the system is not very vulnerable to capture by a hostile program. Thus there is the potential to develop data privacy and integrity services that are not available on general-purpose systems, and one can expect less difficulty in developing privacy controls (including physical ones) for the datacomputer than for the systems it serves.

2.2.2 Mass Storage Hardware

The datacomputer will store most of its data on mass storage devices, which have distinctive access characteristics. Two examples of such hardware are Precision Instruments' Unicon 690 and Ampex Corporation's TBM system. They are quite different from disks, and differ significantly from one another.

However, almost all users will be ignorant of the characteristics of these devices; many will not even know that the data they use is at the datacomputer.
Finally, as the development of the system progresses, data may be invisibly shunted from one datacomputer to another, and as a result be stored in a physical format quite different from that originally used.

In such an environment, it is clear that requests for data should be stated in logical, not physical terms.

2.3 Network Environment

The network environment provides additional requirements for datacomputer design.

2.3.1 Remote Use

Since the datacomputer is to be accessed remotely, the requirement for effective data selection techniques and good mechanisms for the expression of selection criteria is amplified. This is because of the narrow path through which network users communicate with the datacomputer. Presently, a typical process-to-process transfer rate over the Arpanet is 30 kilobits per second. While this can be increased through optimization of software and protocols, and through additional expenditure for hardware and communications lines, it seems safe to assume that it will not soon approach local transfer rates (measured in megabits per second).

A typical request calls for either transfer of part of a file to a remote site, or for selective update to a file already stored at the datacomputer. In both of these situations, good mechanisms for specifying the parts of the data to be transmitted or changed will reduce the amount of data ordinarily transferred. This is extremely important because with the low per bit cost of storing data at the datacomputer, transmission costs will be a significant part of the total cost of datacomputer usage.

2.3.2 Interprocess Use of the Datacomputer System

Effective use of the network requires that groups of processes, remote from one another, be capable of cooperating to accomplish a given task or provide a given service.
For example, to solve a given problem which involves array manipulation, data retrieval, interaction with a user at a terminal, and the generalized services of a language like PL/I, it may be most economical to have four cooperating processes. One of these could execute at the ILLIAC IV, one at the datacomputer, one at MULTICS, and one at a TIP. While there is overhead in setting up these four processes and in having them communicate, each is doing its job on a system specialized for that job. In many cases, the result of using the specialized system is a gain of several orders of magnitude in economy or efficiency (for example, online storage at the datacomputer has a capital cost two orders of magnitude lower than online costs on conventional systems). As a result, there is considerable incentive to consider solutions involving cooperating processes on specialized systems.

To summarize: the datacomputer must be prepared to function as a component of small networks of specialized processes, in order that it can be used effectively in a network in which there are many specialized nodes.

2.3.3 Common Network Data Handling

A large network can support enough data management hardware to construct more than one datacomputer. While this hardware can be combined into one even larger datacomputer, there are advantages to configuring it as two (or possibly more) systems. Each system should be large enough to obtain economies of scale in data storage and to support the data management software. Important data bases can be duplicated, with a copy at each datacomputer; if one datacomputer fails, or is cut off by network failure, the data is still available.
Even if duplicating the file is not warranted, the description can be kept at the different datacomputers so that applications which need to store data constantly can be guaranteed that at least one datacomputer is available to receive input.

These kinds of failure protection involve cooperation between a pair of datacomputers; in some sense, they require that the two datacomputers function as a single system.

Given a system of datacomputers (which one can think of as a small network of datacomputers), it is obviously possible to experiment with providing additional services on the datacomputer-network level. For example, all requests could be addressed simply to the datacomputer-network; the datacomputer-network could then determine where each referenced file was stored (i.e., which datacomputer), and how best to satisfy the request.

Here, two kinds of cooperation in the network environment have been mentioned: cooperation among processes to solve a given problem, and cooperation among datacomputers to provide global optimizations in the network-level data handling problem. These are only two examples, especially interesting because they can be implemented in the near term. In the network, much more general kinds of cooperation are possible, if a little farther in the future.

For example, eventually, one might want the datacomputer(s) to be part of a network-wide data management system, in which data, directories, services, and hardware were generally distributed about the network. The entire system could function as a whole under the right circumstances. Most requests would use the data and services of only a few nodes. Within this network-wide system, there would be more than one data management system, but all systems would be interfaced through a common language. Because the datacomputers represent the largest data management resource in the network, they would certainly play an important role in any network-wide system.
The language of the datacomputer (datalanguage) is certainly a convenient choice for the common language of such a system. Thus a final