Network Working Group        Richard Winter, Jeffrey Hill, Warren Greiff
RFC # 610                                                            CCA
NIC # 21352                                            December 15, 1973


                  Further Datalanguage Design Concepts


                             Richard Winter
                              Jeffrey Hill
                             Warren Greiff


                    Computer Corporation of America
                           December 15, 1973


Winter, Hill & Greiff                                           [Page 1]

RFC 610           Further Datalanguage Design Concepts     December 1973


                             Acknowledgment

During the course of the Datacomputer Project, many people have
contributed to the development of datalanguage.

The suggestions and criticisms of Dr. Gordon Everest (University of
Minnesota), Dr. Robert Taylor (University of Massachusetts), Professor
Thomas Cheatham (Harvard University) and Professor George Mealy (Harvard
University) have been particularly useful.

Within CCA, several people in addition to the authors have participated
in the language design at various stages of the project. Hal Murray,
Bill Bush, David Shipman and Dale Stern have been especially helpful.


1.  Introduction


1.1 The Datacomputer System

The datacomputer is a large-scale data utility system, offering data
storage and data management services to other computers.

The datacomputer differs from traditional data management systems in
several ways.

First, it is implemented on dedicated hardware, and comprises a separate
computing system specialized for data management.

Second, the system is implemented on a large scale. Data is intended to
be stored on mass storage devices, with capacities in the range of a
trillion bits.  Files on the order of one hundred billion bits are to be
kept online.

Third, it is intended to support sharing of data among processes
operating in diverse environments.  That is, the programs which share a
given data base may be written in different languages, execute on
different hardware under different operating systems, and support end
users with radically different requirements.  To enable such shared use
of a data base, transformations between various hardware representations
and data structuring concepts must be achieved.
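
As an illustration only, the kind of transformation described above can
be sketched in Python (a modern language, not datalanguage).  The two
host formats, the field names and the helper functions are hypothetical
and are not taken from the datacomputer design; the sketch assumes a
record consisting of one integer and one fixed-width name field.

```python
import struct

# Two hypothetical host formats for the same logical record (id, name).
HOST_A = ">i8s"    # big-endian 32-bit integer, 8-byte name field
HOST_B = "<i12s"   # little-endian 32-bit integer, 12-byte name field

def to_logical(raw, fmt):
    """Decode a host-specific byte string into a host-independent record."""
    ident, name = struct.unpack(fmt, raw)
    return {"id": ident, "name": name.rstrip(b" \0").decode("ascii")}

def from_logical(record, fmt):
    """Encode a host-independent record in a given host format."""
    width = struct.calcsize(fmt) - 4          # bytes left for the name field
    name = record["name"].encode("ascii").ljust(width, b" ")
    return struct.pack(fmt, record["id"], name)

def convert(raw, src_fmt, dst_fmt):
    """Translate between two host formats via the logical record."""
    return from_logical(to_logical(raw, src_fmt), dst_fmt)
```

Routing every conversion through one logical form means each new host
format needs only one decoder and one encoder, rather than a converter
for every pair of hosts.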

Finally, the datacomputer is designed to function smoothly as a
component of a much larger system: a computer network.  In a computer
network, the datacomputer is a node specialized for data management, and
acting as a data utility for the other nodes.  The Arpanet, for which
the datacomputer is being developed, is an international network which
has over 60 nodes.  Of these, some are presently specialized for
terminal handling, others are specialized for computation (e.g., the
ILLIAC IV), some are general purpose service nodes (e.g., MULTICS) and
one (CCA) is specialized for data management.


1.2 Datalanguage

Datalanguage is the language in which all requests to the datacomputer
are stated.  It includes facilities for data description and creation,
for retrieval of or changes to stored data, and for access to a variety
of auxiliary facilities and services.  In datalanguage it is possible to
specify any operation the datacomputer is capable of performing.
Datalanguage is the only language accepted by the datacomputer and is
the exclusive means of access to data and services.


1.3 Present Design Effort

We are now engaged in developing complete specifications for
datalanguage; this is the second iteration in the language design
process.

A smaller, initial design effort developed some concepts and principles
which are described in the third working paper in this series.  These
have been used as the basis of software implementations resulting in an
initial network service capability.  A user manual for this system was
published as working paper number 7.

As a result of experience gained in implementation and service, through
further study of user requirements and work with potential users, and
through investigation of other work in the data management field, quite
a few ideas have been developed for the improvement of datalanguage.
These are being assimilated into the language design in the iteration
now in progress.

When the language design is complete, it will be incorporated into the
existing software (requiring changes to the language compiler, but
having little impact on the rest of the system).

Datacomputer users will first have access to the new language during
1975.


1.4 Purpose of this Paper

This paper presents concepts and preliminary results, rather than a
completed design.  There are two reasons for publishing now.

The first is to provide information to those planning to use the
datacomputer.  They may benefit from knowledge of our intentions for
development.

The second is to enable system and language designers to comment on our
work before the design is frozen.


1.5 Organization of the Paper

The remainder of the paper is divided into four sections.

Section 2 discusses the most global considerations for language design.
This comprises our view of the problem; it has influenced our work to
date and will determine most of our actions in completion of the design.
This section provides background for section 3, and reviews some
material that will be familiar to those who have been following our work
closely.

Section 3 discusses some of the specific issues we have worked on.  The
emphasis is on solutions and options for solution.

In sections 2 and 3 we are presenting our "top-down" work: this is the
thinking we have done based on known requirements and our conception of
the desirable properties of datalanguage.

We have also been working from the opposite end, developing the
primitives from which to construct the language.  Section 4 presents our
work in this area: a model datacomputer which will ultimately provide a
precise semantic definition of datalanguage.  Section 4 explains that
part of the model which is complete, and relates this to our other work.

Section 5 discusses work that remains, both on the model and in our
top-down analysis.


2.  Considerations for Language Design


2.1 Introduction

Data management is the task of managing data as a resource, independent
of hardware and applications programs.  It can be divided into five
major sub-tasks:

    (1) _creating_ databases in storage,
    (2) making the data _available_ (e.g., satisfying queries),
    (3) _maintaining_ the data as information is added, deleted and
        modified,
    (4) assuring the _integrity_ of the data (e.g., through backup and
        recovery systems, through internal consistency checks),
    (5) _regulating_access_, to protect the databases, the system, and
        the privacy of users.

These are the major data-related functions of the datacomputer; while
the system will ultimately provide other services (such as accounting
for use, monitoring performance) these are really auxiliary and common
to all service facilities.

This section presents global considerations for the design of
datalanguage, based on our observations about the problem and the
environment in which it is to be solved.  The central problem is data
management, and the datacomputer shares the same goals as many currently
available data management systems.  However, several aspects of the
datacomputer create a unique set of problems to be solved.


2.2 Hardware Considerations


2.2.1 Separate Box

The datacomputer is a complete data management utility in a separate,
closed box.  That is, the hardware, the data and the data management
software are segregated from any general-purpose processing facilities.
There is a separate installation dedicated to data management.
Datalanguage is the only means users have for communicating with the
datacomputer and the sole activity of the datacomputer is to process
datalanguage requests.

Dedicating hardware provides an obvious advantage: one can specialize it
for data management.  The processor(s) can be modified to have data
management "instructions"; common low-level software functions can be
built into the hardware.


A less obvious, but possibly more significant, advantage is gained from
the separateness itself.  The system can be more easily protected.  A
fully-developed datacomputer on which there is only maintenance activity
can provide a very carefully controlled environment.  First, it can be
made as physically secure as required.  Second, it needs to execute only
system software developed at CCA; all user programs are in a high-level
language (datalanguage) which is effectively interpreted by the system.
Hence, only datacomputer system software processes the data, and the
system is not very vulnerable to capture by a hostile program.  Thus
there is the potential to develop data privacy and integrity services
that are not available on general-purpose systems, and one can expect
less difficulty in developing privacy controls (including physical
ones) for the datacomputer than for the systems it serves.


2.2.2 Mass Storage Hardware

The datacomputer will store most of its data on mass storage devices,
which have distinctive access characteristics.  Two examples of such
hardware are Precision Instruments' Unicon 690 and Ampex Corporation's
TBM system.  They are quite different from disks, and differ
significantly from one another.

However, almost all users will be ignorant of the characteristics of
these devices; many will not even know that the data they use is at the
datacomputer.  Finally, as the development of the system progresses,
data may be invisibly shunted from one datacomputer to another, and as a
result be stored in a physical format quite different from that
originally used.

In such an environment, it is clear that requests for data should be
stated in logical, not physical terms.


2.3 Network Environment

The network environment provides additional requirements for
datacomputer design.

2.3.1 Remote Use

Since the datacomputer is to be accessed remotely, the requirement for
effective data selection techniques and good mechanisms for the
expression of selection criteria is amplified.  This is because of the
narrow path through which network users communicate with the
datacomputer.  Presently, a typical process-to-process transfer rate
over the Arpanet is 30 kilobits per second.  While this can be increased
through optimization of software and protocols, and through additional
expenditure for hardware and communications lines, it seems safe to
assume that it will not soon approach local transfer rates (measured in
megabits per second).

A typical request calls for either transfer of part of a file to a
remote site, or for selective update to a file already stored at the
datacomputer.  In both of these situations, good mechanisms for
specifying the parts of the data to be transmitted or changed will
reduce the amount of data ordinarily transferred.  This is extremely
important because with the low per bit cost of storing data at the
datacomputer, transmission costs will be a significant part of the total
cost of datacomputer usage.
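
The effect of selection on transfer time can be made concrete with a
little arithmetic, sketched here in Python.  The file size and transfer
rate are the figures quoted in sections 1.1 and 2.3.1; the selectivity
(one bit in ten thousand) is purely an assumption chosen for
illustration, and protocol overhead is ignored.

```python
FILE_BITS = 100_000_000_000   # "one hundred billion bits" (section 1.1)
NETWORK_BPS = 30_000          # "30 kilobits per second" (section 2.3.1)
SECONDS_PER_DAY = 86_400

def transfer_seconds(bits, bits_per_second=NETWORK_BPS):
    """Time to move `bits` at the given rate, ignoring all overhead."""
    return bits / bits_per_second

# Shipping the whole file to the remote site.
whole_file_days = transfer_seconds(FILE_BITS) / SECONDS_PER_DAY

# Shipping only the selected part (hypothetical selectivity).
SELECTED_ONE_IN = 10_000
selected_bits = FILE_BITS // SELECTED_ONE_IN
selected_minutes = transfer_seconds(selected_bits) / 60

print(f"whole file: about {whole_file_days:.1f} days")
print(f"selection:  about {selected_minutes:.1f} minutes")
```

Under these assumptions the whole file would occupy the path for more
than a month, while the selected part would take a few minutes.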


2.3.2 Interprocess Use of the Datacomputer System

Effective use of the network requires that groups of processes, remote
from one another, be capable of cooperating to accomplish a given task
or provide a given service.  For example, to solve a given problem which
involves array manipulation, data retrieval, interaction with a user at
a terminal, and the generalized services of a language like PL/I, it may
be most economical to have four cooperating processes.  One of these
could execute at the ILLIAC IV, one at the datacomputer, one at MULTICS,
and one at a TIP.  While there is overhead in setting up these four
processes and in having them communicate, each is doing its job on a
system specialized for that job.  In many cases, the result of using the
specialized system is a gain of several orders of magnitude in economy
or efficiency (for example, online storage at the datacomputer has a
capital cost two orders of magnitude lower than online storage on
conventional systems).  As a result, there is considerable incentive to
consider solutions involving cooperating processes on specialized
systems.

To summarize: the datacomputer must be prepared to function as a
component of small networks of specialized processes, in order that it
can be used effectively in a network in which there are many specialized
nodes.


2.3.3 Common Network Data Handling

A large network can support enough data management hardware to construct
more than one datacomputer.  While this hardware can be combined into
one even larger datacomputer, there are advantages to configuring it as
two (or possibly more) systems.  Each system should be large enough to
obtain economies of scale in data storage and to support the data
management software.  Important data bases can be duplicated, with a
copy at each datacomputer; if one datacomputer fails, or is cut off by
network failure, the data is still available. Even if duplicating the
file is not warranted, the description can be kept at the different
datacomputers so that applications which need to store data constantly
can be guaranteed that at least one datacomputer is available to receive
input.

These kinds of failure protection involve cooperation between a pair of
datacomputers; in some sense, they require that the two datacomputers
function as a single system.  Given a system of datacomputers (which one
can think of as a small network of datacomputers), it is obviously
possible to experiment with providing additional services on the
datacomputer-network level.  For example, all requests could be
addressed simply to the datacomputer-network; the datacomputer-network
could then determine where each referenced file was stored (i.e., which
datacomputer), and how best to satisfy the request.
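
A minimal sketch of such request routing, written in Python, is given
below.  It assumes a directory that maps each file to the datacomputers
holding a copy, listed in order of preference.  The site names, file
names and function are hypothetical; the paper does not specify how the
datacomputer-network would make this determination.

```python
class NoReplicaAvailable(Exception):
    """Raised when no reachable datacomputer holds the requested file."""

# Hypothetical directory: file name -> sites holding a copy, in
# order of preference.
DIRECTORY = {
    "weather.observations": ["DC-A", "DC-B"],   # duplicated data base
    "seismic.readings": ["DC-B"],               # single copy
}

def route_request(file_name, reachable, directory=DIRECTORY):
    """Return the first reachable datacomputer holding `file_name`."""
    for site in directory.get(file_name, []):
        if site in reachable:
            return site
    raise NoReplicaAvailable(file_name)
```

With both sites reachable, a request for "weather.observations" goes to
DC-A; if DC-A fails or is cut off, the same request is routed to DC-B
without any change on the requester's side.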

Here, two kinds of cooperation in the network environment have been
mentioned: cooperation among processes to solve a given problem, and
cooperation among datacomputers to provide global optimizations in the
network-level data handling problem.  These are only two examples,
especially interesting because they can be implemented in the near term.
In the network, much more general kinds of cooperation are possible, if
a little farther in the future.  For example, eventually, one might want
the datacomputer(s) to be part of a network-wide data management system,
in which data, directories, services, and hardware were generally
distributed about the network.  The entire system could function as a
whole under the right circumstances.  Most requests would use the data
and services of only a few nodes.  Within this network-wide system,
there would be more than one data management system, but all systems
would be interfaced through a common language.  Because the
datacomputers represent the largest data management resource in the
network, they would certainly play an important role in any network-wide
system.  The language of the datacomputer (datalanguage) is certainly a
convenient choice for the common language of such a system.

Thus a final