SECCOMP_UNOTIFY(2)                                                      Linux Programmer's Manual                                                     SECCOMP_UNOTIFY(2)

NAME
       seccomp_unotify - Seccomp user-space notification mechanism

SYNOPSIS
       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

       #include <sys/ioctl.h>

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *resp);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
                 struct seccomp_notif_addfd *addfd);

DESCRIPTION
       This  page  describes  the  user-space  notification  mechanism  provided  by  the  Secure  Computing (seccomp) facility.  As well as the use of the SECCOMP_FIL‐
       TER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this  mechanism  involves
       the use of a number of related ioctl(2) operations (described below).

   Overview
       In conventional usage of a seccomp filter, the decision about how to treat a system call is made by the filter itself.  By contrast, the user-space notification
       mechanism allows the seccomp filter to delegate the handling of the system call to another user-space process.  Note that this mechanism is explicitly not
       intended as a method for implementing security policy; see NOTES.

       In  the  discussion  that follows, the thread(s) on which the seccomp filter is installed is (are) referred to as the target, and the process that is notified by
       the user-space notification mechanism is referred to as the supervisor.

       A suitably privileged supervisor can use the user-space notification mechanism to perform actions on behalf of the target.  The advantage of the user-space noti‐
       fication mechanism is that the supervisor will usually be able to retrieve information about the target and the performed system call that the seccomp filter it‐
       self cannot.  (A seccomp filter is limited in the information it can obtain and the actions that it can perform because it is running on a virtual machine inside
       the kernel.)

       An overview of the steps performed by the target and the supervisor is as follows:

       1. The target establishes a seccomp filter in the usual manner, but with two differences:

          • The  seccomp(2)  flags argument includes the flag SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return value of the (successful) seccomp(2) call is a
            new "listening" file descriptor that can be used to receive notifications.  Only one "listening" seccomp filter can be installed for a thread.

          • In cases where it is appropriate, the seccomp filter returns the action value SECCOMP_RET_USER_NOTIF.  This return value will trigger a notification event.

       2. In order that the supervisor can obtain notifications using the listening file descriptor, (a duplicate of) that file descriptor must be passed from the  tar‐
          get  to the supervisor.  One way in which this could be done is by passing the file descriptor over a UNIX domain socket connection between the target and the
          supervisor (using the SCM_RIGHTS ancillary message type described in unix(7)).  Another way to do this is through the use of pidfd_getfd(2).

       3. The supervisor will receive notification events on the listening file descriptor.  These events are returned as structures of type seccomp_notif.  Because
          this structure and its size may evolve over kernel versions, the supervisor must first determine the size of this structure using the seccomp(2)
          SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of type seccomp_notif_sizes.  The supervisor allocates a buffer of size
          seccomp_notif_sizes.seccomp_notif bytes to receive notification events.  In addition, the supervisor allocates another buffer of size
          seccomp_notif_sizes.seccomp_notif_resp bytes for the response (a struct seccomp_notif_resp structure) that it will provide to the kernel (and thus the target).

       4. The target then performs its workload, which includes system calls that will be controlled by the seccomp filter.  Whenever one of these system  calls  causes
          the filter to return the SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet) execute the system call; instead, execution of the target is temporar‐
          ily blocked inside the kernel (in a sleep state that is interruptible by signals) and a notification event is generated on the listening file descriptor.

       5. The supervisor can now repeatedly monitor the listening file descriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do this, the supervisor uses the SEC‐
          COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information about a notification event; this operation blocks until an event is available.  The operation re‐
          turns a seccomp_notif structure containing information about the system call that is being attempted by the target.  (As described in NOTES, the file descrip‐
          tor can also be monitored with select(2), poll(2), or epoll(7).)

       6. The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV operation includes the same information (a seccomp_data structure) that was passed to the
          seccomp filter.  This information allows the supervisor to discover the system call number and the arguments for the target's system call.  In  addition,  the
          notification  event  contains  the  ID  of  the  thread that triggered the notification and a unique cookie value that is used in subsequent SECCOMP_IOCTL_NO‐
          TIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations.

          The information in the notification can be used to discover the values of pointer arguments for the target's system call.  (This is something  that  can't  be
          done  from  within  a  seccomp  filter.)  One way in which the supervisor can do this is to open the corresponding /proc/[tid]/mem file (see proc(5)) and read
          bytes from the location that corresponds to one of the pointer arguments whose value is supplied in the notification event.  (The supervisor must  be  careful
          to avoid a race condition that can occur when doing this; see the description of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)  In addition, the
          supervisor can access other system information that is visible in user space but which is not accessible from a seccomp filter.

       7. Having obtained information as per the previous step, the supervisor may then choose to perform an action in response to the target's system call  (which,  as
          noted above, is not executed when the seccomp filter returns the SECCOMP_RET_USER_NOTIF action value).

          One  example  use  case  here  relates  to containers.  The target may be located inside a container where it does not have sufficient capabilities to mount a
          filesystem in the container's mount namespace.  However, the supervisor may be a more privileged process that does have sufficient capabilities to perform the
          mount operation.

       8. The supervisor then sends a response to the notification.  The information in this response is used by the kernel to construct a return value for the target's
          system call and provide a value that will be assigned to the errno variable of the target.

          The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation, which is used to transmit a seccomp_notif_resp structure to the kernel.  This
          structure must include the cookie value that the supervisor obtained in the seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV operation; the
          cookie allows the kernel to associate the response with the target.

       9. Once the notification response has been sent, the system call in the target thread unblocks, returning the information that was provided by the supervisor
          in the notification response.

       As a variation on the last two steps, the supervisor can send a response that tells the kernel that it should execute the target thread's system  call;  see  the
       discussion of SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
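
       Step 1 above can be sketched as follows.  This is a minimal illustration, not a complete program: the function names and the NOTIFY_FILTER_LEN constant are
       hypothetical, the filter triggers a notification for a single system call number and allows everything else, and the architecture check that a real filter
       should perform (see seccomp(2)) is omitted.  Because glibc provides no wrapper for seccomp(2), the sketch calls it via syscall(2).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Number of BPF instructions in the filter built below (hypothetical name). */
#define NOTIFY_FILTER_LEN 4

/* Fill 'insns' with a filter that returns SECCOMP_RET_USER_NOTIF for
   system call number 'nr' and SECCOMP_RET_ALLOW for everything else.
   A real filter should first check seccomp_data.arch; see seccomp(2). */
void
build_notify_filter(struct sock_filter insns[NOTIFY_FILTER_LEN], int nr)
{
    struct sock_filter prog[NOTIFY_FILTER_LEN] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* If it equals 'nr', fall through to the USER_NOTIF return;
           otherwise skip one instruction to the ALLOW return. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < NOTIFY_FILTER_LEN; i++)
        insns[i] = prog[i];
}

/* Install the filter on the calling thread.  Returns the listening
   file descriptor on success, or -1 with errno set on failure. */
int
install_notify_filter(int nr)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog = { .len = NOTIFY_FILTER_LEN, .filter = insns };

    build_notify_filter(insns, nr);

    /* Required unless the caller has CAP_SYS_ADMIN; see seccomp(2). */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

       The descriptor returned by install_notify_filter() would then be passed to the supervisor as described in step 2.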

IOCTL OPERATIONS
       The  following  ioctl(2) operations are supported by the seccomp user-space notification file descriptor.  For each of these operations, the first (file descrip‐
       tor) argument of ioctl(2) is the listening file descriptor returned by a call to seccomp(2) with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.

   SECCOMP_IOCTL_NOTIF_RECV
       The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is used to obtain a user-space notification event.  If no such event is currently pending, the
       operation  blocks  until  an  event  occurs.   The third ioctl(2) argument is a pointer to a structure of the following form which contains information about the
       event.  This structure must be zeroed out before the call.

           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };

       The fields in this structure are as follows:

       id     This is a cookie for the notification.  Each such cookie is guaranteed to be unique for the corresponding seccomp filter.

              • The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation described below.

              • When returning a notification response to the kernel, the supervisor must include the cookie value in the seccomp_notif_resp structure that is specified
                as the argument of the SECCOMP_IOCTL_NOTIF_SEND operation.

       pid    This is the thread ID of the target thread that triggered the notification event.

       flags  This is a bit mask of flags providing further information on the event.  In the current implementation, this field is always zero.

       data   This  is a seccomp_data structure containing information about the system call that triggered the notification.  This is the same structure that is passed
              to the seccomp filter.  See seccomp(2) for details of this structure.

       On success, this operation returns 0; on failure, -1 is returned, and errno is set to indicate the cause of the error.  This operation can fail with the  follow‐
       ing errors:

       EINVAL (since Linux 5.5)
              The seccomp_notif structure that was passed to the call contained nonzero fields.

       ENOENT The  target  thread was killed by a signal as the notification information was being generated, or the target's (blocked) system call was interrupted by a
              signal handler.
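
       The following sketch shows one way a supervisor might size its buffer and fetch a notification.  The helper names (notif_buf_size(), alloc_notif(),
       recv_notif()) are hypothetical.  Taking the larger of the kernel-reported size and the compiled-in structure size is a defensive assumption of this sketch,
       not a requirement stated on this page.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Return the larger of the size reported by the kernel and the size of
   the structure that the program was compiled against. */
size_t
notif_buf_size(size_t kernel_size, size_t compiled_size)
{
    return kernel_size > compiled_size ? kernel_size : compiled_size;
}

/* Allocate a buffer for notification events.  On success, '*size'
   receives the allocated size.  Returns NULL on failure. */
struct seccomp_notif *
alloc_notif(size_t *size)
{
    struct seccomp_notif_sizes sizes;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return NULL;

    *size = notif_buf_size(sizes.seccomp_notif,
                           sizeof(struct seccomp_notif));
    return malloc(*size);
}

/* Fetch one notification into 'req' (a buffer of 'size' bytes).
   Returns 1 if 'req' was filled in, 0 if the target was killed or its
   system call was interrupted (ENOENT; the caller can simply wait for
   the next event), and -1 on any other error. */
int
recv_notif(int notify_fd, struct seccomp_notif *req, size_t size)
{
    memset(req, 0, size);       /* The structure must be zeroed out */

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;

    return (errno == ENOENT) ? 0 : -1;
}
```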

   SECCOMP_IOCTL_NOTIF_ID_VALID
       The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0) is used to check that a notification ID returned by  an  earlier  SECCOMP_IOCTL_NOTIF_RECV
       operation is still valid (i.e., that the target still exists and its system call is still blocked waiting for a response).

       The third ioctl(2) argument is a pointer to the cookie (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.

       This operation is necessary to avoid race conditions that can occur when the thread whose ID (pid) was returned by the SECCOMP_IOCTL_NOTIF_RECV operation
       terminates, and that thread ID is reused by another thread or process.  An example of this kind of race is the following:

       1. A notification is generated on the listening file descriptor.  The returned seccomp_notif contains the TID of the target thread  (in  the  pid  field  of  the
          structure).

       2. The target terminates.

       3. Another thread or process is created on the system that by chance reuses the TID that was freed when the target terminated.

       4. The supervisor open(2)s the /proc/[tid]/mem file for the TID obtained in step 1, with the intention of (say) inspecting the memory location(s) that contain
          the argument(s) of the system call that triggered the notification in step 1.

       In the above scenario, the risk is that the supervisor may try to access the memory of a process other than the target.  This race can be  avoided  by  following
       the  call  to open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to verify that the process that generated the notification is still alive.  (Note that if the
       target terminates after the latter step, a subsequent read(2) from the file descriptor may return 0, indicating end of file.)

       See NOTES for a discussion of other cases where SECCOMP_IOCTL_NOTIF_ID_VALID checks must be performed.

       On success (i.e., the notification ID is still valid), this operation returns 0.  On failure (i.e., the notification ID is no longer valid), -1 is returned,  and
       errno is set to ENOENT.
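
       The open-then-check ordering described above might look like the following sketch.  The function name open_target_mem() is hypothetical, and treating every
       failure of the validity check (not only ENOENT) as "do not use this file descriptor" is a conservative choice made for this illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open /proc/[tid]/mem for the target of notification 'id'.  Returns a
   file descriptor on success, or -1 if the open failed or the
   notification is no longer valid (in which case the TID may already
   belong to an unrelated thread or process). */
int
open_target_mem(int notify_fd, pid_t tid, __u64 id)
{
    char path[64];
    int mem_fd;

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) tid);

    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The check comes AFTER the open: if the notification is still
       valid at this point, the file opened above refers to the target
       and not to some later user of a recycled TID. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```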

   SECCOMP_IOCTL_NOTIF_SEND
       The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is used to send a notification response back to the kernel.  The third ioctl(2) argument
       is a pointer to a structure of the following form:

           struct seccomp_notif_resp {
               __u64 id;           /* Cookie value */
               __s64 val;          /* Success return value */
               __s32 error;        /* 0 (success) or negative error number */
               __u32 flags;        /* See below */
           };

       The fields of this structure are as follows:

       id     This is the cookie value that was obtained using the SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie value allows the kernel to correctly associate  this
              response with the system call that triggered the user-space notification.

       val    This is the value that will be used for a spoofed success return for the target's system call; see below.

       error  This is the value that will be used as the error number (errno) for a spoofed error return for the target's system call; see below.

       flags  This is a bit mask that includes zero or more of the following flags:

              SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                     Tell the kernel to execute the target's system call.

       Two kinds of response are possible:

       • A response to the kernel telling it to execute the target's system call.  In this case, the flags field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
         and val fields must be zero.

         This kind of response can be useful in cases where the supervisor needs to do deeper analysis of the target's system call than is possible from a seccomp  fil‐
         ter  (e.g.,  examining  the values of pointer arguments), and, having decided that the system call does not require emulation by the supervisor, the supervisor
         wants the system call to be executed normally in the target.

         The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with caution; see NOTES.

       • A spoofed return value for the target's system call.  In this case, the kernel does not execute the target's system call, instead causing the  system  call  to
         return a spoofed value as specified by fields of the seccomp_notif_resp structure.  The supervisor should set the fields of this structure as follows:

         +  flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.

         +  error is set either to 0 for a spoofed "success" return or to a negative error number for a spoofed "failure" return.  In the former case, the kernel causes
            the target's system call to return the value specified in the val field.  In the latter case, the kernel causes the target's system call to return  -1,  and
            errno is assigned the negated error value.

         +  val is set to a value that will be used as the return value for a spoofed "success" return for the target's system call.  The value in this field is ignored
            if the error field contains a nonzero value.

       On success, this operation returns 0; on failure, -1 is returned, and errno is set to indicate the cause of the error.  This operation can fail with the  follow‐
       ing errors:

       EINPROGRESS
              A response to this notification has already been sent.

       EINVAL An invalid value was specified in the flags field.

       EINVAL The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or val field was not zero.

       ENOENT The blocked system call in the target has been interrupted by a signal handler or the target has terminated.
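
       The two kinds of response can be illustrated with a few small helpers that fill in a seccomp_notif_resp structure.  The helper names are hypothetical; the
       fallback definition of SECCOMP_USER_NOTIF_FLAG_CONTINUE is included only so that the sketch also compiles against kernel headers older than Linux 5.5.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE    /* Headers older than 5.5 */
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Response telling the kernel to execute the target's system call. */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));     /* 'error' and 'val' must be 0 */
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Spoofed "success": the target's system call returns 'val'. */
void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": the target's system call returns -1 and errno is
   set to 'err', which is given as a positive value such as EPERM. */
void
resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;                 /* Negative error number */
}
```

       In each case the filled-in structure would then be passed to ioctl(2) with SECCOMP_IOCTL_NOTIF_SEND.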

   SECCOMP_IOCTL_NOTIF_ADDFD
       The  SECCOMP_IOCTL_NOTIF_ADDFD  operation (available since Linux 5.9) allows the supervisor to install a file descriptor into the target's file descriptor table.
       Much like the use of SCM_RIGHTS messages described in unix(7), this operation is semantically equivalent to duplicating a file descriptor from  the  supervisor's
       file descriptor table into the target's file descriptor table.

       The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emulate a target system call (such as socket(2) or openat(2)) that generates a file descriptor.
       The supervisor can perform the system call that generates the file descriptor (and associated open file description) and then use this operation  to  allocate  a
       file descriptor that refers to the same open file description in the target.  (For an explanation of open file descriptions, see open(2).)

       Once this operation has been performed, the supervisor can close its copy of the file descriptor.

       In  the target, the received file descriptor is subject to the same Linux Security Module (LSM) checks as are applied to a file descriptor that is received in an
       SCM_RIGHTS ancillary message.  If the file descriptor refers to a socket, it inherits the cgroup version 1 network controller settings (classid  and  netprioidx)
       of the target.

       The third ioctl(2) argument is a pointer to a structure of the following form:

           struct seccomp_notif_addfd {
               __u64 id;           /* Cookie value */
               __u32 flags;        /* Flags */
               __u32 srcfd;        /* Local file descriptor number */
               __u32 newfd;        /* 0 or desired file descriptor
                                      number in target */
               __u32 newfd_flags;  /* Flags to set on target file
                                      descriptor */
           };

       The fields in this structure are as follows:

       id     This field should be set to the notification ID (cookie value) that was obtained via SECCOMP_IOCTL_NOTIF_RECV.

       flags  This field is a bit mask of flags that modify the behavior of the operation.  The following flags are supported:

              SECCOMP_ADDFD_FLAG_SETFD
                     When allocating the file descriptor in the target, use the file descriptor number specified in the newfd field.

              SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
                     Perform  the  equivalent  of  SECCOMP_IOCTL_NOTIF_ADDFD plus SECCOMP_IOCTL_NOTIF_SEND as an atomic operation.  On successful invocation, the target
                     process's errno will be 0 and the return value will be the file descriptor number that was allocated in the target.  If  allocating  the  file  de‐
                     scriptor in the target fails, the target's system call continues to be blocked until a successful response is sent.

       srcfd  This field should be set to the number of the file descriptor in the supervisor that is to be duplicated.

       newfd  This  field  determines  which  file descriptor number is allocated in the target.  If the SECCOMP_ADDFD_FLAG_SETFD flag is set, then this field specifies
              which file descriptor number should be allocated.  If this file descriptor number is already open in the target, it is atomically closed and  reused.   If
              the descriptor duplication fails due to an LSM check, or if srcfd is not a valid file descriptor, the file descriptor newfd will not be closed in the tar‐
              get process.

              If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field must be 0, and the kernel allocates the lowest unused file descriptor number in the
              target.

       newfd_flags
              This field is a bit mask specifying flags that should be set on the file descriptor that is received in the target process.  Currently, only the following
              flag is implemented:

              O_CLOEXEC
                     Set the close-on-exec flag on the received file descriptor.

       On success, this ioctl(2) call returns the number of the file descriptor that was allocated in the target.  Assuming that the emulated system call  is  one  that
       returns  a  file descriptor as its function result (e.g., socket(2)), this value can be used as the return value (resp.val) that is supplied in the response that
       is subsequently sent with the SECCOMP_IOCTL_NOTIF_SEND operation.

       On error, -1 is returned and errno is set to indicate the cause of the error.

       This operation can fail with the following errors:

       EBADF  Allocating the file descriptor in the target would cause the target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).

       EBUSY  If the flag SECCOMP_ADDFD_FLAG_SEND is used, this means the operation can't proceed until other SECCOMP_IOCTL_NOTIF_ADDFD requests are processed.

       EINPROGRESS
              The user-space notification specified in the id field exists but has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has already been responded to
              (by a SECCOMP_IOCTL_NOTIF_SEND).

       EINVAL An  invalid flag was specified in the flags or newfd_flags field, or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD flag was not specified in
              the flags field.

       EMFILE The file descriptor number specified in newfd exceeds the limit specified in /proc/sys/fs/nr_open.

       ENOENT The blocked system call in the target has been interrupted by a signal handler or the target has terminated.

       Here is some sample code (with error handling omitted) that uses the SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to openat(2)):

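           int fd, targetFd;

           /* 'path' is assumed to hold the pathname that the supervisor
              has already read from the target's memory (code omitted) */
           fd = openat(req->data.args[0], path, req->data.args[2],
                           req->data.args[3]);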

           struct seccomp_notif_addfd addfd;
           addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
           addfd.srcfd = fd;
           addfd.newfd = 0;
           addfd.flags = 0;
           addfd.newfd_flags = O_CLOEXEC;

           targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

           close(fd);          /* No longer needed in supervisor */

           struct seccomp_notif_resp *resp;
               /* Code to allocate 'resp' omitted */
           resp->id = req->id;
           resp->error = 0;        /* "Success" */
           resp->val = targetFd;
           resp->flags = 0;
           ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
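
       The sample above lets the kernel choose the descriptor number.  As a complementary sketch, the helper below (fill_addfd(), a hypothetical name) prepares a
       request that either does the same or, when a nonnegative target_fd is given, asks for that specific number via SECCOMP_ADDFD_FLAG_SETFD.  Using a negative
       value to mean "no preference" is a convention of this sketch only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <linux/seccomp.h>

/* Prepare a SECCOMP_IOCTL_NOTIF_ADDFD request.  If 'target_fd' is
   negative, 'newfd' stays 0 and the kernel allocates the lowest unused
   descriptor number in the target; otherwise 'target_fd' is requested
   using SECCOMP_ADDFD_FLAG_SETFD.  If 'cloexec' is nonzero, the
   close-on-exec flag is set on the descriptor in the target. */
void
fill_addfd(struct seccomp_notif_addfd *addfd, __u64 id, int src_fd,
           int target_fd, int cloexec)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;
    addfd->srcfd = src_fd;

    if (target_fd >= 0) {
        /* 'newfd' may be nonzero only together with this flag. */
        addfd->flags = SECCOMP_ADDFD_FLAG_SETFD;
        addfd->newfd = target_fd;
    }

    if (cloexec)
        addfd->newfd_flags = O_CLOEXEC;
}
```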

NOTES
       One example use case for the user-space notification mechanism is to allow a container manager (a process which is typically running with more privilege than the
       processes inside the container) to mount block devices or create device nodes for the container.  The mount use case provides an example of where the
       SECCOMP_USER_NOTIF_FLAG_CONTINUE flag is useful.  Upon receiving a notification for the mount(2) system call, the container manager (the "supervisor")
       can distinguish a request to mount a block filesystem (which would not be possible for a "target" process inside the container) and mount that filesystem.  If,
       on the other hand, the container manager detects that the operation could be performed by the process inside the container (e.g., a mount of a tmpfs(5)  filesys‐
       tem), it can notify the kernel that the target process's mount(2) system call can continue.

   select()/poll()/epoll semantics
       The  file descriptor returned when seccomp(2) is employed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2), epoll(7), and select(2).
       These interfaces indicate that the file descriptor is ready as follows:

       • When a notification is pending, these interfaces indicate that the file descriptor is readable.  Following such an indication, a subsequent
         SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning either information about a notification or else failing with the error ENOENT if the target has
         been killed by a signal or its system call has been interrupted by a signal handler.

       • After the notification has been received (i.e., by the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate  that  the  file  descriptor  is
         writable, meaning that a notification response can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.

       • After  the  last  thread  using the filter has terminated and been reaped using waitpid(2) (or similar), the file descriptor indicates an end-of-file condition
         (readable in select(2); POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
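
       The readiness rules above could be mapped onto poll(2) results roughly as follows.  The enum and function names are hypothetical, and the order in which the
       event bits are tested (hang-up first, then readable, then writable) is a choice made for this sketch.

```c
#define _GNU_SOURCE
#include <poll.h>

enum notify_state {
    NOTIFY_NONE,            /* Nothing to do (or poll() failed/timed out) */
    NOTIFY_PENDING,         /* A notification can be received */
    NOTIFY_CAN_SEND,        /* A response can be sent */
    NOTIFY_TARGET_GONE      /* All threads using the filter are gone */
};

/* Translate poll(2) 'revents' for the listening file descriptor. */
enum notify_state
classify_revents(short revents)
{
    if (revents & POLLHUP)
        return NOTIFY_TARGET_GONE;
    if (revents & POLLIN)
        return NOTIFY_PENDING;
    if (revents & POLLOUT)
        return NOTIFY_CAN_SEND;
    return NOTIFY_NONE;
}

/* Wait up to 'timeout_ms' milliseconds for activity on 'notify_fd'. */
enum notify_state
wait_notify(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN | POLLOUT };

    if (poll(&pfd, 1, timeout_ms) <= 0)
        return NOTIFY_NONE;

    return classify_revents(pfd.revents);
}
```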

   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
       The intent of the user-space notification feature is to allow system calls to be performed on behalf of the target.  The target's system call  should  either  be
       handled by the supervisor or allowed to continue normally in the kernel (where standard security policies will be applied).

       Note  well:  this  mechanism must not be used to make security policy decisions about the system call, which would be inherently race-prone for reasons described
       next.

       The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution.  If set by the supervisor, the target's system call will  continue.   However,  there  is  a
       time-of-check,  time-of-use  race here, since an attacker could exploit the interval of time where the target is blocked waiting on the "continue" response to do
       things such as rewriting the system call arguments.

       Note furthermore that a user-space notifier can be bypassed if the existing filters allow the use of seccomp(2) or prctl(2) to install a filter that  returns  an
       action value with a higher precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).

       It should thus be absolutely clear that the seccomp user-space notification mechanism cannot be used to implement a security policy!  It should only ever be
       used in scenarios where a more privileged process supervises the system calls of a lesser privileged target to get around kernel-enforced  security  restrictions
       when  the  supervisor  deems this safe.  In other words, in order to continue a system call, the supervisor should be sure that another security mechanism or the
       kernel itself will sufficiently block the system call if its arguments are rewritten to something unsafe.

   Caveats regarding the use of /proc/[tid]/mem
       The discussion above noted the need to use the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the target to avoid the possibility
       of  accessing  the memory of the wrong process in the event that the target terminates and its ID is recycled by another (unrelated) thread.  However, the use of
       this ioctl(2) operation is also necessary in other situations, as explained in the following paragraphs.

       Consider the following scenario, where the supervisor tries to read the pathname argument of a target's blocked mount(2) system call:

       • From one of its functions (func()), the target calls mount(2), which triggers a user-space notification and causes the target to block.

       • The supervisor receives the notification, opens /proc/[tid]/mem, and (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.

       • The target receives a signal, which causes the mount(2) to abort.

       • The signal handler executes in the target, and returns.

       • Upon return from the handler, the execution of func() resumes, and it returns (and perhaps other functions are called, overwriting the  memory  that  had  been
         used for the stack frame of func()).

       • Using the address provided in the notification information, the supervisor reads from the target's memory location that used to contain the pathname.

       • The supervisor now calls mount(2) with some arbitrary bytes obtained in the previous step.

       The  conclusion from the above scenario is this: since the target's blocked system call may be interrupted by a signal handler, the supervisor must be written to
       expect that the target may abandon its system call at any time; in such an event, any information that the supervisor obtained from the target's memory  must  be
       considered invalid.

       To  prevent  such scenarios, every read from the target's memory must be separated from use of the bytes so obtained by a SECCOMP_IOCTL_NOTIF_ID_VALID check.  In
       the above example, the check would be placed between the two final steps.  An example of such a check is shown in EXAMPLES.

       Following on from the above, it should be clear that a write by the supervisor into the target's memory can never be considered safe.
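
       The read-then-check rule described above can be sketched as a helper that refuses to hand back bytes unless the notification was still valid after the
       read completed.  The name read_target() is hypothetical, mem_fd is assumed to be an already opened (and already validated) /proc/[tid]/mem descriptor,
       and the cast to off_t assumes a 64-bit off_t.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy up to 'len' bytes from address 'addr' in the target into 'buf'.
   Returns the number of bytes read, or -1 if the read failed or the
   notification was no longer valid afterward.  In the -1 case the
   contents of 'buf' must not be used. */
ssize_t
read_target(int notify_fd, __u64 id, int mem_fd, __u64 addr,
            void *buf, size_t len)
{
    ssize_t n;

    n = pread(mem_fd, buf, len, (off_t) addr);
    if (n <= 0)
        return -1;

    /* The target may have abandoned its system call while the read was
       in progress.  Only if the notification is still valid now can
       the bytes be taken to be the system call's argument. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return -1;

    return n;
}
```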

   Caveats regarding blocking system calls
       Suppose that the target performs a blocking system call (e.g., accept(2)) that the supervisor should handle.  The supervisor might then in turn execute the  same
       blocking system call.

       In this scenario, it is important to note that if the target's system call is now interrupted by a signal, the supervisor is not informed of this.  If the super‐
       visor does not take suitable steps to actively discover that the target's system call has been canceled, various difficulties can occur.  Taking the  example  of
       accept(2),  the supervisor might remain blocked in its accept(2) holding a port number that the target (which, after the interruption by the signal handler, per‐
       haps closed  its listening socket) might expect to be able to reuse in a bind(2) call.

       Therefore, when the supervisor wishes to emulate a blocking system call, it must do so in such a way that it gets informed if the target's system call is  inter‐
       rupted  by  a signal handler.  For example, if the supervisor itself executes the same blocking system call, then it could employ a separate thread that uses the
       SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is still blocked in its system call.  Alternatively, in the accept(2) example, the supervisor might
       use poll(2) to monitor both the notification file descriptor (so as to discover when the target's accept(2) call has been interrupted) and the listening file de‐
       scriptor (so as to know when a connection is available).

       If the target's system call is interrupted, the supervisor must take care to release resources (e.g., file descriptors) that it acquired on behalf of the target.
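
       The waiting strategy described above can be sketched as follows.  This is a minimal sketch of a timeout-based variation, not a definitive implementation: rather than a separate thread, it polls the listening file descriptor with a short timeout and rechecks the notification ID with SECCOMP_IOCTL_NOTIF_ID_VALID between timeouts.  The helper name waitForConnectionOrCancel() and the parameter intervalMs are hypothetical.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical helper: wait until 'listenFd' has a pending connection,
   rechecking every 'intervalMs' milliseconds that the notification 'id'
   is still valid.

   Returns 1 if a connection is ready, 0 if the notification is no longer
   valid (the target abandoned its system call), or -1 on error. */

static int
waitForConnectionOrCancel(int notifyFd, uint64_t id, int listenFd,
                          int intervalMs)
{
    struct pollfd pfd = { .fd = listenFd, .events = POLLIN };

    for (;;) {

        /* Is the target still blocked in the notified system call?
           The ioctl fails with ENOENT if the ID is no longer valid. */

        __u64 cookie = id;
        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &cookie) == -1)
            return (errno == ENOENT) ? 0 : -1;

        int ready = poll(&pfd, 1, intervalMs);
        if (ready == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (ready > 0)
            return 1;       /* Caller can now try accept() on 'listenFd' */

        /* Timeout expired: loop around and recheck the notification ID */
    }
}
```

       When the helper returns 0, the caller would release any resources that it acquired on behalf of the target instead of performing the accept(2).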

   Interaction with SA_RESTART signal handlers
       Consider the following scenario:

       • The target process has used sigaction(2) to install a signal handler with the SA_RESTART flag.

       • The target has made a system call that triggered a seccomp user-space notification and the target is currently blocked until the supervisor sends  a  notifica‐
         tion response.

       • A signal is delivered to the target and the signal handler is executed.

       • When (if) the supervisor attempts to send a notification response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the ENOENT error.

       In  this scenario, the kernel will restart the target's system call.  Consequently, the supervisor will receive another user-space notification.  Thus, depending
       on how many times the blocked system call is interrupted by a signal handler, the supervisor may receive multiple notifications for the same instance of a system
       call in the target.
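
       A supervisor therefore needs to treat ENOENT from SECCOMP_IOCTL_NOTIF_SEND as an expected outcome rather than a fatal error.  The following is a minimal sketch of that classification; the helper name sendResponse() and its return-value convention are hypothetical.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Hypothetical helper: send 'resp' and classify the outcome.

   Returns 0 if the response was delivered, 1 if the notification was
   stale (ENOENT), or -1 on any other error. */

static int
sendResponse(int notifyFd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 0;

    if (errno == ENOENT)
        return 1;   /* Target's system call was interrupted; if the call
                       is restarted, a fresh notification will arrive */

    return -1;
}
```

       On a return value of 1, the supervisor would simply go back to waiting in SECCOMP_IOCTL_NOTIF_RECV.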

       One  oddity  is that system call restarting as described in this scenario will occur even for the blocking system calls listed in signal(7) that would never nor‐
       mally be restarted by the SA_RESTART flag.

       Furthermore, if the supervisor response is a file descriptor added with SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be  used  to  atomi‐
       cally add the file descriptor and return that value, making sure no file descriptors are inadvertently leaked into the target.
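
       The atomic add-and-respond operation might be used as in the following sketch.  The helper name addFdAndRespond() is hypothetical, and the fallback definition of SECCOMP_ADDFD_FLAG_SEND is included only on the assumption that the program may be built against user-space headers that predate the flag.

```c
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fallback for older user-space headers (assumption: the value matches
   the kernel's UAPI definition of this flag) */

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)
#endif

/* Hypothetical helper: install a duplicate of the supervisor's 'srcFd'
   in the target and, in the same step, complete the target's system
   call with the new file descriptor number as its return value.

   Returns the file descriptor number allocated in the target,
   or -1 on error. */

static int
addFdAndRespond(int notifyFd, uint64_t id, int srcFd)
{
    struct seccomp_notif_addfd addfd;

    memset(&addfd, 0, sizeof(addfd));
    addfd.id = id;              /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
    addfd.srcfd = srcFd;        /* Descriptor in the supervisor */
    addfd.flags = SECCOMP_ADDFD_FLAG_SEND;

    /* 'newfd' and 'newfd_flags' stay 0: since SECCOMP_ADDFD_FLAG_SETFD
       is not specified, the kernel chooses the descriptor number */

    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

       Because the response is sent as part of the same operation, no separate SECCOMP_IOCTL_NOTIF_SEND call is made for this notification.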

BUGS
       If  a  SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the target terminates, then the ioctl(2) call simply blocks (rather than returning an error
       to indicate that the target no longer exists).
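
       One possible way to avoid blocking indefinitely is to wait with poll(2) before performing SECCOMP_IOCTL_NOTIF_RECV.  The following is a minimal sketch, assuming a kernel on which the notification file descriptor reports POLLHUP once the last thread using the filter has terminated and been reaped; the helper name waitForNotification() is hypothetical.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds for activity
   on the notification file descriptor.

   Returns 1 if a notification is pending, 0 if the descriptor reports a
   hang-up (no target remains), or -1 on error or timeout (on timeout,
   errno is set to ETIMEDOUT). */

static int
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };

    for (;;) {
        int ready = poll(&pfd, 1, timeoutMs);
        if (ready == -1) {
            if (errno == EINTR)
                continue;       /* Retry (with the full timeout) */
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;
            return -1;
        }

        if (pfd.revents & POLLHUP)
            return 0;           /* Do not call SECCOMP_IOCTL_NOTIF_RECV */
        if (pfd.revents & POLLIN)
            return 1;           /* SECCOMP_IOCTL_NOTIF_RECV should not block */

        return -1;              /* POLLERR or POLLNVAL */
    }
}
```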

EXAMPLES
       The (somewhat contrived) program shown below demonstrates the use of the interfaces described in this page.  The program creates a child process that  serves  as
       the "target" process.  The child process installs a seccomp filter that returns the SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).  The child
       process then calls mkdir(2) once for each of the supplied command-line arguments, and reports the result returned by the call.  After processing  all  arguments,
       the child process terminates.

       The  parent  process acts as the supervisor, listening for the notifications that are generated when the target process calls mkdir(2).  When such a notification
       occurs, the supervisor examines the memory of the target process (using /proc/[pid]/mem) to discover the pathname argument that  was  supplied  to  the  mkdir(2)
       call, and performs one of the following actions:

       • If the pathname begins with the prefix "/tmp/", then the supervisor attempts to create the specified directory, and then spoofs a return for the target process
         based on the return value of the supervisor's mkdir(2) call.  In the event that that call succeeds, the spoofed success return value is the length of the path‐
         name.

       • If  the pathname begins with "./" (i.e., it is a relative pathname), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say that
         the kernel should execute the target process's mkdir(2) call.

       • If the pathname begins with some other prefix, the supervisor spoofs an error return for the target process, so that the target process's mkdir(2) call appears
         to fail with the error EOPNOTSUPP ("Operation not supported").  Additionally, if the specified pathname is exactly "/bye", then the supervisor terminates.
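
       The three cases above amount to a simple classification of the pathname.  The following sketch isolates that decision as a pure function; the enum and the function name chooseAction() are hypothetical and do not appear in the program source below, which makes the same checks inline.

```c
#include <string.h>

enum action {
    ACT_EMULATE,    /* Supervisor performs mkdir() and spoofs the result */
    ACT_CONTINUE,   /* Kernel executes the target's mkdir() call */
    ACT_REJECT      /* Supervisor spoofs an EOPNOTSUPP error */
};

/* Hypothetical helper: decide how the supervisor treats 'path', which
   must be a null-terminated string */

static enum action
chooseAction(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_EMULATE;

    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE;

    return ACT_REJECT;
}
```

       Note that a pathname of exactly "/tmp" (without a trailing slash) falls into the last case, as does "/bye".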

       This program can be used to demonstrate various aspects of the behavior of the seccomp user-space notification mechanism.  To aid such demonstrations, the
       program logs various messages to show the operation of the target process (lines prefixed "T:") and the supervisor (indented lines prefixed "S:").

       In the following example, the target attempts to create the directory /tmp/x.  Upon receiving the notification, the supervisor creates the directory on the  tar‐
       get's behalf, and spoofs a success return to be received by the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/x
           T: PID = 23168

           T: about to mkdir("/tmp/x")
                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
                   S: executing: mkdir("/tmp/x", 0700)
                   S: success! spoofed return = 6
                   S: sending response (flags = 0; val = 6; error = 0)
           T: SUCCESS: mkdir(2) returned 6

           T: terminating
                   S: target has terminated; bye

       In  the  above output, note that the spoofed return value seen by the target process is 6 (the length of the pathname /tmp/x), whereas a normal mkdir(2) call re‐
       turns 0 on success.

       In the next example, the target attempts to create a directory using the relative pathname ./sub.  Since this pathname starts with "./", the supervisor  sends  a
       SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel, and the kernel then (successfully) executes the target process's mkdir(2) call.

           $ ./seccomp_unotify ./sub
           T: PID = 23204

           T: about to mkdir("./sub")
                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
                   S: target can execute system call
                   S: sending response (flags = 0x1; val = 0; error = 0)
           T: SUCCESS: mkdir(2) returned 0

           T: terminating
                   S: target has terminated; bye

       If the target process attempts to create a directory with a pathname that doesn't start with "./" and doesn't begin with the prefix "/tmp/", then the supervisor
       spoofs an error return (EOPNOTSUPP, "Operation not  supported") for the target's mkdir(2) call (which is not executed):

           $ ./seccomp_unotify /xxx
           T: PID = 23178

           T: about to mkdir("/xxx")
                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
           T: ERROR: mkdir(2): Operation not supported

           T: terminating
                   S: target has terminated; bye

       In the next example, the target process attempts to create a directory with the pathname /tmp/nosuchdir/b.  Upon receiving the notification, the  supervisor  at‐
       tempts  to  create that directory, but the mkdir(2) call fails because the directory /tmp/nosuchdir does not exist.  Consequently, the supervisor spoofs an error
       return that passes the error that it received back to the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/nosuchdir/b
           T: PID = 23199

           T: about to mkdir("/tmp/nosuchdir/b")
                   S: got notification (ID 0x8744454293506046) for PID 23199
                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
                   S: failure! (errno = 2; No such file or directory)
                   S: sending response (flags = 0; val = 0; error = -2)
           T: ERROR: mkdir(2): No such file or directory

           T: terminating
                   S: target has terminated; bye

       If the supervisor receives a notification and sees that the argument of the target's mkdir(2) is the string "/bye", then (as well as spoofing an  EOPNOTSUPP  er‐
       ror),  the  supervisor  terminates.   If  the  target  process  subsequently  executes  another  mkdir(2)  that  triggers  its  seccomp filter to return the SEC‐
       COMP_RET_USER_NOTIF action value, then the kernel causes the target process's system call to fail with the error ENOSYS ("Function not  implemented").   This  is
       demonstrated by the following example:

           $ ./seccomp_unotify /bye /tmp/y
           T: PID = 23185

           T: about to mkdir("/bye")
                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
                   S: terminating **********
           T: ERROR: mkdir(2): Operation not supported

           T: about to mkdir("/tmp/y")
           T: ERROR: mkdir(2): Function not implemented

           T: terminating

   Program source
       #define _GNU_SOURCE
       #include <errno.h>
       #include <fcntl.h>
       #include <limits.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <signal.h>
       #include <stdbool.h>
       #include <stddef.h>
       #include <stdint.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/socket.h>
       #include <sys/ioctl.h>
       #include <sys/prctl.h>
       #include <sys/stat.h>
       #include <sys/types.h>
       #include <sys/un.h>
       #include <sys/syscall.h>
       #include <unistd.h>

       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                               } while (0)

       /* Send the file descriptor 'fd' over the connected UNIX domain socket
          'sockfd'. Returns 0 on success, or -1 on error. */

       static int
       sendfd(int sockfd, int fd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data;
           struct cmsghdr *cmsgp;

           /* Allocate a char array of suitable size to hold the ancillary data.
              However, since this buffer is in reality a 'struct cmsghdr', use a
              union to ensure that it is suitably aligned. */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
                               /* Space large enough to hold an 'int' */
               struct cmsghdr align;
           } controlMsg;

           /* The 'msg_name' field can be used to specify the address of the
              destination socket when sending a datagram. However, we do not
              need to use this field because 'sockfd' is a connected socket. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* On Linux, we must transmit at least one byte of real data in
              order to send ancillary data. We transmit an arbitrary integer
              whose value is ignored by recvfd(). */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;
           iov.iov_len = sizeof(int);
           data = 12345;

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Set up ancillary data describing file descriptor to send */

           cmsgp = CMSG_FIRSTHDR(&msgh);
           cmsgp->cmsg_level = SOL_SOCKET;
           cmsgp->cmsg_type = SCM_RIGHTS;
           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

           /* Send real plus ancillary data */

           if (sendmsg(sockfd, &msgh, 0) == -1)
               return -1;

           return 0;
       }

       /* Receive a file descriptor on a connected UNIX domain socket. Returns
          the received file descriptor on success, or -1 on error. */

       static int
       recvfd(int sockfd)
       {
           struct msghdr msgh;
           struct iovec iov;
           int data, fd;
           ssize_t nr;

           /* Allocate a char buffer for the ancillary data. See the comments
              in sendfd() */
           union {
               char   buf[CMSG_SPACE(sizeof(int))];
               struct cmsghdr align;
           } controlMsg;
           struct cmsghdr *cmsgp;

           /* The 'msg_name' field can be used to obtain the address of the
              sending socket. However, we do not need this information. */

           msgh.msg_name = NULL;
           msgh.msg_namelen = 0;

           /* Specify buffer for receiving real data */

           msgh.msg_iov = &iov;
           msgh.msg_iovlen = 1;
           iov.iov_base = &data;       /* Real data is an 'int' */
           iov.iov_len = sizeof(int);

           /* Set 'msghdr' fields that describe ancillary data */

           msgh.msg_control = controlMsg.buf;
           msgh.msg_controllen = sizeof(controlMsg.buf);

           /* Receive real plus ancillary data; real data is ignored */

           nr = recvmsg(sockfd, &msgh, 0);
           if (nr == -1)
               return -1;

           cmsgp = CMSG_FIRSTHDR(&msgh);

           /* Check the validity of the 'cmsghdr' */

           if (cmsgp == NULL ||
                   cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
                   cmsgp->cmsg_level != SOL_SOCKET ||
                   cmsgp->cmsg_type != SCM_RIGHTS) {
               errno = EINVAL;
               return -1;
           }

           /* Return the received file descriptor to our caller */

           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
           return fd;
       }

       static void
       sigchldHandler(int sig)
       {
           char msg[] = "\tS: target has terminated; bye\n";

           write(STDOUT_FILENO, msg, sizeof(msg) - 1);
           _exit(EXIT_SUCCESS);
       }

       static int
       seccomp(unsigned int operation, unsigned int flags, void *args)
       {
           return syscall(__NR_seccomp, operation, flags, args);
       }

       /* The following is the x86-64-specific BPF boilerplate code for checking
          that the BPF program is running on the right architecture + ABI. At
          completion of these instructions, the accumulator contains the system
          call number. */

       /* For the x32 ABI, all system call numbers have bit 30 set */

       #define X32_SYSCALL_BIT         0x40000000

       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                       (offsetof(struct seccomp_data, arch))), \
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                        (offsetof(struct seccomp_data, nr))), \
               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

       /* installNotifyFilter() installs a seccomp filter that generates
          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
          calls mkdir(2); the filter allows all other system calls.

          The function return value is a file descriptor from which the
          user-space notifications can be fetched. */

       static int
       installNotifyFilter(void)
       {
           struct sock_filter filter[] = {
               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

               /* mkdir() triggers notification to user-space supervisor */

               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

               /* Every other system call is allowed */

               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
           };

           struct sock_fprog prog = {
               .len = sizeof(filter) / sizeof(filter[0]),
               .filter = filter,
           };

           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
              as a result, seccomp() returns a notification file descriptor. */

           int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
           if (notifyFd == -1)
               errExit("seccomp-install-notify-filter");

           return notifyFd;
       }

       /* Close a pair of sockets created by socketpair() */

       static void
       closeSocketPair(int sockPair[2])
       {
           if (close(sockPair[0]) == -1)
               errExit("closeSocketPair-close-0");
           if (close(sockPair[1]) == -1)
               errExit("closeSocketPair-close-1");
       }

       /* Implementation of the target process; create a child process that:

          (1) installs a seccomp filter with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
          (2) writes the seccomp notification file descriptor returned from
              the previous step onto the UNIX domain socket, 'sockPair[0]';
          (3) calls mkdir(2) for each element of 'argv'.

          The function return value in the parent is the PID of the child
          process; the child does not return from this function. */

       static pid_t
       targetProcess(int sockPair[2], char *argv[])
       {
           pid_t targetPid = fork();
           if (targetPid == -1)
               errExit("fork");

           if (targetPid > 0)          /* In parent, return PID of child */
               return targetPid;

           /* Child falls through to here */

           printf("T: PID = %ld\n", (long) getpid());

           /* Install seccomp filter(s) */

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
               errExit("prctl");

           int notifyFd = installNotifyFilter();

           /* Pass the notification file descriptor to the supervisor process
              over a UNIX domain socket */

           if (sendfd(sockPair[0], notifyFd) == -1)
               errExit("sendfd");

           /* Notification and socket FDs are no longer needed in target */

           if (close(notifyFd) == -1)
               errExit("close-target-notify-fd");

           closeSocketPair(sockPair);

           /* Perform a mkdir() call for each of the command-line arguments */

           for (char **ap = argv; *ap != NULL; ap++) {
               printf("\nT: about to mkdir(\"%s\")\n", *ap);

               int s = mkdir(*ap, 0700);
               if (s == -1)
                   perror("T: ERROR: mkdir(2)");
               else
                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
           }

           printf("\nT: terminating\n");
           exit(EXIT_SUCCESS);
       }

       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
          operation is still valid. It will no longer be valid if the target
          process has terminated or is no longer blocked in the system call that
          generated the notification (because it was interrupted by a signal).

          This operation can be used when doing such things as accessing
          /proc/PID files in the target process in order to avoid TOCTOU race
          conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
          terminates and is reused by another process. */

       static bool
       cookieIsValid(int notifyFd, uint64_t id)
       {
           return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
       }

       /* Access the memory of the target process in order to fetch the
          pathname referred to by the system call argument 'argNum' in
          'req->data.args[]'.  The pathname is returned in 'path',
          a buffer of 'len' bytes allocated by the caller.

          Returns true if the pathname is successfully fetched, and false
          otherwise. For possible causes of failure, see the comments below. */

       static bool
       getTargetPathname(struct seccomp_notif *req, int notifyFd,
                         int argNum, char *path, size_t len)
       {
           char procMemPath[PATH_MAX];

           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

           int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
           if (procMemFd == -1)
               return false;

           /* Check that the process whose info we are accessing is still alive
              and blocked in the system call that caused the notification.
              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
              cookieIsValid()) succeeded, we know that the /proc/PID/mem file
              descriptor that we opened corresponded to the process for which we
              received a notification. If that process subsequently terminates,
              then read() on that file descriptor will return 0 (EOF). */

           if (!cookieIsValid(notifyFd, req->id)) {
               close(procMemFd);
               return false;
           }

           /* Read bytes at the location containing the pathname argument */

           ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);

           close(procMemFd);

           if (nread <= 0)
               return false;

           /* Once again check that the notification ID is still valid. The
              case we are particularly concerned about here is that just
              before we fetched the pathname, the target's blocked system
              call was interrupted by a signal handler, and after the handler
              returned, the target carried on execution (past the interrupted
              system call). In that case, we have no guarantees about what we
              are reading, since the target's memory may have been arbitrarily
              changed by subsequent operations. */

           if (!cookieIsValid(notifyFd, req->id)) {
               perror("\tS: notification ID check failed!!!");
               return false;
           }

           /* Even if the target's system call was not interrupted by a signal,
              we have no guarantees about what was in the memory of the target
              process. (The memory may have been modified by another thread, or
              even by an external attacking process.) We therefore treat the
              buffer returned by pread() as untrusted input. The buffer should
              contain a terminating null byte; if not, then we will trigger an
              error for the target process. */

           if (strnlen(path, nread) < nread)
               return true;

           return false;
       }

       /* Allocate buffers for the seccomp user-space notification request and
          response structures. It is the caller's responsibility to free the
          buffers returned via 'req' and 'resp'. */

       static void
       allocSeccompNotifBuffers(struct seccomp_notif **req,
               struct seccomp_notif_resp **resp,
               struct seccomp_notif_sizes *sizes)
       {
           /* Discover the sizes of the structures that are used to receive
              notifications and send notification responses, and allocate
              buffers of those sizes. */

           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
               errExit("seccomp-SECCOMP_GET_NOTIF_SIZES");

           *req = malloc(sizes->seccomp_notif);
           if (*req == NULL)
               errExit("malloc-seccomp_notif");

           /* When allocating the response buffer, we must allow for the fact
              that the user-space binary may have been built with user-space
              headers where 'struct seccomp_notif_resp' is bigger than the
              response buffer expected by the (older) kernel. Therefore, we
              allocate a buffer that is the maximum of the two sizes. This
              ensures that if the supervisor places bytes into the response
              structure that are past the response size that the kernel expects,
              then the supervisor is not touching an invalid memory location. */

           size_t resp_size = sizes->seccomp_notif_resp;
           if (sizeof(struct seccomp_notif_resp) > resp_size)
               resp_size = sizeof(struct seccomp_notif_resp);

           *resp = malloc(resp_size);
           if (*resp == NULL)
               errExit("malloc-seccomp_notif_resp");

       }

       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           struct seccomp_notif_sizes sizes;
           struct seccomp_notif *req;
           struct seccomp_notif_resp *resp;
           char path[PATH_MAX];

           allocSeccompNotifBuffers(&req, &resp, &sizes);

           /* Loop handling notifications */

           for (;;) {

               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   errExit("\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                       req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as a kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != __NR_mkdir) {
                   printf("\tS: notification contained unexpected "
                           "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               bool pathOK = getTargetPathname(req, notifyFd, 0, path,
                                               sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If getTargetPathname() failed, trigger an EINVAL error
                  response (sending this response may yield an error if the
                  failure occurred because the notification ID was no longer
                  valid); if the directory is in /tmp, then create it on behalf
                  of the target; if the pathname starts with "./", tell the
                  kernel to let the target process execute the mkdir();
                  otherwise, give an error for a directory pathname in any other
                  location. */

               if (!pathOK) {
                   resp->error = -EINVAL;
                   printf("\tS: spoofing error for invalid pathname (%s)\n",
                           strerror(-resp->error));
               } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                           path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                               resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                               strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                           strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                       "(flags = %#x; val = %lld; error = %d)\n",
                       resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                               "perhaps target process's syscall was "
                               "interrupted by a signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor breaks out
                  of the loop and terminates. This allows us to see what happens
                  if the target process makes further calls to mkdir(2). */

               if (pathOK && strcmp(path, "/bye") == 0)
                   break;
           }

           free(req);
           free(resp);
           printf("\tS: terminating **********\n");
           exit(EXIT_FAILURE);
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd = recvfd(sockPair[1]);
           if (notifyFd == -1)
               errExit("recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int sockPair[2];

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               errExit("socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           struct sigaction sa;
           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               errExit("sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }

SEE ALSO
       ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)

       A further example program can be found in the kernel source file samples/seccomp/user-trap.c.

Linux                                                                          2021-06-20                                                             SECCOMP_UNOTIFY(2)