The execve system call can grant a newly-started program privileges that
its parent did not have.  The most obvious examples are setuid/setgid
programs and file capabilities.  To prevent the parent program from
gaining these privileges as well, the kernel and user code must be
careful to prevent the parent from doing anything that could subvert the
child.  For example:

- The dynamic loader handles LD_* environment variables differently if
  a program is setuid.

- chroot is disallowed to unprivileged processes, since it would allow
  /etc/passwd to be replaced from the point of view of a process that
  inherited chroot.

- The exec code has special handling for ptrace, so that an
  unprivileged tracer cannot take over a program that gains privileges
  on exec.

These are all ad-hoc fixes.  The no_new_privs bit (since Linux 3.5) is a
new, generic mechanism to make it safe for a process to modify its
execution environment in a manner that persists across execve.  Any task
can set no_new_privs.  Once the bit is set, it is inherited across fork,
clone, and execve and cannot be unset.  With no_new_privs set, execve
promises not to grant the privilege to do anything that could not have
been done without the execve call.  For example, the setuid and setgid
bits will no longer change the uid or gid; file capabilities will not
add to the permitted set, and LSMs will not relax constraints after
execve.

To set no_new_privs, use prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
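
The prctl call above, together with the "inherited across fork" and
"cannot be unset" properties, can be sketched in C roughly as follows.
This is a minimal illustration, not kernel-provided code: the helper
names (enable_no_new_privs, query_no_new_privs, try_clear_no_new_privs,
child_sees_no_new_privs) are hypothetical, and it assumes a Linux 3.5+
kernel with headers that define PR_SET_NO_NEW_PRIVS and
PR_GET_NO_NEW_PRIVS.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set no_new_privs on the calling thread.
 * Returns 0 on success, -1 (with errno set) on failure. */
int enable_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* Sanity check: once set, the bit reads back as 1. */
    assert(prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1);
    return 0;
}

/* Returns 1 if the bit is set, 0 if it is clear, -1 on error. */
int query_no_new_privs(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}

/* Try to clear the bit.  The kernel rejects any value other than 1,
 * so this is expected to fail.  Returns the errno of the failed call,
 * or 0 if the call unexpectedly succeeded. */
int try_clear_no_new_privs(void)
{
    errno = 0;
    if (prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0) == 0)
        return 0;
    return errno;
}

/* Fork and report what the child observes:
 * 1 if the bit is set in the child, 0 if clear, -1 on error. */
int child_sees_no_new_privs(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: encode the flag in the exit status (2 means error). */
        int v = query_no_new_privs();
        _exit(v < 0 ? 2 : v);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

A caller would typically invoke enable_no_new_privs() once, early, and
treat a failure as fatal rather than continuing without the bit set.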

Be careful, though: LSMs might also not tighten constraints on exec
in no_new_privs mode.  (This means that setting up a general-purpose
service launcher to set no_new_privs before execing daemons may
interfere with LSM-based sandboxing.)

Note that no_new_privs does not prevent privilege changes that do not
involve execve.  An appropriately privileged task can still call
setuid(2) and receive SCM_RIGHTS datagrams.

There are two main use cases for no_new_privs so far:

- Filters installed for the seccomp mode 2 sandbox
  (SECCOMP_MODE_FILTER) persist across execve and can change the
  behavior of newly-executed programs.  Unprivileged users are
  therefore only allowed to install such filters if no_new_privs is
  set.

- By itself, no_new_privs can be used to reduce the attack surface
  available to an unprivileged user.  If everything running with a
  given uid has no_new_privs set, then that uid will be unable to
  escalate its privileges by directly attacking setuid, setgid, and
  file-capability-using binaries; it will need to compromise something
  without the no_new_privs bit set first.
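
The first use case can be sketched as follows: set no_new_privs, then
install a seccomp filter.  This is only an illustration under stated
assumptions.  The function names (install_allow_all_filter,
sandbox_self) are hypothetical, and the filter is a deliberately
trivial one-instruction program that allows every system call; a real
sandbox would inspect the syscall number and architecture.  Without
no_new_privs (and without CAP_SYS_ADMIN), the PR_SET_SECCOMP call is
expected to fail with EACCES.

```c
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Install a one-instruction filter that allows every system call.
 * Returns 0 on success, -1 (with errno set) on failure. */
int install_allow_all_filter(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Set no_new_privs first, then install the filter.
 * Returns 0 on success, -1 (with errno set) on failure. */
int sandbox_self(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* The bit must be visible before the filter is installed. */
    assert(prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1);
    return install_allow_all_filter();
}
```

Both the no_new_privs bit and the filter stay in effect for any
program the task later runs with execve.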

In the future, other potentially dangerous kernel features could become
available to unprivileged tasks if no_new_privs is set.  In principle,
several options to unshare(2) and clone(2) would be safe when
no_new_privs is set, and no_new_privs + chroot is considerably less
dangerous than chroot by itself.