 -*-Mode: outline-*-

 		Light-weight System Calls for IA-64
 		-----------------------------------

 		        Started: 13-Jan-2003
 		    Last update: 27-Sep-2003

 	              David Mosberger-Tang
 		      <davidm@hpl.hp.com>

 Using the "epc" instruction effectively introduces a new mode of
 execution to the ia64 linux kernel.  We call this mode the
 "fsys-mode".  To recap, the normal states of execution are:

   - kernel mode:
 	Both the register stack and the memory stack have been
 	switched over to kernel memory.  The user-level state is saved
 	in a pt_regs structure at the top of the kernel memory stack.

   - user mode:
 	Both the register stack and the memory stack are in
 	user memory.  The user-level state is contained in the
 	CPU registers.

   - bank 0 interruption-handling mode:
 	This is the non-interruptible state which all
 	interruption-handlers start execution in.  The user-level
 	state remains in the CPU registers and some kernel state may
 	be stored in bank 0 of registers r16-r31.

 In contrast, fsys-mode has the following special properties:

   - execution is at privilege level 0 (most-privileged)

   - CPU registers may contain a mixture of user-level and kernel-level
     state (it is the responsibility of the kernel to ensure that no
     security-sensitive kernel-level state is leaked back to
     user-level)

   - execution is interruptible and preemptible (an fsys-mode handler
     can prevent preemption by disabling interrupts and avoiding all
     other interruption-sources)

   - neither the memory-stack nor the register-stack can be trusted while
     in fsys-mode (they point to the user-level stacks, which may
     be invalid, or completely bogus addresses)

 In summary, fsys-mode is much more similar to running in user-mode
 than it is to running in kernel-mode.  Of course, given that the
 privilege level is at level 0, this means that fsys-mode requires some
 care (see below).


 * How to tell fsys-mode

 Linux operates in fsys-mode when (a) the privilege level is 0 (most
 privileged) and (b) the stacks have NOT been switched to kernel memory
 yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
 three macros:

 	user_mode(regs)
 	user_stack(task,regs)
 	fsys_mode(task,regs)

 The "regs" argument is a pointer to a pt_regs structure.  The "task"
 argument is a pointer to the task structure to which the "regs"
 pointer belongs.  user_mode() returns TRUE if the CPU state pointed
 to by "regs" was executing in user mode (privilege level 3).
 user_stack() returns TRUE if the state pointed to by "regs" was
 executing on the user-level stack(s).  Finally, fsys_mode() returns
 TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
 The fsys_mode() macro is equivalent to the expression:

 	!user_mode(regs) && user_stack(task,regs)
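
 The relationship between the three macros can be illustrated with a
 small stand-alone C model.  This is only a sketch: struct model_regs
 and the model_*() helpers are hypothetical stand-ins, not the kernel's
 definitions in <asm-ia64/ptrace.h>.  The model keeps the saved PSR in
 one 64-bit word (the cpl field occupies bits 32-33) and represents
 "the stacks have not been switched yet" with a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the saved machine state. */
struct model_regs {
    uint64_t psr;        /* saved PSR; cpl lives in bits 32-33 */
    bool on_user_stack;  /* true until the stacks are switched */
};

#define MODEL_PSR_CPL_SHIFT 32
#define MODEL_PSR_CPL_MASK  3ULL

/* Extract the privilege level (0 = most privileged, 3 = user). */
static unsigned int model_cpl(const struct model_regs *regs)
{
    return (unsigned int)((regs->psr >> MODEL_PSR_CPL_SHIFT) &
                          MODEL_PSR_CPL_MASK);
}

/* Model of user_mode(): the state was executing at privilege level 3. */
static bool model_user_mode(const struct model_regs *regs)
{
    return model_cpl(regs) == 3;
}

/* Model of user_stack(): the state was still on the user-level stacks. */
static bool model_user_stack(const struct model_regs *regs)
{
    return regs->on_user_stack;
}

/* Model of fsys_mode(): privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct model_regs *regs)
{
    assert(regs != NULL);
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

 In this model, only the combination "privilege level 0 and user
 stacks" counts as fsys-mode; plain user mode and full kernel mode both
 yield false.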

 * How to write an fsyscall handler

 The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
 (fsyscall_table).  This table contains one entry for each system call.
 By default, a system call is handled by fsys_fallback_syscall().  This
 routine takes care of entering (full) kernel mode and calling the
 normal Linux system call handler.  For performance-critical system
 calls, it is possible to write a hand-tuned fsyscall handler.  For
 example, fsys.S contains fsys_getpid(), which is a hand-tuned version
 of the getpid() system call.

 The entry and exit state of an fsyscall handler are as follows:

 ** Machine state on entry to fsyscall handler:

  - r10	  = 0
  - r11	  = saved ar.pfs (a user-level value)
  - r15	  = system call number
  - r16	  = "current" task pointer (in normal kernel-mode, this is in r13)
  - r32-r39 = system call arguments
  - b6	  = return address (a user-level value)
  - ar.pfs = previous frame-state (a user-level value)
  - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
  - all other registers may contain values passed in from user-mode

 ** Required machine state on exit from fsyscall handler:

  - r11	  = saved ar.pfs (as passed into the fsyscall handler)
  - r15	  = system call number (as passed into the fsyscall handler)
  - r32-r39 = system call arguments (as passed into the fsyscall handler)
  - b6	  = return address (as passed into the fsyscall handler)
  - ar.pfs = previous frame-state (as passed into the fsyscall handler)
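
 The exit requirements amount to "everything needed for a system call
 restart is unchanged".  The following C sketch expresses that as a
 comparison of two snapshots.  struct fsys_snapshot and
 fsys_state_preserved() are hypothetical names introduced only for
 illustration; real handlers are written in assembly and no such helper
 exists in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the registers named in the exit-state list. */
struct fsys_snapshot {
    uint64_t r11;      /* saved ar.pfs */
    uint64_t r15;      /* system call number */
    uint64_t args[8];  /* r32-r39: system call arguments */
    uint64_t b6;       /* return address */
    uint64_t ar_pfs;   /* previous frame-state */
};

/* Returns true if every value that must survive the handler is unchanged. */
static bool fsys_state_preserved(const struct fsys_snapshot *before,
                                 const struct fsys_snapshot *after)
{
    assert(before != NULL && after != NULL);
    return before->r11 == after->r11 &&
           before->r15 == after->r15 &&
           before->b6 == after->b6 &&
           before->ar_pfs == after->ar_pfs &&
           memcmp(before->args, after->args, sizeof(before->args)) == 0;
}
```

 Note that r8 and r10 are deliberately absent from the snapshot: they
 carry the result back to user-level and are expected to change.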

 Fsyscall handlers can execute with very little overhead, but with that
 speed comes a set of restrictions:

  o Fsyscall-handlers MUST check for any pending work in the flags
    member of the thread-info structure and if any of the
    TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
    doing a full system call (by calling fsys_fallback_syscall).

  o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
    r15, b6, and ar.pfs) because they will be needed in case of a
    system call restart.  Of course, all "preserved" registers also
    must be preserved, in accordance with the normal calling conventions.

  o Fsyscall-handlers MUST check whether argument registers contain a
    NaT value before using them in any way that could trigger a
    NaT-consumption fault.  If a system call argument is found to
    contain a NaT value, an fsyscall-handler may return immediately
    with r8=EINVAL, r10=-1.

  o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
    any other operation that would trigger mandatory RSE
    (register-stack engine) traffic.

  o Fsyscall-handlers MUST NOT write to any stacked registers because
    it is not safe to assume that user-level called a handler with the
    proper number of arguments.

  o Fsyscall-handlers need to be careful when accessing per-CPU variables:
    unless proper safe-guards are taken (e.g., interruptions are avoided),
    execution may be pre-empted and resumed on another CPU at any given
    time.

  o Fsyscall-handlers must be careful not to leak sensitive kernel
    information back to user-level.  In particular, before returning to
    user-level, care needs to be taken to clear any scratch registers
    that could contain sensitive information (note that the current
    task pointer is not considered sensitive: it's already exposed
    through ar.k6).

  o Fsyscall-handlers MUST NOT access user-memory without first
    validating access-permission (this can be done typically via
    probe.r.fault and/or probe.w.fault) and without guarding against
    memory access exceptions (this can be done with the EX() macros
    defined by asmmacro.h).
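
 The first and third rules above describe decisions a handler makes
 before doing any real work.  The C sketch below models just those two
 decisions.  It is an illustration under stated assumptions, not kernel
 code: MODEL_TIF_ALLWORK_MASK is a made-up value, the enum and struct
 are hypothetical, and the order of the two checks is an arbitrary
 choice made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value; the real mask comes from the kernel headers. */
#define MODEL_TIF_ALLWORK_MASK 0x1fULL

enum fsys_outcome {
    FSYS_PROCEED,   /* safe to run the light-weight path */
    FSYS_FALLBACK,  /* pending work: do a full system call instead */
    FSYS_ERROR      /* return to user-level with an error */
};

struct fsys_result {
    enum fsys_outcome outcome;
    int64_t r8;   /* return value or error code */
    int64_t r10;  /* 0 on success, -1 on error */
};

/* Model of the up-front checks for a handler taking one argument. */
static struct fsys_result model_fsys_entry_checks(uint64_t thread_flags,
                                                  bool arg_is_nat)
{
    struct fsys_result res = { FSYS_PROCEED, 0, 0 };

    if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
        /* Corresponds to branching to fsys_fallback_syscall. */
        res.outcome = FSYS_FALLBACK;
        return res;
    }
    if (arg_is_nat) {
        /* A NaT argument may be rejected immediately. */
        res.outcome = FSYS_ERROR;
        res.r8 = EINVAL;
        res.r10 = -1;
        return res;
    }
    return res;
}
```

 Only when both checks pass does the model report FSYS_PROCEED, with
 r10 still at the value 0 that it has on entry to the handler.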

 The above restrictions may seem draconian, but remember that it's
 possible to trade off some of the restrictions by paying a slightly
 higher overhead.  For example, if an fsyscall-handler could benefit
 from the shadow register bank, it could temporarily disable PSR.i and
 PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
 needed.  In other words, following the above rules yields extremely
 fast system call execution (while fully preserving system call
 semantics), but there is also a lot of flexibility in handling more
 complicated cases.

 * Signal handling

 The delivery of (asynchronous) signals must be delayed until fsys-mode
 is exited.  This is accomplished with the help of the lower-privilege
 transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
 checks whether the interrupted task was in fsys-mode and, if so, sets
 PSR.lp and returns immediately.  When fsys-mode is exited via the
 "br.ret" instruction that lowers the privilege level, a trap will
 occur.  The trap handler clears PSR.lp again and returns immediately.
 The kernel exit path then checks for and delivers any pending signals.
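
 The deferral sequence can be pictured as a two-step state machine.
 The following C model is a sketch only: struct model_task and both
 functions are hypothetical, and the single "delivered" counter stands
 in for the kernel's real signal-delivery machinery.  The bit position
 used for PSR.lp (bit 25) is the architectural one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PSR_LP (1ULL << 25)  /* lower-privilege transfer trap */

/* Hypothetical task state, reduced to what the description needs. */
struct model_task {
    uint64_t psr;
    bool in_fsys_mode;
    bool signal_pending;
    unsigned int delivered;  /* number of signals delivered so far */
};

/* Models the check described for do_notify_resume_user(). */
static void model_notify_resume(struct model_task *t)
{
    if (t->in_fsys_mode) {
        t->psr |= MODEL_PSR_LP;  /* defer: trap when privilege drops */
        return;
    }
    if (t->signal_pending) {
        t->signal_pending = false;
        t->delivered++;
    }
}

/* Models the "br.ret" that leaves fsys-mode and the resulting trap. */
static void model_fsys_return(struct model_task *t)
{
    t->in_fsys_mode = false;
    if (t->psr & MODEL_PSR_LP) {
        t->psr &= ~MODEL_PSR_LP;  /* trap handler clears PSR.lp */
        model_notify_resume(t);   /* exit path delivers the signal */
    }
}
```

 In the model, a signal that arrives during fsys-mode is not delivered
 by the first model_notify_resume() call; it is delivered only after
 model_fsys_return() has run.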

 * PSR Handling

 The "epc" instruction doesn't change the contents of PSR at all.  This
 is in contrast to a regular interruption, which clears almost all
 bits.  Because of that, some care needs to be taken to ensure things
 work as expected.  The following discussion describes how each PSR bit
 is handled.

 PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used
 	to ensure the CPU is in little-endian mode before the first
 	load/store instruction is executed.  PSR.be is normally NOT
 	restored upon return from an fsys-mode handler.  In other
 	words, user-level code must not rely on PSR.be being preserved
 	across a system call.
 PSR.up	Unchanged.
 PSR.ac	Unchanged.
 PSR.mfl	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
 PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
 PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
 PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
 PSR.pk	Unchanged.
 PSR.dt	Unchanged.
 PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
 PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
 PSR.sp	Unchanged.
 PSR.pp	Unchanged.
 PSR.di	Unchanged.
 PSR.si	Unchanged.
 PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware
 	breakpoint that triggers at any privilege level other than 3 (user-mode).
 PSR.lp	Unchanged.
 PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in
 	fsys-mode, the trap-handler modifies the saved machine state
 	such that execution resumes in the gate page at
 	syscall_via_break(), with privilege level 3.  Note: the
 	taken branch would occur on the branch invoking the
 	fsyscall-handler, at which point, by definition, a syscall
 	restart is still safe.  If the system call number is invalid,
 	the fsys-mode handler will return directly to user-level.  This
 	return will trigger a taken-branch trap, but since the trap is
 	taken _after_ restoring the privilege level, the CPU has already
 	left fsys-mode, so no special treatment is needed.
 PSR.rt	Unchanged.
 PSR.cpl	Cleared to 0.
 PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page).
 PSR.mc	Unchanged.
 PSR.it	Unchanged (guaranteed to be 1).
 PSR.id	Unchanged.  Note: the ia64 linux kernel never sets this bit.
 PSR.da	Unchanged.  Note: the ia64 linux kernel never sets this bit.
 PSR.dd	Unchanged.  Note: the ia64 linux kernel never sets this bit.
 PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to
 	be taken.  The trap handler then modifies the saved machine
 	state such that execution resumes in the gate page at
 	syscall_via_break(), with privilege level 3.
 PSR.ri	Unchanged.
 PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode
 	handler performed a speculative load that gets NaTted.  If so, this
 	would be the normal & expected behavior, so no special treatment is
 	needed.
 PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
 	Doing so requires clearing PSR.i and PSR.ic as well.
 PSR.ia	Unchanged.  Note: the ia64 linux kernel never sets this bit.
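
 As a compact summary of the table, the C sketch below applies its net
 effect to a PSR value: only PSR.be and PSR.cpl differ on entry to a
 handler, while PSR.ss and PSR.tb mark the cases that are lazily
 redirected to syscall_via_break().  The MODEL_* names and both
 functions are hypothetical; the bit positions are the architectural
 ones (be = bit 1, tb = bit 26, cpl = bits 32-33, ss = bit 40).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PSR_BE  (1ULL << 1)   /* big-endian byte order */
#define MODEL_PSR_TB  (1ULL << 26)  /* taken-branch trap */
#define MODEL_PSR_CPL (3ULL << 32)  /* current privilege level */
#define MODEL_PSR_SS  (1ULL << 40)  /* single-step trap */

/* PSR as seen by an fsyscall handler, given the user-level PSR. */
static uint64_t model_psr_on_fsys_entry(uint64_t user_psr)
{
    /* be is cleared, cpl is cleared to 0, everything else is unchanged. */
    return user_psr & ~(MODEL_PSR_BE | MODEL_PSR_CPL);
}

/* True if the table lists a "lazy redirect" for this PSR value. */
static bool model_psr_lazy_redirect(uint64_t user_psr)
{
    return (user_psr & (MODEL_PSR_SS | MODEL_PSR_TB)) != 0;
}
```

 For example, a user-level PSR with be, i (bit 14) and cpl=3 set comes
 out of model_psr_on_fsys_entry() with only the i bit remaining.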

 * Using fast system calls

 To use fast system calls, userspace applications simply need to call
 __kernel_syscall_via_epc().  For example:

 -- example fgettimeofday() call --
 -- fgettimeofday.S --

 #include <asm/asmmacro.h>

 GLOBAL_ENTRY(fgettimeofday)
 .prologue
 .save ar.pfs, r11
 mov r11 = ar.pfs
 .body

 movl r2 = 0xa000000000020660;;  // gate address
 			       // found by inspection of System.map for the
 			       // __kernel_syscall_via_epc() function.  See
 			       // below for how to do this for real.

 mov b7 = r2
 mov r15 = 1087		       // gettimeofday syscall
 ;;
 br.call.sptk.many b6 = b7
 ;;

 .restore sp

 mov ar.pfs = r11
 br.ret.sptk.many rp;;	      // return to caller
 END(fgettimeofday)

 -- end fgettimeofday.S --

 In reality, getting the gate address is accomplished by two extra
 values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

  o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
  o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

 The ELF DSO is a pre-linked library that is mapped in by the kernel at
 the gate page.  It is a proper ELF shared object so, with a dynamic
 loader that recognises the library, you should be able to make calls to
 the exported functions within it as with any other shared library.
 AT_SYSINFO points into the kernel DSO at the
 __kernel_syscall_via_epc() function for historical reasons (it was
 used before the kernel DSO) and as a convenience.
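
 From C, the two auxiliary-vector values can be read without walking
 the vector by hand.  The sketch below assumes a C library that
 provides getauxval() in <sys/auxv.h> (glibc 2.16 or later, for
 example), which postdates this document; struct gate_info and the two
 helper functions are hypothetical names.  getauxval() returns 0 for an
 entry the kernel did not supply, so both fields may be 0 on other
 architectures.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical container for the two values described above. */
struct gate_info {
    unsigned long sysinfo;       /* AT_SYSINFO, 0 if not supplied */
    unsigned long sysinfo_ehdr;  /* AT_SYSINFO_EHDR, 0 if not supplied */
};

/* Read both entries from the ELF auxiliary vector. */
static struct gate_info lookup_gate(void)
{
    struct gate_info info;

    info.sysinfo = getauxval(AT_SYSINFO);
    info.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
    return info;
}

/* Check that AT_SYSINFO_EHDR points at something with an ELF magic. */
static bool gate_dso_looks_like_elf(const struct gate_info *info)
{
    assert(info != NULL);
    if (info->sysinfo_ehdr == 0)
        return false;
    return memcmp((const void *)info->sysinfo_ehdr, ELFMAG, SELFMAG) == 0;
}
```

 A caller would treat info.sysinfo as the entry point to branch to, as
 the assembly example does with its hard-coded address, and would fall
 back to the ordinary system call path when the value is 0.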