|  | ============================== | 
|  | UNEVICTABLE LRU INFRASTRUCTURE | 
|  | ============================== | 
|  |  | 
|  | ======== | 
|  | CONTENTS | 
|  | ======== | 
|  |  | 
|  | (*) The Unevictable LRU | 
|  |  | 
|  | - The unevictable page list. | 
|  | - Memory control group interaction. | 
|  | - Marking address spaces unevictable. | 
|  | - Detecting Unevictable Pages. | 
|  | - vmscan's handling of unevictable pages. | 
|  |  | 
|  | (*) mlock()'d pages. | 
|  |  | 
|  | - History. | 
|  | - Basic management. | 
|  | - mlock()/mlockall() system call handling. | 
|  | - Filtering special vmas. | 
|  | - munlock()/munlockall() system call handling. | 
|  | - Migrating mlocked pages. | 
|  | - Compacting mlocked pages. | 
|  | - mmap(MAP_LOCKED) system call handling. | 
|  | - munmap()/exit()/exec() system call handling. | 
|  | - try_to_unmap(). | 
|  | - try_to_munlock() reverse map scan. | 
|  | - Page reclaim in shrink_*_list(). | 
|  |  | 
|  |  | 
|  | ============ | 
|  | INTRODUCTION | 
|  | ============ | 
|  |  | 
|  | This document describes the Linux memory manager's "Unevictable LRU" | 
|  | infrastructure and the use of this to manage several types of "unevictable" | 
|  | pages. | 
|  |  | 
|  | The document attempts to provide the overall rationale behind this mechanism | 
|  | and the rationale for some of the design decisions that drove the | 
|  | implementation.  The latter design rationale is discussed in the context of an | 
|  | implementation description.  Admittedly, one can obtain the implementation | 
|  | details - the "what does it do?" - by reading the code.  One hopes that the | 
|  | descriptions below add value by providing the answer to "why does it do that?". | 
|  |  | 
|  |  | 
|  | =================== | 
|  | THE UNEVICTABLE LRU | 
|  | =================== | 
|  |  | 
|  | The Unevictable LRU facility adds an additional LRU list to track unevictable | 
|  | pages and to hide these pages from vmscan.  This mechanism is based on a patch | 
|  | by Larry Woodman of Red Hat to address several scalability problems with page | 
|  | reclaim in Linux.  The problems have been observed at customer sites on large | 
|  | memory x86_64 systems. | 
|  |  | 
|  | To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of | 
|  | main memory will have over 32 million 4k pages in a single zone.  When a large | 
|  | fraction of these pages are not evictable for any reason [see below], vmscan | 
|  | will spend a lot of time scanning the LRU lists looking for the small fraction | 
|  | of pages that are evictable.  This can result in a situation where all CPUs are | 
|  | spending 100% of their time in vmscan for hours or days on end, with the system | 
|  | completely unresponsive. | 
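
The arithmetic behind the example above can be checked with a small
user-space sketch.  The helper names (toy_page_count, toy_worst_case_scan)
are hypothetical and exist only for this illustration; the second helper
assumes the pessimistic case in which every evictable page sits at the far
end of a single list.

```c
#include <stdint.h>

/* Number of whole pages of size page_size that fit in mem_bytes. */
uint64_t toy_page_count(uint64_t mem_bytes, uint64_t page_size)
{
	return page_size ? mem_bytes / page_size : 0;
}

/*
 * Pages a linear scan must examine before it reaches the first
 * evictable page, assuming (pessimistically) that all evictable
 * pages sit at the far end of the list and evictable <= total_pages.
 */
uint64_t toy_worst_case_scan(uint64_t total_pages, uint64_t evictable)
{
	return evictable ? total_pages - evictable + 1 : total_pages;
}
```

With 128GB of memory and 4k pages, toy_page_count(128ULL << 30, 4096)
yields 33554432, which is the "over 32 million" figure quoted above.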
|  |  | 
|  | The unevictable list addresses the following classes of unevictable pages: | 
|  |  | 
|  | (*) Those owned by ramfs. | 
|  |  | 
|  | (*) Those mapped into SHM_LOCK'd shared memory regions. | 
|  |  | 
|  | (*) Those mapped into VM_LOCKED [mlock()ed] VMAs. | 
|  |  | 
|  | The infrastructure may also be able to handle other conditions that make pages | 
|  | unevictable, either by definition or by circumstance, in the future. | 
|  |  | 
|  |  | 
|  | THE UNEVICTABLE PAGE LIST | 
|  | ------------------------- | 
|  |  | 
|  | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | 
|  | called the "unevictable" list and an associated page flag, PG_unevictable, to | 
|  | indicate that the page is being managed on the unevictable list. | 
|  |  | 
|  | The PG_unevictable flag is analogous to, and mutually exclusive with, the | 
|  | PG_active flag in that it indicates on which LRU list a page resides when | 
|  | PG_lru is set. | 
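
The relationship between the three flags can be sketched as a lookup from
flag state to list index.  This is a simplified user-space model, not the
kernel's code: the toy_* names are hypothetical, and the model collapses
the anonymous/file split into a single active and a single inactive list.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's LRU list indices. */
enum toy_lru_list {
	TOY_LRU_NONE = -1,	/* PG_lru clear: page is on no LRU list */
	TOY_LRU_INACTIVE,
	TOY_LRU_ACTIVE,
	TOY_LRU_UNEVICTABLE,
	TOY_NR_LRU_LISTS,
};

struct toy_page {
	bool lru;		/* models PG_lru */
	bool active;		/* models PG_active */
	bool unevictable;	/* models PG_unevictable */
};

/* PG_active and PG_unevictable must never both be set. */
bool toy_flags_valid(const struct toy_page *page)
{
	return !(page->active && page->unevictable);
}

/* Which list the page is on, as implied by its flags. */
enum toy_lru_list toy_page_lru(const struct toy_page *page)
{
	if (!page->lru)
		return TOY_LRU_NONE;
	if (page->unevictable)
		return TOY_LRU_UNEVICTABLE;
	return page->active ? TOY_LRU_ACTIVE : TOY_LRU_INACTIVE;
}
```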
|  |  | 
|  | The Unevictable LRU infrastructure maintains unevictable pages on an additional | 
|  | LRU list for a few reasons: | 
|  |  | 
|  | (1) We get to "treat unevictable pages just like we treat other pages in the | 
|  | system - which means we get to use the same code to manipulate them, the | 
|  | same code to isolate them (for migrate, etc.), the same code to keep track | 
|  | of the statistics, etc..." [Rik van Riel] | 
|  |  | 
|  | (2) We want to be able to migrate unevictable pages between nodes for memory | 
|  | defragmentation, workload management and memory hotplug.  The Linux kernel | 
|  | can only migrate pages that it can successfully isolate from the LRU | 
|  | lists.  If we were to maintain pages elsewhere than on an LRU-like list, | 
|  | where they can be found by isolate_lru_page(), we would prevent their | 
|  | migration, unless we reworked migration code to find the unevictable pages | 
|  | itself. | 
|  |  | 
|  |  | 
|  | The unevictable list does not differentiate between file-backed and anonymous, | 
|  | swap-backed pages.  This differentiation is only important while the pages are, | 
|  | in fact, evictable. | 
|  |  | 
|  | The unevictable list benefits from the "arrayification" of the per-zone LRU | 
|  | lists and statistics originally proposed and posted by Christoph Lameter. | 
|  |  | 
|  | The unevictable list does not use the LRU pagevec mechanism. Rather, | 
|  | unevictable pages are placed directly on the page's zone's unevictable list | 
|  | under the zone lru_lock.  This allows us to prevent the stranding of pages on | 
|  | the unevictable list when one task has the page isolated from the LRU and other | 
|  | tasks are changing the "evictability" state of the page. | 
|  |  | 
|  |  | 
|  | MEMORY CONTROL GROUP INTERACTION | 
|  | -------------------------------- | 
|  |  | 
|  | The unevictable LRU facility interacts with the memory control group [aka | 
|  | memory controller; see Documentation/cgroups/memory.txt] by extending the | 
|  | lru_list enum. | 
|  |  | 
|  | The memory controller data structure automatically gets a per-zone unevictable | 
|  | list as a result of the "arrayification" of the per-zone LRU lists (one per | 
|  | lru_list enum element).  The memory controller tracks the movement of pages to | 
|  | and from the unevictable list. | 
|  |  | 
|  | When a memory control group comes under memory pressure, the controller will | 
|  | not attempt to reclaim pages on the unevictable list.  This has a couple of | 
|  | effects: | 
|  |  | 
|  | (1) Because the pages are "hidden" from reclaim on the unevictable list, the | 
|  | reclaim process can be more efficient, dealing only with pages that have a | 
|  | chance of being reclaimed. | 
|  |  | 
|  | (2) On the other hand, if too many of the pages charged to the control group | 
|  | are unevictable, the evictable portion of the working set of the tasks in | 
|  | the control group may not fit into the available memory.  This can cause | 
|  | the control group to thrash or to OOM-kill tasks. | 
|  |  | 
|  |  | 
|  | MARKING ADDRESS SPACES UNEVICTABLE | 
|  | ---------------------------------- | 
|  |  | 
|  | For facilities such as ramfs none of the pages attached to the address space | 
|  | may be evicted.  To prevent eviction of any such pages, the AS_UNEVICTABLE | 
|  | address space flag is provided, and this can be manipulated by a filesystem | 
|  | using a number of wrapper functions: | 
|  |  | 
|  | (*) void mapping_set_unevictable(struct address_space *mapping); | 
|  |  | 
|  | Mark the address space as being completely unevictable. | 
|  |  | 
|  | (*) void mapping_clear_unevictable(struct address_space *mapping); | 
|  |  | 
|  | Mark the address space as being evictable. | 
|  |  | 
|  | (*) int mapping_unevictable(struct address_space *mapping); | 
|  |  | 
|  | Query the address space, and return true if it is completely | 
|  | unevictable. | 
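
As a rough illustration of what these wrappers amount to, the following
user-space sketch models the flag as a single bit in a flags word.  The
toy_* names and the bit position are assumptions made for this example;
the sketch uses plain, non-atomic bit operations, and treating a NULL
mapping as evictable is likewise a choice made here.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_AS_UNEVICTABLE	(1UL << 0)	/* bit position is an assumption */

struct toy_address_space {
	unsigned long flags;
};

/* Mark the whole address space unevictable. */
void toy_mapping_set_unevictable(struct toy_address_space *mapping)
{
	mapping->flags |= TOY_AS_UNEVICTABLE;
}

/* Mark the address space evictable again. */
void toy_mapping_clear_unevictable(struct toy_address_space *mapping)
{
	mapping->flags &= ~TOY_AS_UNEVICTABLE;
}

/* Query the flag; a NULL mapping is treated as evictable in this model. */
bool toy_mapping_unevictable(const struct toy_address_space *mapping)
{
	return mapping != NULL && (mapping->flags & TOY_AS_UNEVICTABLE) != 0;
}
```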
|  |  | 
|  | These are currently used in two places in the kernel: | 
|  |  | 
|  | (1) By ramfs to mark the address spaces of its inodes when they are created, | 
|  | and this mark remains for the life of the inode. | 
|  |  | 
|  | (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. | 
|  |  | 
|  | Note that SHM_LOCK is not required to page in the locked pages if they're | 
|  | swapped out; the application must touch the pages manually if it wants to | 
|  | ensure they're in memory. | 
|  |  | 
|  |  | 
|  | DETECTING UNEVICTABLE PAGES | 
|  | --------------------------- | 
|  |  | 
|  | The function page_evictable() in vmscan.c determines whether a page is | 
|  | evictable or not using the query function outlined above [see section "Marking | 
|  | address spaces unevictable"] to check the AS_UNEVICTABLE flag. | 
|  |  | 
|  | For address spaces that are so marked after being populated (as SHM regions | 
|  | might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate | 
|  | the page tables for the region as does, for example, mlock(), nor need it make | 
|  | any special effort to push any pages in the SHM_LOCK'd area to the unevictable | 
|  | list.  Instead, vmscan will do this if and when it encounters the pages during | 
|  | a reclamation scan. | 
|  |  | 
|  | On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan | 
|  | the pages in the region and "rescue" them from the unevictable list if no other | 
|  | condition is keeping them unevictable.  If an unevictable region is destroyed, | 
|  | the pages are also "rescued" from the unevictable list in the process of | 
|  | freeing them. | 
|  |  | 
|  | page_evictable() also checks for mlocked pages by testing an additional page | 
|  | flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is | 
|  | faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. | 
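
Taken together, the two checks described in this section can be modelled
as follows.  This is a user-space sketch under stated assumptions: the
toy_* types are hypothetical, and the two conditions modelled (the address
space flag and the mlocked flag) are the only ones this document names.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping {
	bool unevictable;	/* models AS_UNEVICTABLE */
};

struct toy_page {
	struct toy_mapping *mapping;	/* NULL if the page has no mapping */
	bool mlocked;			/* models PG_mlocked */
};

/* A page is evictable only if neither condition pins it. */
bool toy_page_evictable(const struct toy_page *page)
{
	if (page->mapping && page->mapping->unevictable)
		return false;
	return !page->mlocked;
}
```

A page in a ramfs-like mapping is reported unevictable whether or not it is
mlocked, and an mlocked page is reported unevictable whatever its mapping.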
|  |  | 
|  |  | 
|  | VMSCAN'S HANDLING OF UNEVICTABLE PAGES | 
|  | -------------------------------------- | 
|  |  | 
|  | If unevictable pages are culled in the fault path, or moved to the unevictable | 
|  | list at mlock() or mmap() time, vmscan will not encounter the pages until they | 
|  | have become evictable again (via munlock() for example) and have been "rescued" | 
|  | from the unevictable list.  However, there may be situations where we decide, | 
|  | for the sake of expediency, to leave an unevictable page on one of the regular | 
|  | active/inactive LRU lists for vmscan to deal with.  vmscan checks for such | 
|  | pages in all of the shrink_{active|inactive|page}_list() functions and will | 
|  | "cull" such pages that it encounters: that is, it diverts those pages to the | 
|  | unevictable list for the zone being scanned. | 
|  |  | 
|  | There may be situations where a page is mapped into a VM_LOCKED VMA, but the | 
|  | page is not marked as PG_mlocked.  Such pages will make it all the way to | 
|  | shrink_page_list() where they will be detected when vmscan walks the reverse | 
|  | map in try_to_unmap().  If try_to_unmap() returns SWAP_MLOCK, | 
|  | shrink_page_list() will cull the page at that point. | 
|  |  | 
|  | To "cull" an unevictable page, vmscan simply puts the page back on the LRU list | 
|  | using putback_lru_page() - the inverse operation to isolate_lru_page() - after | 
|  | dropping the page lock.  Because the condition which makes the page unevictable | 
|  | may change once the page is unlocked, putback_lru_page() will recheck the | 
|  | unevictable state of a page that it places on the unevictable list.  If the | 
|  | page has become evictable, putback_lru_page() removes it from the list and | 
|  | retries, including the page_evictable() test.  Because such a race is a rare | 
|  | event and movement of pages onto the unevictable list should be rare, these | 
|  | extra evictability checks should not occur in the majority of calls to | 
|  | putback_lru_page(). | 
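
The recheck-and-retry behaviour can be sketched in a single-threaded model.
Everything here is hypothetical scaffolding: the toy_* names, the two-list
model, and in particular the "race" hook, which stands in for another task
changing the page's state just after the page has been placed on a list.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_list { TOY_NONE, TOY_EVICTABLE_LRU, TOY_UNEVICTABLE_LRU };

struct toy_page {
	bool mlocked;		/* the only unevictability condition modelled */
	enum toy_list list;
	/* Optional hook run after list placement; simulates a racing task. */
	void (*race)(struct toy_page *page);
};

static bool toy_page_evictable(const struct toy_page *page)
{
	return !page->mlocked;
}

/* Simulated racing munlock: fires once, then disarms itself. */
void toy_race_munlock(struct toy_page *page)
{
	page->mlocked = false;
	page->race = NULL;
}

void toy_putback_lru_page(struct toy_page *page)
{
	for (;;) {
		bool evictable = toy_page_evictable(page);

		page->list = evictable ? TOY_EVICTABLE_LRU : TOY_UNEVICTABLE_LRU;
		if (page->race)
			page->race(page);

		/* Only a page put on the unevictable list needs the recheck. */
		if (evictable || !toy_page_evictable(page))
			return;

		/* It became evictable meanwhile: take it off and retry. */
		page->list = TOY_NONE;
	}
}
```

In this model a page that is munlocked "concurrently" ends up on the
evictable list rather than being stranded on the unevictable list.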
|  |  | 
|  |  | 
|  | ============= | 
|  | MLOCKED PAGES | 
|  | ============= | 
|  |  | 
|  | The unevictable page list is also useful for mlock(), in addition to ramfs and | 
|  | SYSV SHM.  Note that mlock() is only available in CONFIG_MMU=y situations; in | 
|  | NOMMU situations, all mappings are effectively mlocked. | 
|  |  | 
|  |  | 
|  | HISTORY | 
|  | ------- | 
|  |  | 
|  | The "Unevictable mlocked Pages" infrastructure is based on work originally | 
|  | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | 
|  | Nick posted his patch as an alternative to a patch posted by Christoph Lameter | 
|  | to achieve the same objective: hiding mlocked pages from vmscan. | 
|  |  | 
|  | In Nick's patch, he used one of the struct page LRU list link fields as a count | 
|  | of VM_LOCKED VMAs that map the page.  This use of the link field for a count | 
|  | prevented the management of the pages on an LRU list, and thus mlocked pages | 
|  | were not migratable as isolate_lru_page() could not find them, and the LRU list | 
|  | link field was not available to the migration subsystem. | 
|  |  | 
|  | Nick resolved this by putting mlocked pages back on the lru list before | 
|  | attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs.  When | 
|  | Nick's patch was integrated with the Unevictable LRU work, the count was | 
|  | replaced by walking the reverse map to determine whether any VM_LOCKED VMAs | 
|  | mapped the page.  More on this below. | 
|  |  | 
|  |  | 
|  | BASIC MANAGEMENT | 
|  | ---------------- | 
|  |  | 
|  | mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable | 
|  | pages.  When such a page has been "noticed" by the memory management subsystem, | 
|  | the page is marked with the PG_mlocked flag.  This can be manipulated using the | 
|  | PageMlocked() functions. | 
|  |  | 
|  | A PG_mlocked page will be placed on the unevictable list when it is added to | 
|  | the LRU.  Such pages can be "noticed" by memory management in several places: | 
|  |  | 
|  | (1) in the mlock()/mlockall() system call handlers; | 
|  |  | 
|  | (2) in the mmap() system call handler when mmapping a region with the | 
|  | MAP_LOCKED flag; | 
|  |  | 
|  | (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE | 
|  | flag; | 
|  |  | 
|  | (4) in the fault path, if mlocked pages are "culled" in the fault path, | 
|  | and when a VM_LOCKED stack segment is expanded; or | 
|  |  | 
|  | (5) as mentioned above, in vmscan:shrink_page_list() when attempting to | 
|  | reclaim a page in a VM_LOCKED VMA via try_to_unmap() | 
|  |  | 
|  | all of which result in the VM_LOCKED flag being set for the VMA if it doesn't | 
|  | already have it set. | 
|  |  | 
|  | mlocked pages become unlocked and rescued from the unevictable list when: | 
|  |  | 
|  | (1) mapped in a range unlocked via the munlock()/munlockall() system calls; | 
|  |  | 
|  | (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including | 
|  | unmapping at task exit; | 
|  |  | 
|  | (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file; | 
|  | or | 
|  |  | 
|  | (4) before a page is COW'd in a VM_LOCKED VMA. | 
|  |  | 
|  |  | 
|  | mlock()/mlockall() SYSTEM CALL HANDLING | 
|  | --------------------------------------- | 
|  |  | 
|  | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | 
|  | for each VMA in the range specified by the call.  In the case of mlockall(), | 
|  | this is the entire active address space of the task.  Note that mlock_fixup() | 
|  | is used for both mlocking and munlocking a range of memory.  A call to mlock() | 
|  | an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is | 
|  | treated as a no-op, and mlock_fixup() simply returns. | 
|  |  | 
|  | If the VMA passes some filtering as described in "Filtering Special Vmas" | 
|  | below, mlock_fixup() will attempt to merge the VMA with its neighbors or split | 
|  | off a subset of the VMA if the range does not cover the entire VMA.  Once the | 
|  | VMA has been merged or split or neither, mlock_fixup() will call | 
|  | populate_vma_page_range() to fault in the pages via get_user_pages() and to | 
|  | mark the pages as mlocked via mlock_vma_page(). | 
|  |  | 
|  | Note that the VMA being mlocked might be mapped with PROT_NONE.  In this case, | 
|  | get_user_pages() will be unable to fault in the pages.  That's okay.  If pages | 
|  | do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the | 
|  | fault path or in vmscan. | 
|  |  | 
|  | Also note that a page returned by get_user_pages() could be truncated or | 
|  | migrated out from under us, while we're trying to mlock it.  To detect this, | 
|  | populate_vma_page_range() checks page_mapping() after acquiring the page lock. | 
|  | If the page is still associated with its mapping, we'll go ahead and call | 
|  | mlock_vma_page().  If the mapping is gone, we just unlock the page and move on. | 
|  | In the worst case, this will result in a page mapped in a VM_LOCKED VMA | 
|  | remaining on a normal LRU list without being PageMlocked().  Again, vmscan will | 
|  | detect and cull such pages. | 
|  |  | 
|  | mlock_vma_page() will call TestSetPageMlocked() for each page returned by | 
|  | get_user_pages().  We use TestSetPageMlocked() because the page might already | 
|  | be mlocked by another task/VMA and we don't want to do extra work.  We | 
|  | especially do not want to count an mlocked page more than once in the | 
|  | statistics.  If the page was already mlocked, mlock_vma_page() need do nothing | 
|  | more. | 
|  |  | 
|  | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | 
|  | page from the LRU, as it is likely on the appropriate active or inactive list | 
|  | at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put | 
|  | back the page - by calling putback_lru_page() - which will notice that the page | 
|  | is now mlocked and divert the page to the zone's unevictable list.  If | 
|  | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | 
|  | it later if and when it attempts to reclaim the page. | 
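
The reason for using a test-and-set operation, namely that the statistics
must count each mlocked page only once, can be illustrated with C11 atomics
in user space.  The toy_* names, the bit position and the global counter
are assumptions made for this sketch; the LRU isolate/putback step is left
out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PG_MLOCKED	(1UL << 0)	/* bit position is an assumption */

struct toy_page {
	atomic_ulong flags;
};

static atomic_long toy_nr_mlock;	/* models the zone's mlocked-page count */

/* Atomically set the mlocked bit; report whether it was already set. */
bool toy_test_set_page_mlocked(struct toy_page *page)
{
	return (atomic_fetch_or(&page->flags, TOY_PG_MLOCKED) & TOY_PG_MLOCKED) != 0;
}

/* Returns true if this call mlocked the page, false if it already was. */
bool toy_mlock_vma_page(struct toy_page *page)
{
	if (toy_test_set_page_mlocked(page))
		return false;		/* someone else got there first */

	atomic_fetch_add(&toy_nr_mlock, 1);	/* counted exactly once */
	/* The kernel would now try to isolate and put back the page. */
	return true;
}

long toy_mlocked_pages(void)
{
	return atomic_load(&toy_nr_mlock);
}
```

Calling toy_mlock_vma_page() twice on the same page bumps the counter only
on the first call.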
|  |  | 
|  |  | 
|  | FILTERING SPECIAL VMAS | 
|  | ---------------------- | 
|  |  | 
|  | mlock_fixup() filters several classes of "special" VMAs: | 
|  |  | 
|  | 1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely.  The pages behind | 
|  | these mappings are inherently pinned, so we don't need to mark them as | 
|  | mlocked.  In any case, most of the pages have no struct page in which to so | 
|  | mark the page.  Because of this, get_user_pages() will fail for these VMAs, | 
|  | so there is no sense in attempting to visit them. | 
|  |  | 
|  | 2) VMAs mapping hugetlbfs pages are already effectively pinned into memory.  We | 
|  | neither need nor want to mlock() these pages.  However, to preserve the | 
|  | prior behavior of mlock() - before the unevictable/mlock changes - | 
|  | mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to | 
|  | allocate the huge pages and populate the ptes. | 
|  |  | 
|  | 3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, | 
|  | such as the VDSO page, relay channel pages, etc. These pages | 
|  | are inherently unevictable and are not managed on the LRU lists. | 
|  | mlock_fixup() treats these VMAs the same as hugetlbfs VMAs.  It calls | 
|  | make_pages_present() to populate the ptes. | 
|  |  | 
|  | Note that for all of these special VMAs, mlock_fixup() does not set the | 
|  | VM_LOCKED flag.  Therefore, we won't have to deal with them later during | 
|  | munlock(), munmap() or task exit.  Neither does mlock_fixup() account these | 
|  | VMAs against the task's "locked_vm". | 
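
The filtering rules above can be condensed into a small decision function.
This is a user-space sketch: the TOY_VM_* values are hypothetical and do
not match the kernel's real VM_* bits, and hugetlbfs membership is modelled
here as just another flag.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration only. */
#define TOY_VM_LOCKED		(1UL << 0)
#define TOY_VM_IO		(1UL << 1)
#define TOY_VM_PFNMAP		(1UL << 2)
#define TOY_VM_HUGETLB		(1UL << 3)
#define TOY_VM_DONTEXPAND	(1UL << 4)

enum toy_mlock_action {
	TOY_SKIP,		/* class 1: do not visit the VMA at all */
	TOY_POPULATE_ONLY,	/* classes 2 and 3: fault in, no VM_LOCKED */
	TOY_MLOCK,		/* ordinary VMA: set VM_LOCKED, mlock pages */
};

enum toy_mlock_action toy_filter_vma(unsigned long vm_flags)
{
	if (vm_flags & (TOY_VM_IO | TOY_VM_PFNMAP))
		return TOY_SKIP;
	if (vm_flags & (TOY_VM_HUGETLB | TOY_VM_DONTEXPAND))
		return TOY_POPULATE_ONLY;
	return TOY_MLOCK;
}

/*
 * Returns the VMA's new flags.  Only ordinary VMAs gain VM_LOCKED and
 * are charged to the task's locked_vm.
 */
unsigned long toy_mlock_fixup(unsigned long vm_flags, unsigned long nr_pages,
			      unsigned long *locked_vm)
{
	if (vm_flags & TOY_VM_LOCKED)
		return vm_flags;	/* already locked: no-op */
	if (toy_filter_vma(vm_flags) != TOY_MLOCK)
		return vm_flags;	/* special VMA: flags untouched */

	*locked_vm += nr_pages;
	return vm_flags | TOY_VM_LOCKED;
}
```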
|  |  | 
|  |  | 
|  | munlock()/munlockall() SYSTEM CALL HANDLING | 
|  | ------------------------------------------- | 
|  |  | 
|  | The munlock() and munlockall() system calls are handled by the same functions - | 
|  | do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs | 
|  | lock operation indicated by an argument.  So, these system calls are also | 
|  | handled by mlock_fixup().  Again, if called for an already munlocked VMA, | 
|  | mlock_fixup() simply returns.  Because of the VMA filtering discussed above, | 
|  | VM_LOCKED will not be set in any "special" VMAs.  So, these VMAs will be | 
|  | ignored for munlock. | 
|  |  | 
|  | If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the | 
|  | specified range.  The range is then munlocked via the function | 
|  | populate_vma_page_range() - the same function used to mlock a VMA range - | 
|  | passing a flag to indicate that munlock() is being performed. | 
|  |  | 
|  | Because the VMA access protections could have been changed to PROT_NONE after | 
|  | faulting in and mlocking pages, get_user_pages() was unreliable for visiting | 
|  | these pages for munlocking.  Because we don't want to leave pages mlocked, | 
|  | get_user_pages() was enhanced to accept a flag to ignore the permissions when | 
|  | fetching the pages - all of which should be resident as a result of previous | 
|  | mlocking. | 
|  |  | 
|  | For munlock(), populate_vma_page_range() unlocks individual pages by calling | 
|  | munlock_vma_page().  munlock_vma_page() unconditionally clears the PG_mlocked | 
|  | flag using TestClearPageMlocked().  As with mlock_vma_page(), | 
|  | munlock_vma_page() uses the Test*PageMlocked() function to handle the case where | 
|  | the page might have already been unlocked by another task.  If the page was | 
|  | mlocked, munlock_vma_page() updates the zone statistics for the number of | 
|  | mlocked pages.  Note, however, that at this point we haven't checked whether | 
|  | the page is mapped by other VM_LOCKED VMAs. | 
|  |  | 
|  | We can't call try_to_munlock(), the function that walks the reverse map to | 
|  | check for other VM_LOCKED VMAs, without first isolating the page from the LRU. | 
|  | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | 
|  | not be on an LRU list [more on these below].  However, the call to | 
|  | isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So, | 
|  | we go ahead and clear PG_mlocked up front, as this might be the only chance we | 
|  | have.  If we can successfully isolate the page, we go ahead and | 
|  | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | 
|  | page statistics if it finds another VMA holding the page mlocked.  If we fail | 
|  | to isolate the page, we'll have left a potentially mlocked page on the LRU. | 
|  | This is fine, because we'll catch it later if and when vmscan tries to reclaim | 
|  | the page.  This should be relatively rare. | 
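
The ordering described above (clear the flag first, then try to isolate,
then consult the reverse map) can be sketched as a user-space model.  The
toy_* names are hypothetical; whether isolation succeeds and how many other
VM_LOCKED VMAs map the page are simply fields of the model rather than
anything computed.

```c
#include <stdbool.h>

struct toy_page {
	bool mlocked;			/* models PG_mlocked */
	bool isolatable;		/* would isolating from the LRU succeed? */
	unsigned int other_locked_vmas;	/* VM_LOCKED VMAs still mapping it */
};

long toy_nr_mlock;	/* models the zone's mlocked-page statistic */

/* Stand-in for the reverse map walk done by try_to_munlock(). */
static bool toy_other_vma_holds_lock(const struct toy_page *page)
{
	return page->other_locked_vmas > 0;
}

void toy_munlock_vma_page(struct toy_page *page)
{
	bool was_mlocked = page->mlocked;

	page->mlocked = false;		/* clear up front: may be the only chance */
	if (!was_mlocked)
		return;			/* another task already unlocked it */
	toy_nr_mlock--;

	if (!page->isolatable)
		return;			/* left on the LRU for vmscan to sort out */

	if (toy_other_vma_holds_lock(page)) {
		page->mlocked = true;	/* still pinned by another VMA */
		toy_nr_mlock++;
	}
}
```

The third case in the model, a page that cannot be isolated but is still
mapped by a VM_LOCKED VMA, ends up with the flag clear; that is the
"potentially mlocked page on the LRU" that vmscan is expected to catch.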
|  |  | 
|  |  | 
|  | MIGRATING MLOCKED PAGES | 
|  | ----------------------- | 
|  |  | 
|  | A page that is being migrated has been isolated from the LRU lists and is held | 
|  | locked across unmapping of the page, updating the page's address space entry | 
|  | and copying the contents and state, until the page table entry has been | 
|  | replaced with an entry that refers to the new page.  Linux supports migration | 
|  | of mlocked pages and other unevictable pages.  This involves simply moving the | 
|  | PG_mlocked and PG_unevictable states from the old page to the new page. | 
|  |  | 
|  | Note that page migration can race with mlocking or munlocking of the same page. | 
|  | This has been discussed from the mlock/munlock perspective in the respective | 
|  | sections above.  Both processes (migration and m[un]locking) hold the page | 
|  | locked.  This provides the first level of synchronization.  Page migration | 
|  | zeros out the page_mapping of the old page before unlocking it, so m[un]lock | 
|  | can skip these pages by testing the page mapping under page lock. | 
|  |  | 
|  | To complete page migration, we place the new and old pages back onto the LRU | 
|  | after dropping the page lock.  The "unneeded" page - old page on success, new | 
|  | page on failure - will be freed when the reference count held by the migration | 
|  | process is released.  To ensure that we don't strand pages on the unevictable | 
|  | list because of a race between munlock and migration, page migration uses the | 
|  | putback_lru_page() function to add migrated pages back to the LRU. | 
|  |  | 
|  |  | 
|  | COMPACTING MLOCKED PAGES | 
|  | ------------------------ | 
|  |  | 
|  | The unevictable LRU can be scanned for compactable regions and the default | 
|  | behavior is to do so.  /proc/sys/vm/compact_unevictable_allowed controls | 
|  | this behavior (see Documentation/sysctl/vm.txt).  Once scanning of the | 
|  | unevictable LRU is enabled, the work of compaction is mostly handled by | 
|  | the page migration code and the same work flow as described in MIGRATING | 
|  | MLOCKED PAGES will apply. | 
|  |  | 
|  |  | 
|  | mmap(MAP_LOCKED) SYSTEM CALL HANDLING | 
|  | ------------------------------------- | 
|  |  | 
|  | In addition to the mlock()/mlockall() system calls, an application can request | 
|  | that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap() | 
|  | call.  Furthermore, any mmap() call or brk() call that expands the heap by a | 
|  | task that has previously called mlockall() with the MCL_FUTURE flag will result | 
|  | in the newly mapped memory being mlocked.  Before the unevictable/mlock | 
|  | changes, the kernel simply called make_pages_present() to allocate pages and | 
|  | populate the page table. | 
|  |  | 
|  | To mlock a range of memory under the unevictable/mlock infrastructure, the | 
|  | mmap() handler and task address space expansion functions call | 
|  | populate_vma_page_range() specifying the vma and the address range to mlock. | 
|  |  | 
|  | The callers of populate_vma_page_range() will have already added the memory range | 
|  | to be mlocked to the task's "locked_vm".  To account for filtered VMAs, | 
|  | populate_vma_page_range() returns the number of pages NOT mlocked.  All of the | 
|  | callers then subtract a non-negative return value from the task's locked_vm.  A | 
|  | negative return value represents an error - for example, from get_user_pages() | 
|  | attempting to fault in a VMA with PROT_NONE access.  In this case, we leave the | 
|  | memory range accounted as locked_vm, as the protections could be changed later | 
|  | and pages allocated into that region. | 
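
The accounting convention described above is small enough to state as
code.  This is a sketch with a hypothetical helper name; it assumes the
caller has already added the whole range to locked_vm before calling it.

```c
/*
 * Adjust a task's locked_vm after a populate attempt.
 *
 * ret >= 0: number of pages in the range that were NOT mlocked
 *           (filtered VMAs); remove them from the accounting.
 * ret <  0: an error; leave the whole range accounted, since pages
 *           may still be faulted into it later.
 */
void toy_account_populate_result(long *locked_vm, long ret)
{
	if (ret >= 0)
		*locked_vm -= ret;
}
```

For example, if 16 pages were added to a locked_vm of 10 and the populate
step reports 4 pages not mlocked, locked_vm ends at 22; with a negative
return it stays at 26.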
|  |  | 
|  |  | 
|  | munmap()/exit()/exec() SYSTEM CALL HANDLING | 
|  | ------------------------------------------- | 
|  |  | 
|  | When unmapping an mlocked region of memory, whether by an explicit call to | 
|  | munmap() or via an internal unmap from exit() or exec() processing, we must | 
|  | munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. | 
|  | Before the unevictable/mlock changes, mlocking did not mark the pages in any | 
|  | way, so unmapping them required no processing. | 
|  |  | 
|  | To munlock a range of memory under the unevictable/mlock infrastructure, the | 
|  | munmap() handler and task address space teardown functions call | 
|  | munlock_vma_pages_all().  The name reflects the observation that one always | 
|  | specifies the entire VMA range when munlock()ing during unmap of a region. | 
|  | Because of the VMA filtering when mlock()ing regions, only "normal" VMAs that | 
|  | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | 
|  |  | 
|  | munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup() | 
|  | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | 
|  | for the VMA's memory range and munlock_vma_page() each resident page mapped by | 
|  | the VMA.  This effectively munlocks the page, only if this is the last | 
|  | VM_LOCKED VMA that maps the page. | 
|  |  | 
|  |  | 
|  | try_to_unmap() | 
|  | -------------- | 
|  |  | 
|  | Pages can, of course, be mapped into multiple VMAs.  Some of these VMAs may | 
|  | have the VM_LOCKED flag set.  It is possible for a page mapped into one or more | 
|  | VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one | 
|  | of the active or inactive LRU lists.  This could happen if, for example, a task | 
|  | in the process of munlocking the page could not isolate the page from the LRU. | 
|  | As a result, vmscan/shrink_page_list() might encounter such a page as described | 
|  | in section "vmscan's handling of unevictable pages".  To handle this situation, | 
|  | try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse | 
|  | map. | 
|  |  | 
|  | try_to_unmap() is always called, by either vmscan for reclaim or for page | 
|  | migration, with the argument page locked and isolated from the LRU.  Separate | 
|  | functions handle anonymous and mapped file pages, as these types of pages have | 
|  | different reverse map mechanisms. | 
|  |  | 
|  | (*) try_to_unmap_anon() | 
|  |  | 
|  | To unmap anonymous pages, each VMA in the list anchored in the anon_vma | 
|  | must be visited - at least until a VM_LOCKED VMA is encountered.  If the | 
|  | page is being unmapped for migration, VM_LOCKED VMAs do not stop the | 
|  | process because mlocked pages are migratable.  However, for reclaim, if | 
|  | the page is mapped into a VM_LOCKED VMA, the scan stops. | 
|  |  | 
|  | try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of | 
|  | the mm_struct to which the VMA belongs.  If this is successful, it will | 
|  | mlock the page via mlock_vma_page() - we wouldn't have gotten to | 
|  | try_to_unmap_anon() if the page were already mlocked - and will return | 
|  | SWAP_MLOCK, indicating that the page is unevictable. | 
|  |  | 
|  | If the mmap semaphore cannot be acquired, we are not sure whether the page | 
|  | is really unevictable or not.  In this case, try_to_unmap_anon() will | 
|  | return SWAP_AGAIN. | 
|  |  | 
|  | (*) try_to_unmap_file() - linear mappings | 
|  |  | 
|  | Unmapping of a mapped file page works the same as for anonymous mappings, | 
|  | except that the scan visits all VMAs that map the page's index/page offset | 
|  | in the page's mapping's reverse map priority search tree.  It also visits | 
|  | each VMA in the page's mapping's non-linear list, if the list is | 
|  | non-empty. | 
|  |  | 
|  | As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file | 
|  | page, try_to_unmap_file() will attempt to acquire the associated | 
|  | mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this | 
|  | is successful, and SWAP_AGAIN, if not. | 
|  |  | 
|  | (*) try_to_unmap_file() - non-linear mappings | 
|  |  | 
|  | If a page's mapping contains a non-empty non-linear mapping VMA list, then | 
|  | try_to_un{map|lock}() must also visit each VMA in that list to determine | 
|  | whether the page is mapped in a VM_LOCKED VMA.  Again, the scan must visit | 
|  | all VMAs in the non-linear list to ensure that the page is not/should not | 
|  | be mlocked. | 
|  |  | 
|  | If a VM_LOCKED VMA is found in the list, the scan could terminate. | 
|  | However, there is no easy way to determine whether the page is actually | 
|  | mapped in a given VMA - either for unmapping or testing whether the | 
|  | VM_LOCKED VMA actually pins the page. | 
|  |  | 
|  | try_to_unmap_file() handles non-linear mappings by scanning a certain | 
|  | number of pages - a "cluster" - in each non-linear VMA associated with the | 
|  | page's mapping, for each file mapped page that vmscan tries to unmap.  If | 
|  | this happens to unmap the page we're trying to unmap, try_to_unmap() will | 
|  | notice this on return (page_mapcount(page) will be 0) and return | 
|  | SWAP_SUCCESS.  Otherwise, it will return SWAP_AGAIN, causing vmscan to | 
|  | recirculate this page.  We take advantage of the cluster scan in | 
|  | try_to_unmap_cluster() as follows: | 
|  |  | 
|  | For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the | 
|  | mmap semaphore of the associated mm_struct for read without blocking. | 
|  |  | 
|  | If this attempt is successful and the VMA is VM_LOCKED, | 
|  | try_to_unmap_cluster() will retain the mmap semaphore for the scan; | 
|  | otherwise it drops it here. | 
|  |  | 
|  | Then, for each page in the cluster, if we're holding the mmap semaphore | 
|  | for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to | 
|  | mlock the page.  This call is a no-op if the page is already locked, | 
|  | but will mlock any pages in the non-linear mapping that happen to be | 
|  | unlocked. | 
|  |  | 
|  | If one of the pages so mlocked is the page passed in to try_to_unmap(), | 
|  | try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default | 
|  | SWAP_AGAIN.  This will allow vmscan to cull the page, rather than | 
|  | recirculating it on the inactive list. | 
|  |  | 
|  | Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap semaphore, it | 
|  | returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED | 
|  | VMA, but couldn't be mlocked. | 
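
The cluster-scan steps above can be sketched as a small user-space model.  This
is not the kernel code: the model_* names, the struct layouts and the
mmap_sem_free flag (which stands in for a non-blocking attempt to take the mmap
semaphore for read) are all assumptions made for illustration, and the actual
unmapping of pages is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named in the text; the numeric values here are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

struct model_page {
	bool mlocked;			/* stands in for PG_mlocked */
};

struct model_vma {
	bool vm_locked;			/* stands in for VM_LOCKED */
	bool mmap_sem_free;		/* would a non-blocking read lock succeed? */
	struct model_page **cluster;	/* pages covered by this cluster scan */
	size_t nr_pages;
};

/* Hypothetical stand-in for mlock_vma_page(): a no-op if already mlocked. */
void model_mlock_page(struct model_page *page)
{
	page->mlocked = true;
}

/* Model of the cluster scan for one non-linear VMA. */
enum swap_result model_unmap_cluster(struct model_vma *vma,
				     const struct model_page *check_page)
{
	enum swap_result ret = SWAP_AGAIN;	/* the default result */
	bool holding_sem_for_locked_vma = false;
	size_t i;

	assert(vma != NULL && check_page != NULL);

	/* Keep the semaphore only if the VMA is VM_LOCKED; else drop it. */
	if (vma->mmap_sem_free)
		holding_sem_for_locked_vma = vma->vm_locked;

	for (i = 0; i < vma->nr_pages; i++) {
		struct model_page *page = vma->cluster[i];

		if (holding_sem_for_locked_vma) {
			model_mlock_page(page);
			if (page == check_page)
				ret = SWAP_MLOCK;	/* vmscan can cull it */
			continue;
		}
		/* Unmapping of the page is outside the scope of this model. */
	}
	return ret;
}

/* Builds a two-page cluster whose second page is the target and scans it. */
enum swap_result model_cluster_demo(bool vm_locked, bool mmap_sem_free,
				    bool *other_page_mlocked)
{
	struct model_page other = { false }, target = { false };
	struct model_page *cluster[] = { &other, &target };
	struct model_vma vma = { vm_locked, mmap_sem_free, cluster, 2 };
	enum swap_result ret = model_unmap_cluster(&vma, &target);

	*other_page_mlocked = other.mlocked;
	return ret;
}
```

In this model, model_cluster_demo(true, true, &flag) yields SWAP_MLOCK and also
mlocks the other page in the cluster, while a busy semaphore or a VMA without
VM_LOCKED leaves both pages untouched and yields the default SWAP_AGAIN.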
|  |  | 
|  |  | 
|  | try_to_munlock() REVERSE MAP SCAN | 
|  | --------------------------------- | 
|  |  | 
|  | [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the | 
|  | page_referenced() reverse map walker. | 
|  |  | 
|  | When munlock_vma_page() [see section "munlock()/munlockall() System Call | 
|  | Handling" above] tries to munlock a page, it needs to determine whether or not | 
|  | the page is mapped by any VM_LOCKED VMA without actually attempting to unmap | 
|  | all PTEs from the page.  For this purpose, the unevictable/mlock infrastructure | 
|  | introduced a variant of try_to_unmap() called try_to_munlock(). | 
|  |  | 
|  | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | 
|  | mapped file pages with an additional argument specifying unlock versus unmap | 
|  | processing.  Again, these functions walk the respective reverse maps looking | 
|  | for VM_LOCKED VMAs.  When such a VMA is found for anonymous pages and file | 
|  | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | 
|  | attempt to acquire the associated mmap semaphore, mlock the page via | 
|  | mlock_vma_page() and return SWAP_MLOCK.  This effectively undoes the | 
|  | pre-clearing of the page's PG_mlocked done by munlock_vma_page(). | 
|  |  | 
|  | If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap | 
|  | semaphore, it will return SWAP_AGAIN.  This will allow shrink_page_list() to | 
|  | recycle the page on the inactive list and hope that it has better luck with the | 
|  | page next time. | 
|  |  | 
|  | For file pages mapped into non-linear VMAs, the try_to_munlock() logic works | 
|  | slightly differently.  On encountering a VM_LOCKED non-linear VMA that might | 
|  | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the | 
|  | page.  munlock_vma_page() will just leave the page unlocked and let vmscan deal | 
|  | with it - the usual fallback position. | 
|  |  | 
|  | Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's | 
|  | reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. | 
|  | However, the scan can terminate when it encounters a VM_LOCKED VMA and can | 
|  | successfully acquire the VMA's mmap semaphore for read and mlock the page. | 
|  | Although try_to_munlock() might be called a great many times when munlocking a | 
|  | large region or tearing down a large address space that has been mlocked via | 
|  | mlockall(), overall this is a fairly rare event. | 
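
The termination rule for the reverse map walk can be illustrated with a minimal
user-space model covering anonymous pages and linear file mappings only.  The
names model_vma, munlock_scan and model_try_to_munlock() are hypothetical, and
mmap_sem_free stands in for a successful attempt to take the mmap semaphore
for read.  The text above does not name the result returned when no VM_LOCKED
VMA is found; the sketch uses SWAP_SUCCESS for that case as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named in the text; the numeric values here are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

struct model_vma {
	bool vm_locked;		/* stands in for VM_LOCKED */
	bool mmap_sem_free;	/* could the mmap semaphore be taken for read? */
};

struct munlock_scan {
	enum swap_result result;
	size_t vmas_visited;	/* how far the walk had to go */
	bool page_mlocked;	/* was PG_mlocked set again? */
};

/* Model of the reverse map walk over the VMAs that map one page. */
struct munlock_scan model_try_to_munlock(const struct model_vma *vmas,
					 size_t nr_vmas)
{
	/* Assumption: SWAP_SUCCESS means "no VM_LOCKED VMA maps the page". */
	struct munlock_scan scan = { SWAP_SUCCESS, 0, false };
	size_t i;

	assert(vmas != NULL || nr_vmas == 0);

	for (i = 0; i < nr_vmas; i++) {
		scan.vmas_visited++;

		if (!vmas[i].vm_locked)
			continue;	/* proves nothing; keep walking */

		if (vmas[i].mmap_sem_free) {
			/* Re-mlock the page; the walk can stop here. */
			scan.page_mlocked = true;
			scan.result = SWAP_MLOCK;
			break;
		}

		/* Saw a VM_LOCKED VMA but could not mlock the page. */
		scan.result = SWAP_AGAIN;
	}
	return scan;
}
```

A walk over { unlocked, locked, locked } stops at the second VMA with
SWAP_MLOCK, whereas a walk over three unlocked VMAs has to visit all three
before it can conclude that the page is not mlocked.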
|  |  | 
|  |  | 
|  | PAGE RECLAIM IN shrink_*_list() | 
|  | ------------------------------- | 
|  |  | 
|  | shrink_active_list() culls any obviously unevictable pages - i.e. | 
|  | !page_evictable(page) - diverting these to the unevictable list. | 
|  | However, shrink_active_list() only sees unevictable pages that made it onto the | 
|  | active/inactive lru lists.  Note that these pages do not have PageUnevictable | 
|  | set - otherwise they would be on the unevictable list and shrink_active_list() | 
|  | would never see them. | 
|  |  | 
|  | Some examples of these unevictable pages on the LRU lists are: | 
|  |  | 
|  | (1) ramfs pages that have been placed on the LRU lists when first allocated. | 
|  |  | 
|  | (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to | 
|  | allocate or fault in the pages in the shared memory region.  This happens | 
|  | when an application accesses the page the first time after SHM_LOCK'ing | 
|  | the segment. | 
|  |  | 
|  | (3) mlocked pages that could not be isolated from the LRU and moved to the | 
|  | unevictable list in mlock_vma_page(). | 
|  |  | 
|  | (4) Pages mapped into multiple VM_LOCKED VMAs for which try_to_munlock() | 
|  | couldn't acquire the VMA's mmap semaphore to test the flags and set | 
|  | PageMlocked.  munlock_vma_page() was forced to let the page back on to the | 
|  | normal LRU list for vmscan to handle. | 
|  |  | 
|  | shrink_inactive_list() also diverts any unevictable pages that it finds on the | 
|  | inactive lists to the appropriate zone's unevictable list. | 
|  |  | 
|  | shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd | 
|  | after shrink_active_list() had moved them to the inactive list, or pages mapped | 
|  | into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to | 
|  | recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter, | 
|  | but will pass them on to shrink_page_list(). | 
|  |  | 
|  | shrink_page_list() again culls obviously unevictable pages that it could | 
|  | encounter, for reasons similar to those for shrink_inactive_list().  Pages mapped into | 
|  | VM_LOCKED VMAs but without PG_mlocked set will make it all the way to | 
|  | try_to_unmap().  shrink_page_list() will divert them to the unevictable list | 
|  | when try_to_unmap() returns SWAP_MLOCK, as discussed above. |
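
The decisions that shrink_page_list() makes for a single page, as described in
this section, can be summarised in a simplified user-space model.  The
destination enum, struct model_page and the model_* functions are hypothetical
names introduced for this sketch; the evictable field stands in for the result
of page_evictable(page), and unmap_result for whatever try_to_unmap() would
return.  Everything else that the real function does is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes named in the text; the numeric values here are arbitrary. */
enum swap_result { SWAP_SUCCESS, SWAP_AGAIN, SWAP_MLOCK };

/* Hypothetical labels for where a page ends up in this model. */
enum destination { DEST_RECLAIM, DEST_INACTIVE, DEST_UNEVICTABLE };

struct model_page {
	bool evictable;			/* what page_evictable() would say */
	bool mapped;			/* is the page mapped by any PTE? */
	enum swap_result unmap_result;	/* what try_to_unmap() would return */
};

/* Model of the per-page decision. */
enum destination model_shrink_one(const struct model_page *page)
{
	assert(page != NULL);

	/* Obviously unevictable pages are culled up front. */
	if (!page->evictable)
		return DEST_UNEVICTABLE;

	if (page->mapped) {
		switch (page->unmap_result) {
		case SWAP_MLOCK:	/* found in a VM_LOCKED VMA, now mlocked */
			return DEST_UNEVICTABLE;
		case SWAP_AGAIN:	/* recirculate and retry later */
			return DEST_INACTIVE;
		case SWAP_SUCCESS:	/* unmapped; reclaim can proceed */
			break;
		}
	}
	return DEST_RECLAIM;
}

/* Counts how many pages in a batch are diverted to the unevictable list. */
size_t model_count_culled(const struct model_page *pages, size_t nr_pages)
{
	size_t i, culled = 0;

	assert(pages != NULL || nr_pages == 0);

	for (i = 0; i < nr_pages; i++)
		if (model_shrink_one(&pages[i]) == DEST_UNEVICTABLE)
			culled++;
	return culled;
}
```

In the model, a page in a VM_LOCKED VMA that lacks PG_mlocked looks evictable
at the first check and is diverted only once try_to_unmap() reports
SWAP_MLOCK, which mirrors the late culling path described above.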