| ----------------------------------------------------------------------------- |
| This file contains a concatenation of the PCRE2 man pages, converted to plain |
| text format for ease of searching with a text editor, or for use on systems |
| that do not have a man page processor. The small individual files that give |
| synopses of each function in the library have not been included. Neither has |
| the pcre2demo program. There are separate text files for the pcre2grep and |
| pcre2test commands. |
| ----------------------------------------------------------------------------- |
| |
| |
| PCRE2(3) Library Functions Manual PCRE2(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| INTRODUCTION |
| |
| PCRE2 is the name used for a revised API for the PCRE library, which is |
| a set of functions, written in C, that implement regular expression |
| pattern matching using the same syntax and semantics as Perl, with just |
| a few differences. Some features that appeared in Python and the origi- |
| nal PCRE before they appeared in Perl are also available using the |
| Python syntax. There is also some support for one or two .NET and Onig- |
| uruma syntax items, and there are options for requesting some minor |
| changes that give better ECMAScript (aka JavaScript) compatibility. |
| |
| The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or |
| 32-bit code units, which means that up to three separate libraries may |
| be installed. The original work to extend PCRE to 16-bit and 32-bit |
| code units was done by Zoltan Herczeg and Christian Persch, respec- |
| tively. In all three cases, strings can be interpreted either as one |
| character per code unit, or as UTF-encoded Unicode, with support for |
| Unicode general category properties. Unicode support is optional at |
| build time (but is the default). However, processing strings as UTF |
| code units must be enabled explicitly at run time. The version of Uni- |
| code in use can be discovered by running |
| |
| pcre2test -C |
| |
| The three libraries contain identical sets of functions, with names |
| ending in _8, _16, or _32, respectively (for example, pcre2_com- |
| pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or |
| 32, a program that uses just one code unit width can be written using |
| generic names such as pcre2_compile(), and the documentation is written |
| assuming that this is the case. |
| |
| In addition to the Perl-compatible matching function, PCRE2 contains an |
| alternative function that matches the same compiled patterns in a dif- |
| ferent way. In certain circumstances, the alternative function has some |
| advantages. For a discussion of the two matching algorithms, see the |
| pcre2matching page. |
| |
| Details of exactly which Perl regular expression features are and are |
| not supported by PCRE2 are given in separate documents. See the |
| pcre2pattern and pcre2compat pages. There is a syntax summary in the |
| pcre2syntax page. |
| |
| Some features of PCRE2 can be included, excluded, or changed when the |
| library is built. The pcre2_config() function makes it possible for a |
| client to discover which features are available. The features them- |
| selves are described in the pcre2build page. Documentation about build- |
| ing PCRE2 for various operating systems can be found in the README and |
| NON-AUTOTOOLS_BUILD files in the source distribution. |
| |
| The libraries contain a number of undocumented internal functions and |
| data tables that are used by more than one of the exported external |
| functions, but which are not intended for use by external callers. |
| Their names all begin with "_pcre2", which hopefully will not provoke |
| any name clashes. In some environments, it is possible to control which |
| external symbols are exported when a shared library is built, and in |
| these cases the undocumented symbols are not exported. |
| |
| |
| SECURITY CONSIDERATIONS |
| |
| If you are using PCRE2 in a non-UTF application that permits users to |
| supply arbitrary patterns for compilation, you should be aware of a |
| feature that allows users to turn on UTF support from within a pattern. |
| For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8 |
| mode, which interprets patterns and subjects as strings of UTF-8 code |
| units instead of individual 8-bit characters. This causes both the pat- |
| tern and any data against which it is matched to be checked for UTF-8 |
| validity. If the data string is very long, such a check might consume |
| enough resources to degrade your application's performance. |
| |
| One way of guarding against this possibility is to use the pcre2_pat- |
| tern_info() function to check the compiled pattern's options for |
| PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when |
| calling pcre2_compile(). This causes a compile-time error if a pattern |
| contains a UTF-setting sequence. |
| |
| The use of Unicode properties for character types such as \d can also |
| be enabled from within the pattern, by specifying "(*UCP)". This fea- |
| ture can be disallowed by setting the PCRE2_NEVER_UCP option. |
| |
| If your application is one that supports UTF, be aware that validity |
| checking can take time. If the same data string is to be matched many |
| times, you can use the PCRE2_NO_UTF_CHECK option for the second and |
| subsequent matches to avoid running redundant checks. |
| |
| The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead |
| to problems, because it may leave the current matching point in the |
| middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C |
| option can be used by an application to lock out the use of \C, causing |
| a compile-time error if it is encountered. It is also possible to build |
| PCRE2 with the use of \C permanently disabled. |
| |
| Performance can also suffer when you run a pattern that |
| has a very large search tree against a string that will never match. |
| Nested unlimited repeats in a pattern are a common example. PCRE2 pro- |
| vides some protection against this: see the pcre2_set_match_limit() |
| function in the pcre2api page. |
| |
| |
| USER DOCUMENTATION |
| |
| The user documentation for PCRE2 comprises a number of different sec- |
| tions. In the "man" format, each of these is a separate "man page". In |
| the HTML format, each is a separate page, linked from the index page. |
| In the plain text format, the descriptions of the pcre2grep and |
| pcre2test programs are in files called pcre2grep.txt and pcre2test.txt, |
| respectively. The remaining sections, except for the pcre2demo section |
| (which is a program listing), and the short pages for individual func- |
| tions, are concatenated in pcre2.txt, for ease of searching. The sec- |
| tions are as follows: |
| |
| pcre2 this document |
| pcre2-config show PCRE2 installation configuration information |
| pcre2api details of PCRE2's native C API |
| pcre2build building PCRE2 |
| pcre2callout details of the callout feature |
| pcre2compat discussion of Perl compatibility |
| pcre2demo a demonstration C program that uses PCRE2 |
| pcre2grep description of the pcre2grep command (8-bit only) |
| pcre2jit discussion of just-in-time optimization support |
| pcre2limits details of size and other limits |
| pcre2matching discussion of the two matching algorithms |
| pcre2partial details of the partial matching facility |
| pcre2pattern syntax and semantics of supported regular |
| expression patterns |
| pcre2perform discussion of performance issues |
| pcre2posix the POSIX-compatible C API for the 8-bit library |
| pcre2sample discussion of the pcre2demo program |
| pcre2stack discussion of stack usage |
| pcre2syntax quick syntax reference |
| pcre2test description of the pcre2test command |
| pcre2unicode discussion of Unicode and UTF support |
| |
| In the "man" and HTML formats, there is also a short page for each C |
| library function, listing its arguments and results. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| Putting an actual email address here is a spam magnet. If you want to |
| email me, use my two initials, followed by the two digits 10, at the |
| domain cam.ac.uk. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2API(3) Library Functions Manual PCRE2API(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| #include <pcre2.h> |
| |
| PCRE2 is a new API for PCRE. This document contains a description of |
| all its functions. See the pcre2 document for an overview of all the |
| PCRE2 documentation. |
| |
| |
| PCRE2 NATIVE API BASIC FUNCTIONS |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| |
| PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const unsigned char *tables); |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| |
| PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_recursion_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_recursion_memory_management( |
| pcre2_match_context *mcontext, |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| |
| PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| |
| PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| |
| PCRE2 NATIVE API JIT FUNCTIONS |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| |
| PCRE2 NATIVE API SERIALIZATION FUNCTIONS |
| |
| int32_t pcre2_serialize_decode(pcre2_code **codes, |
| int32_t number_of_codes, const uint8_t *bytes, |
| pcre2_general_context *gcontext); |
| |
| int32_t pcre2_serialize_encode(const pcre2_code **codes, |
| int32_t number_of_codes, uint8_t **serialized_bytes, |
| PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); |
| |
| void pcre2_serialize_free(uint8_t *bytes); |
| |
| int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); |
| |
| |
| PCRE2 NATIVE API AUXILIARY FUNCTIONS |
| |
| int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE bufflen); |
| |
| const unsigned char *pcre2_maketables(pcre2_general_context *gcontext); |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where); |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| |
| PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES |
| |
| There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit |
| code units, respectively. However, there is just one header file, |
| pcre2.h. This contains the function prototypes and other definitions |
| for all three libraries. One, two, or all three can be installed simul- |
| taneously. On Unix-like systems the libraries are called libpcre2-8, |
| libpcre2-16, and libpcre2-32, and they can also co-exist with the orig- |
| inal PCRE libraries. |
| |
| Character strings are passed to and from a PCRE2 library as a sequence |
| of unsigned integers in code units of the appropriate width. Every |
| PCRE2 function comes in three different forms, one for each library, |
| for example: |
| |
| pcre2_compile_8() |
| pcre2_compile_16() |
| pcre2_compile_32() |
| |
| There are also three different sets of data types: |
| |
| PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32 |
| PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32 |
| |
| The UCHAR types define unsigned code units of the appropriate widths. |
| For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR |
| types are constant pointers to the equivalent UCHAR types, that is, |
| they are pointers to vectors of unsigned code units. |
| |
| Many applications use only one code unit width. For their convenience, |
| macros are defined whose names are the generic forms such as pcre2_com- |
| pile() and PCRE2_SPTR. These macros use the value of the macro |
| PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func- |
| tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default. |
| An application must define it to be 8, 16, or 32 before including |
| pcre2.h in order to make use of the generic names. |
| |
| Applications that use more than one code unit width can be linked with |
| more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to |
| be 0 before including pcre2.h, and then use the real function names. |
| Any code that is to be included in an environment where the value of |
| PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function |
| names. (Unfortunately, it is not possible in C code to save and restore |
| the value of a macro.) |
| |
| If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a |
| compiler error occurs. |
| |
| When using multiple libraries in an application, you must take care |
| when processing any particular pattern to use only functions from a |
| single library. For example, if you want to run a match using a pat- |
| tern that was compiled with pcre2_compile_16(), you must do so with |
| pcre2_match_16(), not pcre2_match_8(). |
| |
| In the function summaries above, and in the rest of this document and |
| other PCRE2 documents, functions and data types are described using |
| their generic names, without the 8, 16, or 32 suffix. |
| |
| |
| PCRE2 API OVERVIEW |
| |
| PCRE2 has its own native API, which is described in this document. |
| There are also some wrapper functions for the 8-bit library that corre- |
| spond to the POSIX regular expression API, but they do not give access |
| to all the functionality. They are described in the pcre2posix documen- |
| tation. Both these APIs define a set of C function calls. |
| |
| The native API C data types, function prototypes, option values, and |
| error codes are defined in the header file pcre2.h, which contains def- |
| initions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release |
| numbers for the library. Applications can use these to include support |
| for different releases of PCRE2. |
| |
| In a Windows environment, if you want to statically link an application |
| program against a non-dll PCRE2 library, you must define PCRE2_STATIC |
| before including pcre2.h. |
| |
| The functions pcre2_compile() and pcre2_match() are used for compiling |
| and matching regular expressions in a Perl-compatible manner. A sample |
| program that demonstrates the simplest way of using them is provided in |
| the file called pcre2demo.c in the PCRE2 source distribution. A listing |
| of this program is given in the pcre2demo documentation, and the |
| pcre2sample documentation describes how to compile and run it. |
| |
| Just-in-time compiler support is an optional feature of PCRE2 that can |
| be built in appropriate hardware environments. It greatly speeds up the |
| matching performance of many patterns. Programs can request that it be |
| used if available, by calling pcre2_jit_compile() after a pattern has |
| been successfully compiled by pcre2_compile(). This does nothing if JIT |
| support is not available. |
| |
| More complicated programs might need to make use of the specialist |
| functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and |
| pcre2_jit_stack_assign() in order to control the JIT code's memory |
| usage. |
| |
| JIT matching is automatically used by pcre2_match() if it is available. |
| There is also a direct interface for JIT matching, which gives improved |
| performance. The JIT-specific functions are discussed in the pcre2jit |
| documentation. |
| |
| A second matching function, pcre2_dfa_match(), which is not Perl-com- |
| patible, is also provided. This uses a different algorithm for the |
| matching. The alternative algorithm finds all possible matches (at a |
| given point in the subject), and scans the subject just once (unless |
| there are lookbehind assertions). However, this algorithm does not |
| return captured substrings. A description of the two matching algo- |
| rithms and their advantages and disadvantages is given in the |
| pcre2matching documentation. There is no JIT support for |
| pcre2_dfa_match(). |
| |
| In addition to the main compiling and matching functions, there are |
| convenience functions for extracting captured substrings from a subject |
| string that has been matched by pcre2_match(). They are: |
| |
| pcre2_substring_copy_byname() |
| pcre2_substring_copy_bynumber() |
| pcre2_substring_get_byname() |
| pcre2_substring_get_bynumber() |
| pcre2_substring_list_get() |
| pcre2_substring_length_byname() |
| pcre2_substring_length_bynumber() |
| pcre2_substring_nametable_scan() |
| pcre2_substring_number_from_name() |
| |
| pcre2_substring_free() and pcre2_substring_list_free() are also pro- |
| vided, to free the memory used for extracted strings. |
| |
| The function pcre2_substitute() can be called to match a pattern and |
| return a copy of the subject string with substitutions for parts that |
| were matched. |
| |
| Finally, there are functions for finding out information about a com- |
| piled pattern (pcre2_pattern_info()) and about the configuration with |
| which PCRE2 was built (pcre2_config()). |
| |
| |
| STRING LENGTHS AND OFFSETS |
| |
| The PCRE2 API uses string lengths and offsets into strings of code |
| units in several places. These values are always of type PCRE2_SIZE, |
| which is an unsigned integer type, currently always defined as size_t. |
| The largest value that can be stored in such a type (that is |
| ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated |
| strings and unset offsets. Therefore, the longest string that can be |
| handled is one less than this maximum. |
| |
| |
| NEWLINES |
| |
| PCRE2 supports five different conventions for indicating line breaks in |
| strings: a single CR (carriage return) character, a single LF (line- |
| feed) character, the two-character sequence CRLF, any of the three pre- |
| ceding, or any Unicode newline sequence. The Unicode newline sequences |
| are the three just mentioned, plus the single characters VT (vertical |
| tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line |
| separator, U+2028), and PS (paragraph separator, U+2029). |
| |
| Each of the first three conventions is used by at least one operating |
| system as its standard newline sequence. When PCRE2 is built, a default |
| can be specified; if none is, the default is LF, the Unix standard. |
| However, the newline convention can be changed by an application |
| when calling pcre2_compile(), or it can be specified by special text at |
| the start of the pattern itself; this overrides any other settings. See |
| the pcre2pattern page for details of the special character sequences. |
| |
| In the PCRE2 documentation the word "newline" is used to mean "the |
| character or pair of characters that indicate a line break". The choice |
| of newline convention affects the handling of the dot, circumflex, and |
| dollar metacharacters, the handling of #-comments in /x mode, and, when |
| CRLF is a recognized line ending sequence, the match position advance- |
| ment for a non-anchored pattern. There is more detail about this in the |
| section on pcre2_match() options below. |
| |
| The choice of newline convention does not affect the interpretation of |
| the \n or \r escape sequences, nor does it affect what \R matches; this |
| has its own separate convention. |
| |
| |
| MULTITHREADING |
| |
| In a multithreaded application it is important to keep thread-specific |
| data separate from data that can be shared between threads. The PCRE2 |
| library code itself is thread-safe: it contains no static or global |
| variables. The API is designed to be fairly simple for non-threaded |
| applications while at the same time ensuring that multithreaded appli- |
| cations can use it. |
| |
| There are several different blocks of data that are used to pass infor- |
| mation between the application and the PCRE2 libraries. |
| |
| (1) A pointer to the compiled form of a pattern is returned to the user |
| when pcre2_compile() is successful. The data in the compiled pattern is |
| fixed, and does not change when the pattern is matched. Therefore, it |
| is thread-safe, that is, the same compiled pattern can be used by more |
| than one thread simultaneously. An application can compile all its pat- |
| terns at the start, before forking off multiple threads that use them. |
| However, if the just-in-time optimization feature is being used, it |
| needs separate memory stack areas for each thread. See the pcre2jit |
| documentation for more details. |
| |
| (2) The next section below introduces the idea of "contexts" in which |
| PCRE2 functions are called. A context is nothing more than a collection |
| of parameters that control the way PCRE2 operates. Grouping a number of |
| parameters together in a context is a convenient way of passing them to |
| a PCRE2 function without using lots of arguments. The parameters that |
| are stored in contexts are in some sense "advanced features" of the |
| API. Many straightforward applications will not need to use contexts. |
| |
| In a multithreaded application, if the parameters in a context are val- |
| ues that are never changed, the same context can be used by all the |
| threads. However, if any thread needs to change any value in a context, |
| it must make its own thread-specific copy. |
| |
| (3) The matching functions need a block of memory for working space and |
| for storing the results of a match. This includes details of what was |
| matched, as well as additional information such as the name of a |
| (*MARK) setting. Each thread must provide its own version of this mem- |
| ory. |
| |
| |
| PCRE2 CONTEXTS |
| |
| Some PCRE2 functions have a lot of parameters, many of which are used |
| only by specialist applications, for example, those that use custom |
| memory management or non-standard character tables. To keep function |
| argument lists at a reasonable size, and at the same time to keep the |
| API extensible, "uncommon" parameters are passed to certain functions |
| in a context instead of directly. A context is just a block of memory |
| that holds the parameter values. Applications that do not need to |
| adjust any of the context parameters can pass NULL when a context |
| pointer is required. |
| |
| There are three different types of context: a general context that is |
| relevant for several PCRE2 operations, a compile-time context, and a |
| match-time context. |
| |
| The general context |
| |
| At present, this context just contains pointers to (and data for) |
| external memory management functions that are called from several |
| places in the PCRE2 library. The context is named `general' rather than |
| specifically `memory' because in future other fields may be added. If |
| you do not want to supply your own custom memory management functions, |
| you do not need to bother with a general context. A general context is |
| created by: |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| The two function pointers specify custom memory management functions, |
| whose prototypes are: |
| |
| void *private_malloc(PCRE2_SIZE, void *); |
| void private_free(void *, void *); |
| |
| Whenever code in PCRE2 calls these functions, the final argument is the |
| value of memory_data. Either of the first two arguments of the creation |
| function may be NULL, in which case the system memory management func- |
| tions malloc() and free() are used. (This is not currently useful, as |
| there are no other fields in a general context, but in future there |
| might be.) The private_malloc() function is used (if supplied) to |
| obtain memory for storing the context, and all three values are saved |
| as part of the context. |
| |
| Whenever PCRE2 creates a data block of any kind, the block contains a |
| pointer to the free() function that matches the malloc() function that |
| was used. When the time comes to free the block, this function is |
| called. |
| |
| A general context can be copied by calling: |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| The memory used for a general context should be freed by calling: |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| |
| The compile context |
| |
| A compile context is required if you want to change the default values |
| of any of the following compile-time parameters: |
| |
| What \R matches (Unicode newlines or CR, LF, CRLF only) |
| PCRE2's character tables |
| The newline character sequence |
| The compile time nested parentheses limit |
| The maximum length of the pattern string |
| An external function for stack checking |
| |
| A compile context is also required if you are using custom memory man- |
| agement. If none of these apply, just pass NULL as the context argu- |
| ment of pcre2_compile(). |
| |
| A compile context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| A compile context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only |
| CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any |
| Unicode line ending sequence. The value is used by the JIT compiler and |
| by the two interpreted matching functions, pcre2_match() and |
| pcre2_dfa_match(). |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const unsigned char *tables); |
| |
| The value must be the result of a call to pcre2_maketables(), whose |
| only argument is a general context. This function builds a set of char- |
| acter tables in the current locale. |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| This sets a maximum length, in code units, for the pattern string that |
| is to be compiled. If the pattern is longer, an error is generated. |
| This facility is provided so that applications that accept patterns |
| from external sources can limit their size. The default is the largest |
| number that a PCRE2_SIZE variable can hold, which is effectively unlim- |
| ited. |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This specifies which characters or character sequences are to be recog- |
| nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage |
| return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the |
| two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any |
| of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence). |
| |
| When a pattern is compiled with the PCRE2_EXTENDED option, the value of |
| this parameter affects the recognition of white space and the end of |
| internal comments starting with #. The value is saved with the compiled |
| pattern for subsequent use by the JIT compiler and by the two inter- |
| preted matching functions, pcre2_match() and pcre2_dfa_match(). |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This parameter adjusts the limit, set when PCRE2 is built (default 250), |
| on the depth of parenthesis nesting in a pattern. This limit stops |
| rogue patterns using up too much system stack when being compiled. |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| There is at least one application that runs PCRE2 in threads with very |
| limited system stack, where running out of stack is to be avoided at |
| all costs. The parenthesis limit above cannot take account of how much |
| stack is actually available. For a finer control, you can supply a |
| function that is called whenever pcre2_compile() starts to compile a |
| parenthesized part of a pattern. This function can check the actual |
| stack size (or anything else that it wants to, of course). |
| |
| The first argument to the guard function gives the current depth of |
| nesting, and the second is user data that is set up by the last argu- |
| ment of pcre2_set_compile_recursion_guard(). The guard function should |
| return zero if all is well, or non-zero to force an error. |
| |
| The match context |
| |
| A match context is required if you want to change the default values of |
| any of the following match-time parameters: |
| |
| A callout function |
| The offset limit for matching an unanchored pattern |
| The limit for calling match() (see below) |
| The limit for calling match() recursively |
| |
| A match context is also required if you are using custom memory manage- |
| ment. If none of these apply, just pass NULL as the context argument |
| of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match(). |
| |
| A match context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| A match context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| This sets up a "callout" function, which PCRE2 will call at specified |
| points during a matching operation. Details are given in the pcre2call- |
| out documentation. |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| The offset_limit parameter limits how far an unanchored search can |
| advance in the subject string. The default value is PCRE2_UNSET. The |
| pcre2_match() and pcre2_dfa_match() functions return |
| PCRE2_ERROR_NOMATCH if a match with a starting point before or at the |
| given offset is not found. For example, if the pattern /abc/ is matched |
| against "123abc" with an offset limit less than 3, the result is |
| PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset |
| argument of pcre2_match() or pcre2_dfa_match() is greater than the off- |
| set limit. |
| |
| When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when |
| calling pcre2_compile() so that when JIT is in use, different code can |
| be compiled. If a match is started with a non-default offset limit when |
| PCRE2_USE_OFFSET_LIMIT is not set, an error is generated. |
| |
| The offset limit facility can be used to track progress when searching |
| large subject strings. See also the PCRE2_FIRSTLINE option, which |
| requires a match to start within the first line of the subject. If this |
| is set with an offset limit, a match must occur in the first line and |
| also within the offset limit. In other words, whichever limit comes |
| first is used. |
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The match_limit parameter provides a means of preventing PCRE2 from |
| using up too many resources when processing patterns that are not going |
| to match, but which have a very large number of possibilities in their |
| search trees. The classic example is a pattern that uses nested unlim- |
| ited repeats. |
| |
| Internally, pcre2_match() uses a function called match(), which it |
| calls repeatedly (sometimes recursively). The limit set by match_limit |
| is imposed on the number of times this function is called during a |
| match, which has the effect of limiting the amount of backtracking that |
| can take place. For patterns that are not anchored, the count restarts |
| from zero for each position in the subject string. This limit is not |
| relevant to pcre2_dfa_match(), which ignores it. |
| |
| When pcre2_match() is called with a pattern that was successfully pro- |
| cessed by pcre2_jit_compile(), the way in which matching is executed is |
| entirely different. However, there is still the possibility of runaway |
| matching that goes on for a very long time, and so the match_limit |
| value is also used in this case (but in a different way) to limit how |
| long the matching can continue. |
| |
| The default value for the limit can be set when PCRE2 is built; unless |
| changed there, it is 10 million, which handles all but the most extreme |
| cases. If the limit is exceeded, pcre2_match() returns |
| PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be sup- |
| plied by an item at the start of a pattern of the form |
| |
| (*LIMIT_MATCH=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored |
| unless ddd is less than the limit set by the caller of pcre2_match() |
| or, if no such limit is set, less than the default. |
| |
| int pcre2_set_recursion_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The recursion_limit parameter is similar to match_limit, but instead of |
| limiting the total number of times that match() is called, it limits |
| the depth of recursion. The recursion depth is a smaller number than |
| the total number of calls, because not all calls to match() are recur- |
| sive. This limit is of use only if it is set smaller than match_limit. |
| |
| Limiting the recursion depth limits the amount of system stack that can |
| be used, or, when PCRE2 has been compiled to use memory on the heap |
| instead of the stack, the amount of heap memory that can be used. This |
| limit is not relevant, and is ignored, when matching is done using JIT |
| compiled code or by the pcre2_dfa_match() function. |
| |
| The default value for recursion_limit can be set when PCRE2 is built; |
| unless changed there, it is the same value as the default for |
| match_limit. |
| If the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION- |
| LIMIT. A value for the recursion limit may also be supplied by an item |
| at the start of a pattern of the form |
| |
| (*LIMIT_RECURSION=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored |
| unless ddd is less than the limit set by the caller of pcre2_match() |
| or, if no such limit is set, less than the default. |
| |
| int pcre2_set_recursion_memory_management( |
| pcre2_match_context *mcontext, |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| This function sets up two additional custom memory management functions |
| for use by pcre2_match() when PCRE2 is compiled to use the heap for |
| remembering backtracking data, instead of recursive function calls that |
| use the system stack. There is a discussion about PCRE2's stack usage |
| in the pcre2stack documentation. See the pcre2build documentation for |
| details of how to build PCRE2. |
| |
| Using the heap for recursion is a non-standard way of building PCRE2, |
| for use in environments that have limited stacks. Because of the |
| greater use of memory management, pcre2_match() runs more slowly. Func- |
| tions that are different to the general custom memory functions are |
| provided so that special-purpose external code can be used for this |
| case, because the memory blocks are all the same size. The blocks are |
| retained by pcre2_match() until it is about to exit so that they can be |
| re-used when possible during the match. In the absence of these func- |
| tions, the normal custom memory management functions are used, if sup- |
| plied, otherwise the system functions. |
| |
| |
| CHECKING BUILD-TIME OPTIONS |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| The function pcre2_config() makes it possible for a PCRE2 client to |
| discover which optional features have been compiled into the PCRE2 |
| library. The pcre2build documentation has more details about these |
| optional features. |
| |
| The first argument for pcre2_config() specifies which information is |
| required. The second argument is a pointer to memory into which the |
| information is placed. If NULL is passed, the function returns the |
| amount of memory that is needed for the requested information. For |
| calls that return numerical values, the value is in bytes; when |
| requesting these values, where should point to appropriately aligned |
| memory. For calls that return strings, the required length is given in |
| code units, not counting the terminating zero. |
| |
| When requesting information, the returned value from pcre2_config() is |
| non-negative on success, or the negative error code PCRE2_ERROR_BADOP- |
| TION if the value in the first argument is not recognized. The follow- |
| ing information is available: |
| |
| PCRE2_CONFIG_BSR |
| |
| The output is a uint32_t integer whose value indicates what character |
| sequences the \R escape sequence matches by default. A value of |
| PCRE2_BSR_UNICODE means that \R matches any Unicode line ending |
| sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, |
| LF, or CRLF. The default can be overridden when a pattern is compiled. |
| |
| PCRE2_CONFIG_JIT |
| |
| The output is a uint32_t integer that is set to one if support for |
| just-in-time compiling is available; otherwise it is set to zero. |
| |
| PCRE2_CONFIG_JITTARGET |
| |
| The where argument should point to a buffer that is at least 48 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with a |
| string that contains the name of the architecture for which the JIT |
| compiler is configured, for example "x86 32bit (little endian + |
| unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is |
| returned, otherwise the number of code units used is returned. This is |
| the length of the string, plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_LINKSIZE |
| |
| The output is a uint32_t integer that contains the number of bytes used |
| for internal linkage in compiled regular expressions. When PCRE2 is |
| configured, the value can be set to 2, 3, or 4, with the default being |
| 2. This is the value that is returned by pcre2_config(). However, when |
| the 16-bit library is compiled, a value of 3 is rounded up to 4, and |
| when the 32-bit library is compiled, internal linkages always use 4 |
| bytes, so the configured value is not relevant. |
| |
| The default value of 2 for the 8-bit and 16-bit libraries is sufficient |
| for all but the most massive patterns, since it allows the size of the |
| compiled pattern to be up to 64K code units. Larger values allow larger |
| regular expressions to be compiled by those two libraries, but at the |
| expense of slower matching. |
| |
| PCRE2_CONFIG_MATCHLIMIT |
| |
| The output is a uint32_t integer that gives the default limit for the |
| number of internal matching function calls in a pcre2_match() execu- |
| tion. Further details are given with pcre2_match() below. |
| |
| PCRE2_CONFIG_NEWLINE |
| |
| The output is a uint32_t integer whose value specifies the default |
| character sequence that is recognized as meaning "newline". The values |
| are: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| |
| The default should normally correspond to the standard sequence for |
| your operating system. |
| |
| PCRE2_CONFIG_PARENSLIMIT |
| |
| The output is a uint32_t integer that gives the maximum depth of nest- |
| ing of parentheses (of any kind) in a pattern. This limit is imposed to |
| cap the amount of system stack used when a pattern is compiled. It is |
| specified when PCRE2 is built; the default is 250. This limit does not |
| take into account the stack that may already be used by the calling |
| application. For finer control over compilation stack usage, see |
| pcre2_set_compile_recursion_guard(). |
| |
| PCRE2_CONFIG_RECURSIONLIMIT |
| |
| The output is a uint32_t integer that gives the default limit for the |
| depth of recursion when calling the internal matching function in a |
| pcre2_match() execution. Further details are given with pcre2_match() |
| below. |
| |
| PCRE2_CONFIG_STACKRECURSE |
| |
| The output is a uint32_t integer that is set to one if internal recur- |
| sion when running pcre2_match() is implemented by recursive function |
| calls that use the system stack to remember their state. This is the |
| usual way that PCRE2 is compiled. The output is zero if PCRE2 was com- |
| piled to use blocks of data on the heap instead of recursive function |
| calls. |
| |
| PCRE2_CONFIG_UNICODE_VERSION |
| |
| The where argument should point to a buffer that is at least 24 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) If PCRE2 has been compiled |
| without Unicode support, the buffer is filled with the text "Unicode |
| not supported". Otherwise, the Unicode version string (for example, |
| "8.0.0") is inserted. The number of code units used is returned. This |
| is the length of the string plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_UNICODE |
| |
| The output is a uint32_t integer that is set to one if Unicode support |
| is available; otherwise it is set to zero. Unicode support implies UTF |
| support. |
| |
| PCRE2_CONFIG_VERSION |
| |
| The where argument should point to a buffer that is at least 12 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with the |
| PCRE2 version string, zero-terminated. The number of code units used is |
| returned. This is the length of the string plus one unit for the termi- |
| nating zero. |
| |
| |
| COMPILING A PATTERN |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| The pcre2_compile() function compiles a pattern into an internal form. |
| The pattern is defined by a pointer to a string of code units and a |
| length. If the pattern is zero-terminated, the length can be specified |
| as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of |
| memory that contains the compiled pattern and related data. The caller |
| must free the memory by calling pcre2_code_free() when it is no longer |
| needed. |
| |
| NOTE: When one of the matching functions is called, pointers to the |
| compiled pattern and the subject string are set in the match data block |
| so that they can be referenced by the extraction functions. After run- |
| ning a match, you must not free a compiled pattern (or a subject |
| string) until after all operations on the match data block have taken |
| place. |
| |
| If the compile context argument ccontext is NULL, memory for the com- |
| piled pattern is obtained by calling malloc(). Otherwise, it is |
| obtained from the same memory function that was used for the compile |
| context. |
| |
| The options argument contains various bit settings that affect the com- |
| pilation. It should be zero if no options are required. The available |
| options are described below. Some of them (in particular, those that |
| are compatible with Perl, but some others as well) can also be set and |
| unset from within the pattern (see the detailed description in the |
| pcre2pattern documentation). |
| |
| For those options that can be different in different parts of the pat- |
| tern, the contents of the options argument specifies their settings at |
| the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK |
| options can be set at the time of matching as well as at compile time. |
| |
| Other, less frequently required compile-time parameters (for example, |
| the newline setting) can be provided in a compile context (as described |
| above). |
| |
| If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- |
| diately. Otherwise, if compilation of a pattern fails, pcre2_compile() |
| returns NULL, having set these variables to an error code and an offset |
| (number of code units) within the pattern, respectively. The |
| pcre2_get_error_message() function provides a textual message for each |
| error code. Compilation errors are positive numbers, but UTF formatting |
| errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the |
| offset is that of the first code unit of the failing character. |
| |
| Some errors are not detected until the whole pattern has been scanned; |
| in these cases, the offset passed back is the length of the pattern. |
| Note that the offset is in code units, not characters, even in a UTF |
| mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- |
| acter. |
| |
| This code fragment shows a typical straightforward call to pcre2_com- |
| pile(): |
| |
| pcre2_code *re; |
| PCRE2_SIZE erroffset; |
| int errorcode; |
| re = pcre2_compile( |
| "^A.*Z", /* the pattern */ |
| PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */ |
| 0, /* default options */ |
| &errorcode, /* for error code */ |
| &erroffset, /* for error offset */ |
| NULL); /* no compile context */ |
| |
| The following names for option bits are defined in the pcre2.h header |
| file: |
| |
| PCRE2_ANCHORED |
| |
| If this bit is set, the pattern is forced to be "anchored", that is, it |
| is constrained to match only at the first matching point in the string |
| that is being searched (the "subject string"). This effect can also be |
| achieved by appropriate constructs in the pattern itself, which is the |
| only way to do it in Perl. |
| |
| PCRE2_ALLOW_EMPTY_CLASS |
| |
| By default, for compatibility with Perl, a closing square bracket that |
| immediately follows an opening one is treated as a data character for |
| the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the |
| class, which therefore contains no characters and so can never match. |
| |
| PCRE2_ALT_BSUX |
| |
| This option requests alternative handling of three escape sequences, |
| which makes PCRE2's behaviour more like ECMAScript (aka JavaScript). |
| When it is set: |
| |
| (1) \U matches an upper case "U" character; by default \U causes a com- |
| pile time error (Perl uses \U to upper case subsequent characters). |
| |
| (2) \u matches a lower case "u" character unless it is followed by four |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, \u causes a compile time error (Perl |
| uses it to upper case the following character). |
| |
| (3) \x matches a lower case "x" character unless it is followed by two |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, as in Perl, a hexadecimal number is |
| always expected after \x, but it may have zero, one, or two digits (so, |
| for example, \xz matches a binary zero character followed by z). |
| |
| PCRE2_ALT_CIRCUMFLEX |
| |
| In multiline mode (when PCRE2_MULTILINE is set), the circumflex |
| metacharacter matches at the start of the subject (unless PCRE2_NOTBOL |
| is set), and also after any internal newline. However, it does not |
| match after a newline at the end of the subject, for compatibility with |
| Perl. If you want a multiline circumflex also to match after a termi- |
| nating newline, you must set PCRE2_ALT_CIRCUMFLEX. |
| |
| PCRE2_ALT_VERBNAMES |
| |
| By default, for compatibility with Perl, the name in any verb sequence |
| such as (*MARK:NAME) is any sequence of characters that does not |
| include a closing parenthesis. The name is not processed in any way, |
| and it is not possible to include a closing parenthesis in the name. |
| However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash |
| processing is applied to verb names and only an unescaped closing |
| parenthesis terminates the name. A closing parenthesis can be included |
| in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED |
| option is set, unescaped whitespace in verb names is skipped and #-com- |
| ments are recognized, exactly as in the rest of the pattern. |
| |
| PCRE2_AUTO_CALLOUT |
| |
| If this bit is set, pcre2_compile() automatically inserts callout |
| items, all with number 255, before each pattern item. For discussion of |
| the callout facility, see the pcre2callout documentation. |
| |
| PCRE2_CASELESS |
| |
| If this bit is set, letters in the pattern match both upper and lower |
| case letters in the subject. It is equivalent to Perl's /i option, and |
| it can be changed within a pattern by a (?i) option setting. |
| |
| PCRE2_DOLLAR_ENDONLY |
| |
| If this bit is set, a dollar metacharacter in the pattern matches only |
| at the end of the subject string. Without this option, a dollar also |
| matches immediately before a newline at the end of the string (but not |
| before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored |
| if PCRE2_MULTILINE is set. There is no equivalent to this option in |
| Perl, and no way to set it within a pattern. |
| |
| PCRE2_DOTALL |
| |
| If this bit is set, a dot metacharacter in the pattern matches any |
| character, including one that indicates a newline. However, it only |
| ever matches one character, even if newlines are coded as CRLF. Without |
| this option, a dot does not match when the current position in the sub- |
| ject is at a newline. This option is equivalent to Perl's /s option, |
| and it can be changed within a pattern by a (?s) option setting. A neg- |
| ative class such as [^a] always matches newline characters, independent |
| of the setting of this option. |
| |
| PCRE2_DUPNAMES |
| |
| If this bit is set, names used to identify capturing subpatterns need |
| not be unique. This can be helpful for certain types of pattern when it |
| is known that only one instance of the named subpattern can ever be |
| matched. There are more details of named subpatterns below; see also |
| the pcre2pattern documentation. |
| |
| PCRE2_EXTENDED |
| |
| If this bit is set, most white space characters in the pattern are |
| totally ignored except when escaped or inside a character class. How- |
| ever, white space is not allowed within sequences such as (?> that |
| introduce various parenthesized subpatterns, nor within numerical quan- |
| tifiers such as {1,3}. Ignorable white space is permitted between an |
| item and a following quantifier and between a quantifier and a follow- |
| ing + that indicates possessiveness. |
| |
| PCRE2_EXTENDED also causes characters between an unescaped # outside a |
| character class and the next newline, inclusive, to be ignored, which |
| makes it possible to include comments inside complicated patterns. Note |
| that the end of this type of comment is a literal newline sequence in |
| the pattern; escape sequences that happen to represent a newline do not |
| count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be |
| changed within a pattern by a (?x) option setting. |
| |
| Which characters are interpreted as newlines can be specified by a set- |
| ting in the compile context that is passed to pcre2_compile() or by a |
| special sequence at the start of the pattern, as described in the sec- |
| tion entitled "Newline conventions" in the pcre2pattern documentation. |
| A default is defined when PCRE2 is built. |
| |
| PCRE2_FIRSTLINE |
| |
| If this option is set, an unanchored pattern is required to match |
| before or at the first newline in the subject string, though the |
| matched text may continue over the newline. See also PCRE2_USE_OFF- |
| SET_LIMIT, which provides a more general limiting facility. If |
| PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the |
| first line and also within the offset limit. In other words, whichever |
| limit comes first is used. |
| |
| PCRE2_MATCH_UNSET_BACKREF |
| |
| If this option is set, a back reference to an unset subpattern group |
| matches an empty string (by default this causes the current matching |
| alternative to fail). A pattern such as (\1)(a) succeeds when this |
| option is set (assuming it can find an "a" in the subject), whereas it |
| fails by default, for Perl compatibility. Setting this option makes |
| PCRE2 behave more like ECMAScript (aka JavaScript). |
| |
| PCRE2_MULTILINE |
| |
| By default, for the purposes of matching "start of line" and "end of |
| line", PCRE2 treats the subject string as consisting of a single line |
| of characters, even if it actually contains newlines. The "start of |
| line" metacharacter (^) matches only at the start of the string, and |
| the "end of line" metacharacter ($) matches only at the end of the |
| string, or before a terminating newline (except when PCRE2_DOL- |
| LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set, |
| the "any character" metacharacter (.) does not match at a newline. This |
| behaviour (for ^, $, and dot) is the same as Perl. |
| |
| When PCRE2_MULTILINE is set, the "start of line" and "end of line" |
| constructs match immediately following or immediately before internal |
| newlines in the subject string, respectively, as well as at the very |
| start and end. This is equivalent to Perl's /m option, and it can be |
| changed within a pattern by a (?m) option setting. Note that the "start |
| of line" metacharacter does not match after a newline at the end of the |
| subject, for compatibility with Perl. However, you can change this by |
| setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a |
| subject string, or no occurrences of ^ or $ in a pattern, setting |
| PCRE2_MULTILINE has no effect. |
| |
| PCRE2_NEVER_BACKSLASH_C |
| |
| This option locks out the use of \C in the pattern that is being com- |
| piled. This escape can cause unpredictable behaviour in UTF-8 or |
| UTF-16 modes, because it may leave the current matching point in the |
| middle of a multi-code-unit character. This option may be useful in |
| applications that process patterns from external sources. Note that |
| there is also a build-time option that permanently locks out the use of |
| \C. |
| |
| PCRE2_NEVER_UCP |
| |
| This option locks out the use of Unicode properties for handling \B, |
| \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as |
| described for the PCRE2_UCP option below. In particular, it prevents |
| the creator of the pattern from enabling this facility by starting the |
| pattern with (*UCP). This option may be useful in applications that |
| process patterns from external sources. The option combination |
| PCRE2_UCP and PCRE2_NEVER_UCP causes an error. |
| |
| PCRE2_NEVER_UTF |
| |
| This option locks out interpretation of the pattern as UTF-8, UTF-16, |
| or UTF-32, depending on which library is in use. In particular, it pre- |
| vents the creator of the pattern from switching to UTF interpretation |
| by starting the pattern with (*UTF). This option may be useful in |
| applications that process patterns from external sources. The combina- |
| tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error. |
| |
| PCRE2_NO_AUTO_CAPTURE |
| |
| If this option is set, it disables the use of numbered capturing paren- |
| theses in the pattern. Any opening parenthesis that is not followed by |
| ? behaves as if it were followed by ?: but named parentheses can still |
| be used for capturing (and they acquire numbers in the usual way). |
| There is no equivalent of this option in Perl. |
| |
| PCRE2_NO_AUTO_POSSESS |
| |
| If this option is set, it disables "auto-possessification", which is an |
| optimization that, for example, turns a+b into a++b in order to avoid |
| backtracks into a+ that can never be successful. However, if callouts |
| are in use, auto-possessification means that some callouts are never |
| taken. You can set this option if you want the matching functions to do |
| a full unoptimized search and run all the callouts, but it is mainly |
| provided for testing purposes. |
| |
| PCRE2_NO_DOTSTAR_ANCHOR |
| |
| If this option is set, it disables an optimization that is applied when |
| .* is the first significant item in a top-level branch of a pattern, |
| and all the other branches also start with .* or with \A or \G or ^. |
| The optimization is automatically disabled for .* if it is inside an |
| atomic group or a capturing group that is the subject of a back refer- |
| ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti- |
| mization is not disabled, such a pattern is automatically anchored if |
| PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set |
| for any ^ items. Otherwise, the fact that any match must start either |
| at the start of the subject or following a newline is remembered. Like |
| other optimizations, this can cause callouts to be skipped. |
| |
| PCRE2_NO_START_OPTIMIZE |
| |
| This is an option whose main effect is at matching time. It does not |
| change what pcre2_compile() generates, but it does affect the output of |
| the JIT compiler. |
| |
| There are a number of optimizations that may occur at the start of a |
| match, in order to speed up the process. For example, if it is known |
| that an unanchored match must start with a specific character, the |
| matching code searches the subject for that character, and fails imme- |
| diately if it cannot find it, without actually running the main match- |
| ing function. This means that a special item such as (*COMMIT) at the |
| start of a pattern is not considered until after a suitable starting |
| point for the match has been found. Also, when callouts or (*MARK) |
| items are in use, these "start-up" optimizations can cause them to be |
| skipped if the pattern is never actually used. The start-up optimiza- |
| tions are in effect a pre-scan of the subject that takes place before |
| the pattern is run. |
| |
| The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, |
| possibly causing performance to suffer, but ensuring that in cases |
| where the result is "no match", the callouts do occur, and that items |
| such as (*COMMIT) and (*MARK) are considered at every possible starting |
| position in the subject string. |
| |
| Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching |
| operation. Consider the pattern |
| |
| (*COMMIT)ABC |
| |
| When this is compiled, PCRE2 records the fact that a match must start |
| with the character "A". Suppose the subject string is "DEFABC". The |
| start-up optimization scans along the subject, finds "A" and runs the |
| first match attempt from there. The (*COMMIT) item means that the pat- |
| tern must match the current starting position, which in this case, it |
| does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE |
| set, the initial scan along the subject string does not happen. The |
| first match attempt is run starting from "D" and when this fails, |
| (*COMMIT) prevents any further matches being tried, so the overall |
| result is "no match". There are also other start-up optimizations. For |
| example, a minimum length for the subject may be recorded. Consider the |
| pattern |
| |
| (*MARK:A)(X|Y) |
| |
| The minimum length for a match is one character. If the subject is |
| "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt |
| to match an empty string at the end of the subject does not take place, |
| because PCRE2 knows that the subject is now too short, and so the |
| (*MARK) is never encountered. In this case, the optimization does not |
| affect the overall match result, which is still "no match", but it does |
| affect the auxiliary information that is returned. |
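| |
| The effect on the (*COMMIT)ABC example can be illustrated with a toy |
| model. The sketch below is not PCRE2 code and calls no PCRE2 function; |
| the name toy_commit_match() and its parameters are hypothetical. It |
| assumes a pattern consisting of (*COMMIT) followed by a literal string, |
| and models the start-up optimization as a search for the first literal |
| character. |
| |

```c
#include <stddef.h>
#include <string.h>

/* Toy model of matching (*COMMIT) followed by `literal`; not PCRE2 code.
   Returns the offset of the match in `subject`, or -1 for "no match". */
long toy_commit_match(const char *subject, const char *literal,
                      int start_optimize)
{
    size_t sublen = strlen(subject);
    size_t litlen = strlen(literal);
    size_t start = 0;

    if (litlen == 0) return 0;      /* an empty literal matches at offset 0 */

    if (start_optimize) {
        /* Pre-scan: move to the first occurrence of the required first
           character before running any match attempt. */
        const char *p = memchr(subject, literal[0], sublen);
        if (p == NULL) return -1;   /* fail without running the matcher */
        start = (size_t)(p - subject);
    }

    /* One real match attempt. (*COMMIT) is passed immediately, so a
       failure here forbids a retry at any later starting position. */
    if (sublen - start >= litlen &&
        memcmp(subject + start, literal, litlen) == 0)
        return (long)start;
    return -1;
}
```

| In this model, the subject "DEFABC" matches at offset 3 when the |
| pre-scan is enabled and yields "no match" when it is disabled, which |
| mirrors the behaviour described for PCRE2_NO_START_OPTIMIZE. |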
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set, the validity of the pattern as a UTF string is |
| automatically checked. There are discussions about the validity of |
| UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode |
| document. If an invalid UTF sequence is found, pcre2_compile() returns |
| a negative error code. |
| |
| If you know that your pattern is valid, and you want to skip this check |
| for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. |
| When it is set, the effect of passing an invalid UTF string as a pat- |
| tern is undefined. It may cause your program to crash or loop. Note |
| that this option can also be passed to pcre2_match() and |
| pcre2_dfa_match(), to suppress validity checking of the subject string. |
| |
| PCRE2_UCP |
| |
| This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, |
| \w, and some of the POSIX character classes. By default, only ASCII |
| characters are recognized, but if PCRE2_UCP is set, Unicode properties |
| are used instead to classify characters. More details are given in the |
| section on generic character types in the pcre2pattern page. If you set |
| PCRE2_UCP, matching one of the items it affects takes much longer. The |
| option is available only if PCRE2 has been compiled with Unicode sup- |
| port. |
| |
| PCRE2_UNGREEDY |
| |
| This option inverts the "greediness" of the quantifiers so that they |
| are not greedy by default, but become greedy if followed by "?". It is |
| not compatible with Perl. It can also be set by a (?U) option setting |
| within the pattern. |
| |
| PCRE2_USE_OFFSET_LIMIT |
| |
| This option must be set for pcre2_compile() if pcre2_set_offset_limit() |
| is going to be used to set a non-default offset limit in a match con- |
| text for matches that use this pattern. An error is generated if an |
| offset limit is set without this option. For more details, see the |
| description of pcre2_set_offset_limit() in the section that describes |
| match contexts. See also the PCRE2_FIRSTLINE option above. |
| |
| PCRE2_UTF |
| |
| This option causes PCRE2 to regard both the pattern and the subject |
| strings that are subsequently processed as strings of UTF characters |
| instead of single-code-unit strings. It is available when PCRE2 is |
| built to include Unicode support (which is the default). If Unicode |
| support is not available, the use of this option provokes an error. |
| Details of how this option changes the behaviour of PCRE2 are given in |
| the pcre2unicode page. |
| |
| |
| COMPILATION ERROR CODES |
| |
| There are over 80 positive error codes that pcre2_compile() may return |
| if it finds an error in the pattern. There are also some negative error |
| codes that are used for invalid UTF strings. These are the same as |
| given by pcre2_match() and pcre2_dfa_match(), and are described in the |
| pcre2unicode page. The pcre2_get_error_message() function can be called |
| to obtain a textual error message from any error code. |
| |
| |
| JUST-IN-TIME (JIT) COMPILATION |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| These functions provide support for JIT compilation, which, if the |
| just-in-time compiler is available, further processes a compiled pat- |
| tern into machine code that executes much faster than the pcre2_match() |
| interpretive matching function. Full details are given in the pcre2jit |
| documentation. |
| |
| JIT compilation is a heavyweight optimization. It can take some time |
| for patterns to be analyzed, and for one-off matches and simple pat- |
| terns the benefit of faster execution might be offset by a much slower |
| compilation time. Most, but not all, patterns can be optimized by the |
| JIT compiler. |
| |
| |
| LOCALE SUPPORT |
| |
| PCRE2 handles caseless matching, and determines whether characters are |
| letters, digits, or whatever, by reference to a set of tables, indexed |
| by character code point. This applies only to characters whose code |
| points are less than 256. By default, higher-valued code points never |
| match escapes such as \w or \d. However, if PCRE2 is built with UTF |
| support, all characters can be tested with \p and \P, or, alterna- |
| tively, the PCRE2_UCP option can be set when a pattern is compiled; |
| this causes \w and friends to use Unicode property support instead of |
| the built-in tables. |
| |
| The use of locales with Unicode is discouraged. If you are handling |
| characters with code points greater than 128, you should either use |
| Unicode support, or use locales, but not try to mix the two. |
| |
| PCRE2 contains an internal set of character tables that are used by |
| default. These are sufficient for many applications. Normally, the |
| internal tables recognize only ASCII characters. However, when PCRE2 is |
| built, it is possible to cause the internal tables to be rebuilt in the |
| default "C" locale of the local system, which may cause them to be dif- |
| ferent. |
| |
| The internal tables can be overridden by tables supplied by the appli- |
| cation that calls PCRE2. These may be created in a different locale |
| from the default. As more and more applications change to using Uni- |
| code, the need for this locale support is expected to die away. |
| |
| External tables are built by calling the pcre2_maketables() function, |
| in the relevant locale. The result can be passed to pcre2_compile() as |
| often as necessary, by creating a compile context and calling |
| pcre2_set_character_tables() to set the tables pointer therein. For |
| example, to build and use tables that are appropriate for the French |
| locale (where accented characters with values greater than 128 are |
| treated as letters), the following code could be used: |
| |
| setlocale(LC_CTYPE, "fr_FR"); |
| tables = pcre2_maketables(NULL); |
| ccontext = pcre2_compile_context_create(NULL); |
| pcre2_set_character_tables(ccontext, tables); |
| re = pcre2_compile(..., ccontext); |
| |
| The locale name "fr_FR" is used on Linux and other Unix-like systems; |
| if you are using Windows, the name for the French locale is "french". |
| It is the caller's responsibility to ensure that the memory containing |
| the tables remains available for as long as it is needed. |
| |
| The pointer that is passed (via the compile context) to pcre2_compile() |
| is saved with the compiled pattern, and the same tables are used by |
| pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, |
| compilation and matching both happen in the same locale, but different |
| patterns can be processed in different locales. |
| |
| |
| INFORMATION ABOUT A COMPILED PATTERN |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, |
| void *where); |
| |
| The pcre2_pattern_info() function returns general information about a |
| compiled pattern. For information about callouts, see the next section. |
| The first argument for pcre2_pattern_info() is a pointer to the com- |
| piled pattern. The second argument specifies which piece of information |
| is required, and the third argument is a pointer to a variable to |
| receive the data. If the third argument is NULL, the first argument is |
| ignored, and the function returns the size in bytes of the variable |
| that is required for the information requested. Otherwise, the yield of |
| the function is zero for success, or one of the following negative num- |
| bers: |
| |
| PCRE2_ERROR_NULL the argument code was NULL |
| PCRE2_ERROR_BADMAGIC the "magic number" was not found |
| PCRE2_ERROR_BADOPTION the value of what was invalid |
| PCRE2_ERROR_UNSET the requested field is not set |
| |
| The "magic number" is placed at the start of each compiled pattern as |
| a simple check against passing an arbitrary memory pointer. Here is a |
| typical call of pcre2_pattern_info(), to obtain the length of the com- |
| piled pattern: |
| |
| int rc; |
| size_t length; |
| rc = pcre2_pattern_info( |
| re, /* result of pcre2_compile() */ |
| PCRE2_INFO_SIZE, /* what is required */ |
| &length); /* where to put the data */ |
| |
| The possible values for the second argument are defined in pcre2.h, and |
| are as follows: |
| |
| PCRE2_INFO_ALLOPTIONS |
| PCRE2_INFO_ARGOPTIONS |
| |
| Return a copy of the pattern's options. The third argument should point |
| to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the |
| options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP- |
| TIONS returns the compile options as modified by any top-level option |
| settings such as (*UTF) at the start of the pattern itself. For exam- |
| ple, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED |
| option, the result is PCRE2_EXTENDED and PCRE2_UTF. |
| |
| A pattern compiled without PCRE2_ANCHORED is automatically anchored by |
| PCRE2 if the first significant item in every top-level branch is one of |
| the following: |
| |
| ^ unless PCRE2_MULTILINE is set |
| \A always |
| \G always |
| .* sometimes - see below |
| |
| When .* is the first significant item, anchoring is possible only when |
| all the following are true: |
| |
| .* is not in an atomic group |
| .* is not in a capturing group that is the subject |
| of a back reference |
| PCRE2_DOTALL is in force for .* |
| Neither (*PRUNE) nor (*SKIP) appears in the pattern. |
| PCRE2_NO_DOTSTAR_ANCHOR is not set. |
| |
| For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in |
| the options returned for PCRE2_INFO_ALLOPTIONS. |
| |
| PCRE2_INFO_BACKREFMAX |
| |
| Return the number of the highest back reference in the pattern. The |
| third argument should point to a uint32_t variable. Named subpatterns |
| acquire numbers as well as names, and these count towards the highest |
| back reference. Back references such as \4 or \g{12} match the cap- |
| tured characters of the given group, but in addition, the check that a |
| capturing group is set in a conditional subpattern such as (?(3)a|b) is |
| also a back reference. Zero is returned if there are no back refer- |
| ences. |
| |
| PCRE2_INFO_BSR |
| |
| The output is a uint32_t whose value indicates what character sequences |
| the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that |
| \R matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY- |
| CRLF means that \R matches only CR, LF, or CRLF. |
| |
| PCRE2_INFO_CAPTURECOUNT |
| |
| Return the highest capturing subpattern number in the pattern. In pat- |
| terns where (?| is not used, this is also the total number of capturing |
| subpatterns. The third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_FIRSTBITMAP |
| |
| In the absence of a single first code unit for a non-anchored pattern, |
| pcre2_compile() may construct a 256-bit table that defines a fixed set |
| of values for the first code unit in any match. For example, a pattern |
| that starts with [abc] results in a table with three bits set. When |
| code unit values greater than 255 are supported, the flag bit for 255 |
| means "any code unit of value 255 or above". If such a table was con- |
| structed, a pointer to it is returned. Otherwise NULL is returned. The |
| third argument should point to a const uint8_t * variable. |
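| |
| A table of this kind can be tested with a small helper. The sketch |
| below is an illustration only: the function names are hypothetical, |
| and the bit layout (bit c % 8 of byte c / 8, least significant bit |
| first) is an assumption that is not stated in this document. |
| |

```c
#include <stdint.h>

/* Returns nonzero if code unit `c` is permitted as the first code unit
   of a match according to a 32-byte (256-bit) bitmap.
   ASSUMPTION: bit (c % 8) of byte (c / 8), least significant bit first.
   Values above 255 share the flag bit for 255. */
int first_unit_allowed(const uint8_t *bitmap, uint32_t c)
{
    if (c > 255) c = 255;
    return (bitmap[c / 8] >> (c % 8)) & 1;
}

/* Sets the flag for `c` in a hand-built bitmap using the same assumed
   layout; useful only for experiments and tests. */
void first_unit_set(uint8_t *bitmap, uint32_t c)
{
    if (c > 255) c = 255;
    bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
}
```

| With the flags for "a", "b", and "c" set, as for a pattern starting |
| with [abc], first_unit_allowed() accepts those three code units and |
| rejects "d". |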
| |
| PCRE2_INFO_FIRSTCODETYPE |
| |
| Return information about the first code unit of any matched string, for |
| a non-anchored pattern. The third argument should point to a uint32_t |
| variable. If there is a fixed first value, for example, the letter "c" |
| from a pattern such as (cat|cow|coyote), 1 is returned, and the charac- |
| ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is |
| no fixed first value, but it is known that a match can occur only at |
| the start of the subject or following a newline in the subject, 2 is |
| returned. Otherwise, and for anchored patterns, 0 is returned. |
| |
| PCRE2_INFO_FIRSTCODEUNIT |
| |
| Return the value of the first code unit of any matched string in the |
| situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. |
| The third argument should point to a uint32_t variable. In the 8-bit |
| library, the value is always less than 256. In the 16-bit library the |
| value can be up to 0xffff. In the 32-bit library in UTF-32 mode the |
| value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 |
| mode. |
| |
| PCRE2_INFO_HASBACKSLASHC |
| |
| Return 1 if the pattern contains any instances of \C, otherwise 0. The |
| third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_HASCRORLF |
| |
| Return 1 if the pattern contains any explicit matches for CR or LF |
| characters, otherwise 0. The third argument should point to a uint32_t |
| variable. An explicit match is either a literal CR or LF character, or |
| \r or \n. |
| |
| PCRE2_INFO_JCHANGED |
| |
| Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
| otherwise 0. The third argument should point to a uint32_t variable. |
| (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec- |
| tively. |
| |
| PCRE2_INFO_JITSIZE |
| |
| If the compiled pattern was successfully processed by pcre2_jit_com- |
| pile(), return the size of the JIT compiled code, otherwise return |
| zero. The third argument should point to a size_t variable. |
| |
| PCRE2_INFO_LASTCODETYPE |
| |
| Return 1 if there is a rightmost literal code unit that must exist in |
| any matched string, other than at its start. The third argument should |
| point to a uint32_t variable. If there is no such value, 0 is |
| returned. When 1 is returned, the code unit value itself can be |
| retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last |
| literal value is recorded only if it follows something of variable |
| length. For example, for the pattern /^a\d+z\d+/ the returned value is |
| 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ |
| the returned value is 0. |
| |
| PCRE2_INFO_LASTCODEUNIT |
| |
| Return the value of the rightmost literal code unit that must exist in |
| any matched string, other than at its start, if such a value has been |
| recorded. The third argument should point to a uint32_t variable. If |
| there is no such value, 0 is returned. |
| |
| PCRE2_INFO_MATCHEMPTY |
| |
| Return 1 if the pattern might match an empty string, otherwise 0. The |
| third argument should point to a uint32_t variable. When a pattern |
| contains recursive subroutine calls it is not always possible to deter- |
| mine whether or not it can match an empty string. PCRE2 takes a cau- |
| tious approach and returns 1 in such cases. |
| |
| PCRE2_INFO_MATCHLIMIT |
| |
| If the pattern set a match limit by including an item of the form |
| (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third |
| argument should point to an unsigned 32-bit integer. If no such value |
| has been set, the call to pcre2_pattern_info() returns the error |
| PCRE2_ERROR_UNSET. |
| |
| PCRE2_INFO_MAXLOOKBEHIND |
| |
| Return the number of characters (not code units) in the longest lookbe- |
| hind assertion in the pattern. The third argument should point to an |
| unsigned 32-bit integer. This information is useful when doing multi- |
| segment matching using the partial matching facilities. Note that the |
| simple assertions \b and \B require a one-character lookbehind. \A also |
| registers a one-character lookbehind, though it does not actually |
| inspect the previous character. This is to ensure that at least one |
| character from the old segment is retained when a new segment is pro- |
| cessed. Otherwise, if there are no lookbehinds in the pattern, \A might |
| match incorrectly at the start of a new segment. |
| |
| PCRE2_INFO_MINLENGTH |
| |
| If a minimum length for matching subject strings was computed, its |
| value is returned. Otherwise the returned value is 0. The value is a |
| number of characters, which in UTF mode may be different from the |
| number of code units. The third argument should point to a uint32_t |
| variable. The value is a lower bound to the length of any matching |
| string. There may not be any strings of that length that do actually |
| match, but every string that does match is at least that long. |
| |
| PCRE2_INFO_NAMECOUNT |
| PCRE2_INFO_NAMEENTRYSIZE |
| PCRE2_INFO_NAMETABLE |
| |
| PCRE2 supports the use of named as well as numbered capturing parenthe- |
| ses. The names are just an additional way of identifying the parenthe- |
| ses, which still acquire numbers. Several convenience functions such as |
| pcre2_substring_get_byname() are provided for extracting captured sub- |
| strings by name. It is also possible to extract the data directly, by |
| first converting the name to a number in order to access the correct |
| pointers in the output vector (described with pcre2_match() below). To |
| do the conversion, you need to use the name-to-number map, which is |
| described by these three values. |
| |
| The map consists of a number of fixed-size entries. PCRE2_INFO_NAME- |
| COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives |
| the size of each entry in code units; both of these return a uint32_t |
| value. The entry size depends on the length of the longest name. |
| |
| PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. |
| This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit |
| library, the first two bytes of each entry are the number of the cap- |
| turing parenthesis, most significant byte first. In the 16-bit library, |
| the pointer points to 16-bit code units, the first of which contains |
| the parenthesis number. In the 32-bit library, the pointer points to |
| 32-bit code units, the first of which contains the parenthesis number. |
| The rest of the entry is the corresponding name, zero terminated. |
| |
| The names are in alphabetical order. If (?| is used to create multiple |
| groups with the same number, as described in the section on duplicate |
| subpattern numbers in the pcre2pattern page, the groups may be given |
| the same name, but there is only one entry in the table. Different |
| names for groups of the same number are not permitted. |
| |
| Duplicate names for subpatterns with different numbers are permitted, |
| but only if PCRE2_DUPNAMES is set. They appear in the table in the |
| order in which they were found in the pattern. In the absence of (?| |
| this is the order of increasing number; when (?| is used this is not |
| necessarily the case because later subpatterns may have lower numbers. |
| |
| As a simple example of the name/number table, consider the following |
| pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED |
| is set, so white space - including newlines - is ignored): |
| |
| (?<date> (?<year>(\d\d)?\d\d) - |
| (?<month>\d\d) - (?<day>\d\d) ) |
| |
| There are four named subpatterns, so the table has four entries, and |
| each entry in the table is eight bytes long. The table is as follows, |
| with non-printing bytes shown in hexadecimal, and undefined bytes shown |
| as ??: |
| |
| 00 01 d a t e 00 ?? |
| 00 05 d a y 00 ?? ?? |
| 00 04 m o n t h 00 |
| 00 02 y e a r 00 ?? |
| |
| When writing code to extract data from named subpatterns using the |
| name-to-number map, remember that the length of the entries is likely |
| to be different for each compiled pattern. |
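| |
| The 8-bit table layout described above can be walked with a few lines |
| of C. This is a minimal sketch: name_to_number() is a hypothetical |
| helper, not a PCRE2 function, and it expects the caller to have |
| obtained the table pointer, entry count, and entry size already. It |
| uses a linear scan and returns the first entry with the given name. |
| |

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit name-to-number table.
   table      : first entry (the PCRE2_INFO_NAMETABLE value)
   count      : number of entries (the PCRE2_INFO_NAMECOUNT value)
   entry_size : bytes per entry (the PCRE2_INFO_NAMEENTRYSIZE value)
   Returns the group number, or -1 if the name is not in the table. */
int name_to_number(const uint8_t *table, uint32_t count,
                   uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *entry = table + (size_t)i * entry_size;
        /* Bytes 0-1: group number, most significant byte first.
           Bytes 2..: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

| Applied to the four-entry table shown above with an entry size of 8, |
| a lookup of "month" yields 4 and a lookup of "day" yields 5. Because |
| the entry size is a parameter, the same helper works for patterns |
| whose longest name has a different length. |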
| |
| PCRE2_INFO_NEWLINE |
| |
| The output is a uint32_t with one of the following values: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| |
| This specifies the default character sequence that will be recognized |
| as meaning "newline" while matching. |
| |
| PCRE2_INFO_RECURSIONLIMIT |
| |
| If the pattern set a recursion limit by including an item of the form |
| (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third |
| argument should point to an unsigned 32-bit integer. If no such value |
| has been set, the call to pcre2_pattern_info() returns the error |
| PCRE2_ERROR_UNSET. |
| |
| PCRE2_INFO_SIZE |
| |
| Return the size of the compiled pattern in bytes (for all three |
| libraries). The third argument should point to a size_t variable. This |
| value includes the size of the general data block that precedes the |
| code units of the compiled pattern itself. The value that is used when |
| pcre2_compile() is getting memory in which to place the compiled pat- |
| tern may be slightly larger than the value returned by this option, |
| because there are cases where the code that calculates the size has to |
| over-estimate. Processing a pattern with the JIT compiler does not |
| alter the value returned by this option. |
| |
| |
| INFORMATION ABOUT A PATTERN'S CALLOUTS |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| A script language that supports the use of string arguments in callouts |
| might like to scan all the callouts in a pattern before running the |
| match. This can be done by calling pcre2_callout_enumerate(). The first |
| argument is a pointer to a compiled pattern, the second points to a |
| callback function, and the third is arbitrary user data. The callback |
| function is called for every callout in the pattern in the order in |
| which they appear. Its first argument is a pointer to a callout enumer- |
| ation block, and its second argument is the user_data value that was |
| passed to pcre2_callout_enumerate(). The contents of the callout enu- |
| meration block are described in the pcre2callout documentation, which |
| also gives further details about callouts. |
| |
| |
| SERIALIZATION AND PRECOMPILING |
| |
| It is possible to save compiled patterns on disc or elsewhere, and |
| reload them later, subject to a number of restrictions. The functions |
| whose names begin with pcre2_serialize_ are used for this purpose. They |
| are described in the pcre2serialize documentation. |
| |
| |
| THE MATCH DATA BLOCK |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| Information about a successful or unsuccessful match is placed in a |
| match data block, which is an opaque structure that is accessed by |
| function calls. In particular, the match data block contains a vector |
| of offsets into the subject string that define the matched part of the |
| subject and any substrings that were captured. This is known as the |
| ovector. |
| |
| Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match() |
| you must create a match data block by calling one of the creation func- |
| tions above. For pcre2_match_data_create(), the first argument is the |
| number of pairs of offsets in the ovector. One pair of offsets is |
| required to identify the string that matched the whole pattern, with |
| another pair for each captured substring. For example, a value of 4 |
| creates enough space to record the matched portion of the subject plus |
| three captured substrings. A minimum of 1 pair is imposed by |
| pcre2_match_data_create(), so it is always possible to return the over- |
| all matched string. |
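| |
| The sizing rule can be written down as simple arithmetic. The helper |
| names below are hypothetical and the functions do not call PCRE2; they |
| only restate the rule above, on the assumption that capture_count is |
| the highest capturing group number (the PCRE2_INFO_CAPTURECOUNT |
| value). |
| |

```c
#include <stdint.h>

/* Offset pairs needed to hold every substring a pattern can capture:
   one pair for the whole match plus one per capturing group. */
uint32_t ovector_pairs_needed(uint32_t capture_count)
{
    return capture_count + 1;
}

/* Pair count in effect for a requested size, given the minimum of one
   pair. Only the minimum described above is modelled here. */
uint32_t ovector_pairs_effective(uint32_t requested)
{
    return requested == 0 ? 1 : requested;
}
```

| For a pattern with three capturing groups, ovector_pairs_needed() |
| gives 4, the value used in the example above. |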
| |
| The second argument of pcre2_match_data_create() is a pointer to a gen- |
| eral context, which can specify custom memory management for obtaining |
| the memory for the match data block. If you are not using custom memory |
| management, pass NULL, which causes malloc() to be used. |
| |
| For pcre2_match_data_create_from_pattern(), the first argument is a |
| pointer to a compiled pattern. The ovector is created to be exactly the |
| right size to hold all the substrings a pattern might capture. The sec- |
| ond argument is again a pointer to a general context, but in this case |
| if NULL is passed, the memory is obtained using the same allocator that |
| was used for the compiled pattern (custom or default). |
| |
| A match data block can be used many times, with the same or different |
| compiled patterns. You can extract information from a match data block |
| after a match operation has finished, using functions that are |
| described in the sections on matched strings and other match data |
| below. |
| |
| When a call of pcre2_match() fails, valid data is available in the |
| match block only when the error is PCRE2_ERROR_NOMATCH, |
| PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF |
| string. Exactly what is available depends on the error, and is detailed |
| below. |
| |
| When one of the matching functions is called, pointers to the compiled |
| pattern and the subject string are set in the match data block so that |
| they can be referenced by the extraction functions. After running a |
| match, you must not free a compiled pattern or a subject string until |
| after all operations on the match data block (for that match) have |
| taken place. |
| |
| When a match data block itself is no longer needed, it should be freed |
| by calling pcre2_match_data_free(). |
| |
| |
| MATCHING A PATTERN: THE TRADITIONAL FUNCTION |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| The function pcre2_match() is called to match a subject string against |
| a compiled pattern, which is passed in the code argument. You can call |
| pcre2_match() with the same code argument as many times as you like, in |
| order to find multiple matches in the subject string or to match dif- |
| ferent subject strings with the same pattern. |
| |
| This function is the main matching facility of the library, and it |
| operates in a Perl-like manner. For specialist use there is also an |
| alternative matching function, which is described below in the section |
| about the pcre2_dfa_match() function. |
| |
| Here is an example of a simple call to pcre2_match(): |
| |
| pcre2_match_data *match_data = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_match( |
| re, /* result of pcre2_compile() */ |
| (PCRE2_SPTR)"some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| match_data, /* the match data block */ |
| NULL); /* a match context; NULL means use defaults */ |
| |
| If the subject string is zero-terminated, the length can be given as |
| PCRE2_ZERO_TERMINATED. A match context must be provided if certain less |
| common matching parameters are to be changed. For details, see the sec- |
| tion on the match context above. |
| |
| The string to be matched by pcre2_match() |
| |
| The subject string is passed to pcre2_match() as a pointer in subject, |
| a length in length, and a starting offset in startoffset. The length |
| and offset are in code units, not characters. That is, they are in |
| bytes for the 8-bit library, 16-bit code units for the 16-bit library, |
| and 32-bit code units for the 32-bit library, whether or not UTF pro- |
| cessing is enabled. |
| |
| If startoffset is greater than the length of the subject, pcre2_match() |
| returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the |
| search for a match starts at the beginning of the subject, and this is |
| by far the most common case. In UTF-8 or UTF-16 mode, the starting off- |
| set must point to the start of a character, or to the end of the sub- |
| ject (in UTF-32 mode, one code unit equals one character, so all off- |
| sets are valid). Like the pattern string, the subject may contain |
| binary zeroes. |
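| |
| For the 8-bit library in UTF-8 mode, an application can check a |
| candidate starting offset before calling pcre2_match(). The helper |
| below is a sketch with a hypothetical name; it assumes the subject is |
| valid UTF-8 and relies on the fact that UTF-8 continuation bytes have |
| the binary form 10xxxxxx. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `offset` is usable as a starting offset for a UTF-8
   subject of `length` bytes: it must be the end of the subject or must
   not point at a continuation byte (binary 10xxxxxx). Returns 0
   otherwise, including when offset is greater than length. */
int utf8_offset_is_char_start(const uint8_t *subject, size_t length,
                              size_t offset)
{
    if (offset > length) return 0;
    if (offset == length) return 1;
    return (subject[offset] & 0xC0) != 0x80;
}
```

| For the four-byte subject "a", 0xC3, 0xA9, "b" (an "e" with an acute |
| accent between two ASCII letters), offsets 0, 1, 3, and 4 are |
| accepted, and offset 2 is rejected because it falls inside the |
| two-byte character. |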
| |
| A non-zero starting offset is useful when searching for another match |
| in the same subject by calling pcre2_match() again after a previous |
| success. Setting startoffset differs from passing over a shortened |
| string and setting PCRE2_NOTBOL in the case of a pattern that begins |
| with any kind of lookbehind. For example, consider the pattern |
| |
| \Biss\B |
| |
| which finds occurrences of "iss" in the middle of words. (\B matches |
| only if the current position in the subject is not a word boundary.) |
| When applied to the string "Mississippi" the first call to pcre2_match() |
| finds the first occurrence. If pcre2_match() is called again with just |
| the remainder of the subject, namely "issippi", it does not match, |
| because \B is always false at the start of the subject, which is deemed |
| to be a word boundary. However, if pcre2_match() is passed the entire |
| string again, but with startoffset set to 4, it finds the second occur- |
| rence of "iss" because it is able to look behind the starting point to |
| discover that it is preceded by a letter. |
| |
| Finding all the matches in a subject is tricky when the pattern can |
| match an empty string. It is possible to emulate Perl's /g behaviour by |
| first trying the match again at the same offset, with the |
| PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that |
| fails, advancing the starting offset and trying an ordinary match |
| again. There is some code that demonstrates how to do this in the |
| pcre2demo sample program. In the most general case, you have to check |
| to see if the newline convention recognizes CRLF as a newline, and if |
| so, and the current character is CR followed by LF, advance the start- |
| ing offset by two characters instead of one. |
| |
| If a non-zero starting offset is passed when the pattern is anchored, |
| one attempt to match at the given offset is made. This can only succeed |
| if the pattern does not require the match to be at the start of the |
| subject. |
| |
| Option bits for pcre2_match() |
| |
| The unused bits of the options argument for pcre2_match() must be zero. |
| The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, |
| PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their |
| action is described below. |
| |
| Setting PCRE2_ANCHORED at match time is not supported by the just-in- |
| time (JIT) compiler. If it is set, JIT matching is disabled and the |
| normal interpretive code in pcre2_match() is run. The remaining options |
| are supported for JIT matching. |
| |
| PCRE2_ANCHORED |
| |
| The PCRE2_ANCHORED option limits pcre2_match() to matching at the first |
| matching position. If a pattern was compiled with PCRE2_ANCHORED, or |
| turned out to be anchored by virtue of its contents, it cannot be made |
| unanchored at matching time. Note that setting the option at match time |
| disables JIT matching. |
| |
| PCRE2_NOTBOL |
| |
| This option specifies that the first character of the subject string is |
| not the beginning of a line, so the circumflex metacharacter should not |
| match before it. Setting this without having set PCRE2_MULTILINE at |
| compile time causes circumflex never to match. This option affects only |
| the behaviour of the circumflex metacharacter. It does not affect \A. |
| |
| PCRE2_NOTEOL |
| |
| This option specifies that the end of the subject string is not the end |
| of a line, so the dollar metacharacter should not match it nor (except |
| in multiline mode) a newline immediately before it. Setting this with- |
| out having set PCRE2_MULTILINE at compile time causes dollar never to |
| match. This option affects only the behaviour of the dollar metacharac- |
| ter. It does not affect \Z or \z. |
| |
| PCRE2_NOTEMPTY |
| |
| An empty string is not considered to be a valid match if this option is |
| set. If there are alternatives in the pattern, they are tried. If all |
| the alternatives match the empty string, the entire match fails. For |
| example, if the pattern |
| |
| a?b? |
| |
| is applied to a string not beginning with "a" or "b", it matches an |
| empty string at the start of the subject. With PCRE2_NOTEMPTY set, this |
| match is not valid, so pcre2_match() searches further into the string |
| for occurrences of "a" or "b". |
| |
| PCRE2_NOTEMPTY_ATSTART |
| |
| This is like PCRE2_NOTEMPTY, except that it locks out an empty string |
| match only at the first matching position, that is, at the start of the |
| subject plus the starting offset. An empty string match later in the |
| subject is permitted. If the pattern is anchored, such a match can |
| occur only if the pattern contains \K. |
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set at compile time, the validity of the subject as a |
| UTF string is checked by default when pcre2_match() is subsequently |
| called. If a non-zero starting offset is given, the check is applied |
| only to that part of the subject that could be inspected during match- |
| ing, and there is a check that the starting offset points to the first |
| code unit of a character or to the end of the subject. If there are no |
| lookbehind assertions in the pattern, the check starts at the starting |
| offset. Otherwise, it starts at the length of the longest lookbehind |
| before the starting offset, or at the start of the subject if there are |
| not that many characters before the starting offset. Note that the |
| sequences \b and \B are one-character lookbehinds. |
| |
| The check is carried out before any other processing takes place, and a |
| negative error code is returned if the check fails. There are several |
| UTF error codes for each code unit width, corresponding to different |
| problems with the code unit sequence. There are discussions about the |
| validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the |
| pcre2unicode page. |
| |
| If you know that your subject is valid, and you want to skip these |
| checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK |
| option when calling pcre2_match(). You might want to do this for the |
| second and subsequent calls to pcre2_match() if you are making repeated |
| calls to find all the matches in a single subject string. |
| |
| NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid |
| string as a subject, or an invalid value of startoffset, is undefined. |
| Your program may crash or loop indefinitely. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These options turn on the partial matching feature. A partial match |
| occurs if the end of the subject string is reached successfully, but |
| there are not enough subject characters to complete the match. If this |
| happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, |
| matching continues by testing any remaining alternatives. Only if no |
| complete match can be found is PCRE2_ERROR_PARTIAL returned instead of |
| PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that |
| the caller is prepared to handle a partial match, but only if no com- |
| plete match can be found. |
| |
| If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this |
| case, if a partial match is found, pcre2_match() immediately returns |
| PCRE2_ERROR_PARTIAL, without considering any other alternatives. In |
| other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid- |
| ered to be more important than an alternative complete match. |
| |
| There is a more detailed discussion of partial and multi-segment match- |
| ing, with examples, in the pcre2partial documentation. |
| |
| |
| NEWLINE HANDLING WHEN MATCHING |
| |
| When PCRE2 is built, a default newline convention is set; this is usu- |
| ally the standard convention for the operating system. The default can |
| be overridden in a compile context by calling pcre2_set_newline(). It |
| can also be overridden by starting a pattern string with, for example, |
| (*CRLF), as described in the section on newline conventions in the |
| pcre2pattern page. During matching, the newline choice affects the be- |
| haviour of the dot, circumflex, and dollar metacharacters. It may also |
| alter the way the match starting position is advanced after a match |
| failure for an unanchored pattern. |
| |
| When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is |
| set as the newline convention, and a match attempt for an unanchored |
| pattern fails when the current starting position is at a CRLF sequence, |
| and the pattern contains no explicit matches for CR or LF characters, |
| the match position is advanced by two characters instead of one, in |
| other words, to after the CRLF. |
| |
| The above rule is a compromise that makes the most common cases work as |
| expected. For example, if the pattern is .+A (and the PCRE2_DOTALL |
| option is not set), it does not match the string "\r\nA" because, after |
| failing at the start, it skips both the CR and the LF before retrying. |
| However, the pattern [\r\n]A does match that string, because it con- |
| tains an explicit CR or LF reference, and so advances only by one char- |
| acter after the first failure. |
| |
| An explicit match for CR or LF is either a literal appearance of one of |
| those characters in the pattern, or one of the \r or \n escape |
| sequences. Implicit matches such as [^X] do not count, nor does \s, |
| even though it includes CR and LF in the characters that it matches. |
| |
| Notwithstanding the above, anomalous effects may still occur when CRLF |
| is a valid newline sequence and explicit \r or \n escapes appear in the |
| pattern. |
| |
| |
| HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| In general, a pattern matches a certain portion of the subject, and in |
| addition, further substrings from the subject may be picked out by |
| parenthesized parts of the pattern. Following the usage in Jeffrey |
| Friedl's book, this is called "capturing" in what follows, and the |
| phrase "capturing subpattern" or "capturing group" is used for a frag- |
| ment of a pattern that picks out a substring. PCRE2 supports several |
| other kinds of parenthesized subpattern that do not cause substrings to |
| be captured. The pcre2_pattern_info() function can be used to find out |
| how many capturing subpatterns there are in a compiled pattern. |
| |
| You can use auxiliary functions for accessing captured substrings by |
| number or by name, as described in sections below. |
| |
| Alternatively, you can make direct use of the vector of PCRE2_SIZE val- |
| ues, called the ovector, which contains the offsets of captured |
| strings. It is part of the match data block. The function |
| pcre2_get_ovector_pointer() returns the address of the ovector, and |
| pcre2_get_ovector_count() returns the number of pairs of values it con- |
| tains. |
| |
| Within the ovector, the first in each pair of values is set to the off- |
| set of the first code unit of a substring, and the second is set to the |
| offset of the first code unit after the end of a substring. These val- |
| ues are always code unit offsets, not character offsets. That is, they |
| are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit |
| library, and 32-bit offsets in the 32-bit library. |
| |
| After a partial match (error return PCRE2_ERROR_PARTIAL), only the |
| first pair of offsets (that is, ovector[0] and ovector[1]) are set. |
| They identify the part of the subject that was partially matched. See |
| the pcre2partial documentation for details of partial matching. |
| |
| After a successful match, the first pair of offsets identifies the por- |
| tion of the subject string that was matched by the entire pattern. The |
| next pair is used for the first capturing subpattern, and so on. The |
| value returned by pcre2_match() is one more than the highest numbered |
| pair that has been set. For example, if two substrings have been cap- |
| tured, the returned value is 3. If there are no capturing subpatterns, |
| the return value from a successful match is 1, indicating that just the |
| first pair of offsets has been set. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. |
| |
| If a capturing subpattern group is matched repeatedly within a single |
| match operation, it is the last portion of the subject that it matched |
| that is returned. |
| |
| If the ovector is too small to hold all the captured substring offsets, |
| as much as possible is filled in, and the function returns a value of |
| zero. If captured substrings are not of interest, pcre2_match() may be |
| called with a match data block whose ovector is of minimum length (that |
| is, one pair). However, if the pattern contains back references and the |
| ovector is not big enough to remember the related substrings, PCRE2 has |
| to get additional memory for use during matching. Thus it is usually |
| advisable to set up a match data block containing an ovector of reason- |
| able size. |
| |
| It is possible for capturing subpattern number n+1 to match some part |
| of the subject when subpattern n has not been used at all. For example, |
| if the string "abc" is matched against the pattern (a|(z))(bc) the |
| return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 2 is not. When this happens, both values in the offset pairs corre- |
| sponding to unused subpatterns are set to PCRE2_UNSET. |
| |
| Offset values that correspond to unused subpatterns at the end of the |
| expression are also set to PCRE2_UNSET. For example, if the string |
| "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 |
| are not matched. The return from the function is 2, because the high- |
| est used capturing subpattern number is 1. The offsets for the second |
| and third capturing subpatterns (assuming the vector is large |
| enough, of course) are set to PCRE2_UNSET. |
| |
| Elements in the ovector that do not correspond to capturing parentheses |
| in the pattern are never changed. That is, if a pattern contains n cap- |
| turing parentheses, no more than ovector[0] to ovector[2n+1] are set by |
| pcre2_match(). The other elements retain whatever values they previ- |
| ously had. |
| |
| |
| OTHER INFORMATION ABOUT A MATCH |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| As well as the offsets in the ovector, other information about a match |
| is retained in the match data block and can be retrieved by the above |
| functions in appropriate circumstances. If they are called at other |
| times, the result is undefined. |
| |
| After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a |
| failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail- |
| able, and pcre2_get_mark() can be called. It returns a pointer to the |
| zero-terminated name, which is within the compiled pattern. Otherwise |
| NULL is returned. The length of the (*MARK) name (excluding the termi- |
| nating zero) is stored in the code unit that precedes the name. You |
| should use this instead of relying on the terminating zero if the |
| (*MARK) name might contain a binary zero. |
| |
| After a successful match, the (*MARK) name that is returned is the last |
| one encountered on the matching path through the pattern. After a "no |
| match" or a partial match, the last encountered (*MARK) name is |
| returned. For example, consider this pattern: |
| |
| ^(*MARK:A)((*MARK:B)a|b)c |
| |
| When it matches "bc", the returned mark is A. The B mark is "seen" in |
| the first branch of the group, but it is not on the matching path. On |
| the other hand, when this pattern fails to match "bx", the returned |
| mark is B. |
| |
| After a successful match, a partial match, or one of the invalid UTF |
| errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can |
| be called. After a successful or partial match it returns the code unit |
| offset of the character at which the match started. For a non-partial |
| match, this can be different to the value of ovector[0] if the pattern |
| contains the \K escape sequence. After a partial match, however, this |
| value is always the same as ovector[0] because \K does not affect the |
| result of a partial match. |
| |
| After a UTF check failure, pcre2_get_startchar() can be used to obtain |
| the code unit offset of the invalid UTF character. Details are given in |
| the pcre2unicode page. |
| |
| |
| ERROR RETURNS FROM pcre2_match() |
| |
| If pcre2_match() fails, it returns a negative number. This can be con- |
| verted to a text string by calling pcre2_get_error_message(). Negative |
| error codes are also returned by other functions, and are documented |
| with them. The codes are given names in the header file. If UTF check- |
| ing is in force and an invalid UTF subject string is detected, one of a |
| number of UTF-specific negative error codes is returned. Details are |
| given in the pcre2unicode page. The following are the other errors that |
| may be returned by pcre2_match(): |
| |
| PCRE2_ERROR_NOMATCH |
| |
| The subject string did not match the pattern. |
| |
| PCRE2_ERROR_PARTIAL |
| |
| The subject string did not match, but it did match partially. See the |
| pcre2partial documentation for details of partial matching. |
| |
| PCRE2_ERROR_BADMAGIC |
| |
| PCRE2 stores a 4-byte "magic number" at the start of the compiled code, |
| to catch the case when it is passed a junk pointer. This is the error |
| that is returned when the magic number is not present. |
| |
| PCRE2_ERROR_BADMODE |
| |
| This error is given when a pattern that was compiled by the 8-bit |
| library is passed to a 16-bit or 32-bit library function, or vice |
| versa. |
| |
| PCRE2_ERROR_BADOFFSET |
| |
| The value of startoffset was greater than the length of the subject. |
| |
| PCRE2_ERROR_BADOPTION |
| |
| An unrecognized bit was set in the options argument. |
| |
| PCRE2_ERROR_BADUTFOFFSET |
| |
| The UTF code unit sequence that was passed as a subject was checked and |
| found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the |
| value of startoffset did not point to the beginning of a UTF character |
| or the end of the subject. |
| |
| PCRE2_ERROR_CALLOUT |
| |
| This error is never generated by pcre2_match() itself. It is provided |
| for use by callout functions that want to cause pcre2_match() or |
| pcre2_callout_enumerate() to return a distinctive error code. See the |
| pcre2callout documentation for details. |
| |
| PCRE2_ERROR_INTERNAL |
| |
| An unexpected internal error has occurred. This error could be caused |
| by a bug in PCRE2 or by overwriting of the compiled pattern. |
| |
| PCRE2_ERROR_JIT_BADOPTION |
| |
| This error is returned when a pattern that was successfully studied |
| using JIT is being matched, but the matching mode (partial or complete |
| match) does not correspond to any JIT compilation mode. When the JIT |
| fast path function is used, this error may be also given for invalid |
| options. See the pcre2jit documentation for more details. |
| |
| PCRE2_ERROR_JIT_STACKLIMIT |
| |
| This error is returned when a pattern that was successfully studied |
| using JIT is being matched, but the memory available for the just-in- |
| time processing stack is not large enough. See the pcre2jit documenta- |
| tion for more details. |
| |
| PCRE2_ERROR_MATCHLIMIT |
| |
| The backtracking limit was reached. |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| If a pattern contains back references, but the ovector is not big |
| enough to remember the referenced substrings, PCRE2 gets a block of |
| memory at the start of matching to use for this purpose. There are some |
| other special cases where extra memory is needed during matching. This |
| error is given when memory cannot be obtained. |
| |
| PCRE2_ERROR_NULL |
| |
| Either the code, subject, or match_data argument was passed as NULL. |
| |
| PCRE2_ERROR_RECURSELOOP |
| |
| This error is returned when pcre2_match() detects a recursion loop |
| within the pattern. Specifically, it means that either the whole pat- |
| tern or a subpattern has been called recursively for the second time at |
| the same position in the subject string. Some simple patterns that |
| might do this are detected and faulted at compile time, but more com- |
| plicated cases, in particular mutual recursions between two different |
| subpatterns, cannot be detected until matching is attempted. |
| |
| PCRE2_ERROR_RECURSIONLIMIT |
| |
| The internal recursion limit was reached. |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| Captured substrings can be accessed directly by using the ovector as |
| described above. For convenience, auxiliary functions are provided for |
| extracting captured substrings as new, separate, zero-terminated |
| strings. A substring that contains a binary zero is correctly extracted |
| and has a further zero added on the end, but the result is not, of |
| course, a C string. |
| |
| The functions in this section identify substrings by number. The number |
| zero refers to the entire matched substring, with higher numbers refer- |
| ring to substrings captured by parenthesized groups. After a partial |
| match, only substring zero is available. An attempt to extract any |
| other substring gives the error PCRE2_ERROR_PARTIAL. The next section |
| describes similar functions for extracting captured substrings by name. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. In |
| this situation, calling these functions with a zero substring number |
| extracts a zero-length empty string. |
| |
| You can find the length in code units of a captured substring without |
| extracting it by calling pcre2_substring_length_bynumber(). The first |
| argument is a pointer to the match data block, the second is the group |
| number, and the third is a pointer to a variable into which the length |
| is placed. If you just want to know whether or not the substring has |
| been captured, you can pass the third argument as NULL. |
| |
| The pcre2_substring_copy_bynumber() function copies a captured sub- |
| string into a supplied buffer, whereas pcre2_substring_get_bynumber() |
| copies it into new memory, obtained using the same memory allocation |
| function that was used for the match data block. The first two argu- |
| ments of these functions are a pointer to the match data block and a |
| capturing group number. |
| |
| The final arguments of pcre2_substring_copy_bynumber() are a pointer to |
| the buffer and a pointer to a variable that contains its length in code |
| units. This is updated to contain the actual number of code units used |
| for the extracted substring, excluding the terminating zero. |
| |
| For pcre2_substring_get_bynumber() the third and fourth arguments point |
| to variables that are updated with a pointer to the new memory and the |
| number of code units that comprise the substring, again excluding the |
| terminating zero. When the substring is no longer needed, the memory |
| should be freed by calling pcre2_substring_free(). |
| |
| The return value from all these functions is zero for success, or a |
| negative error code. If the pattern match failed, the match failure |
| code is returned. If a substring number greater than zero is used |
| after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible |
| error codes are: |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| The buffer was too small for pcre2_substring_copy_bynumber(), or the |
| attempt to get memory failed for pcre2_substring_get_bynumber(). |
| |
| PCRE2_ERROR_NOSUBSTRING |
| |
| There is no substring with that number in the pattern, that is, the |
| number is greater than the number of capturing parentheses. |
| |
| PCRE2_ERROR_UNAVAILABLE |
| |
| The substring number, though not greater than the number of captures in |
| the pattern, is greater than the number of slots in the ovector, so the |
| substring could not be captured. |
| |
| PCRE2_ERROR_UNSET |
| |
| The substring did not participate in the match. For example, if the |
| pattern is (abc)|(def) and the subject is "def", and the ovector con- |
| tains at least two capturing slots, substring number 1 is unset. |
| |
| |
| EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| The pcre2_substring_list_get() function extracts all available sub- |
| strings and builds a list of pointers to them. It also (optionally) |
| builds a second list that contains their lengths (in code units), |
| excluding a terminating zero that is added to each of them. All this is |
| done in a single block of memory that is obtained using the same memory |
| allocation function that was used to get the match data block. |
| |
| This function must be called only after a successful match. If called |
| after a partial match, the error code PCRE2_ERROR_PARTIAL is returned. |
| |
| The address of the memory block is returned via listptr, which is also |
| the start of the list of string pointers. The end of the list is marked |
| by a NULL pointer. The address of the list of lengths is returned via |
| lengthsptr. If your strings do not contain binary zeros and you do not |
| therefore need the lengths, you may supply NULL as the lengthsptr argu- |
| ment to disable the creation of a list of lengths. The yield of the |
| function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- |
| ory block could not be obtained. When the list is no longer needed, it |
| should be freed by calling pcre2_substring_list_free(). |
| |
| If this function encounters a substring that is unset, which can happen |
| when capturing subpattern number n+1 matches some part of the subject, |
| but subpattern n has not been used at all, it returns an empty string. |
| This can be distinguished from a genuine zero-length substring by |
| inspecting the appropriate offset in the ovector, which contains |
| PCRE2_UNSET for unset substrings, or by calling pcre2_sub- |
| string_length_bynumber(). |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NAME |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| To extract a substring by name, you first have to find the associated |
| number. For example, for this pattern: |
| |
| (a+)b(?<xxx>\d+)... |
| |
| the number of the subpattern called "xxx" is 2. If the name is known to |
| be unique (PCRE2_DUPNAMES was not set), you can find the number from |
| the name by calling pcre2_substring_number_from_name(). The first argu- |
| ment is the compiled pattern, and the second is the name. The yield of |
| the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there |
| is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if |
| there is more than one subpattern of that name. Given the number, you |
| can extract the substring directly, or use one of the functions |
| described above. |
| |
| For convenience, there are also "byname" functions that correspond to |
| the "bynumber" functions, the only difference being that the second |
| argument is a name instead of a number. If PCRE2_DUPNAMES is set and |
| there are duplicate names, these functions scan all the groups with the |
| given name, and return the first named string that is set. |
| |
| If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is |
| returned. If all groups with the name have numbers that are greater |
| than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is |
| returned. If there is at least one group with a slot in the ovector, |
| but no group is found to be set, PCRE2_ERROR_UNSET is returned. |
| |
| Warning: If the pattern uses the (?| feature to set up multiple subpat- |
| terns with the same number, as described in the section on duplicate |
| subpattern numbers in the pcre2pattern page, you cannot use names to |
| distinguish the different subpatterns, because names are not included |
| in the compiled code. The matching process uses only numbers. For this |
| reason, the use of different names for subpatterns of the same number |
| causes an error at compile time. |
| |
| |
| CREATING A NEW STRING WITH SUBSTITUTIONS |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| This function calls pcre2_match() and then makes a copy of the subject |
| string in outputbuffer, replacing the part that was matched with the |
| replacement string, whose length is supplied in rlength. This can be |
| given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in |
| which a \K item in a lookahead in the pattern causes the match to end |
| before it starts are not supported, and give rise to an error return. |
| |
| The first seven arguments of pcre2_substitute() are the same as for |
| pcre2_match(), except that the partial matching options are not permit- |
| ted, and match_data may be passed as NULL, in which case a match data |
| block is obtained and freed within this function, using memory manage- |
| ment functions from the match context, if provided, or else those that |
| were used to allocate memory for the compiled code. |
| |
| The outlengthptr argument must point to a variable that contains the |
| length, in code units, of the output buffer. If the function is suc- |
| cessful, the value is updated to contain the length of the new string, |
| excluding the trailing zero that is automatically added. |
| |
| If the function is not successful, the value set via outlengthptr |
| depends on the type of error. For syntax errors in the replacement |
| string, the value is the offset in the replacement string where the |
| error was detected. For other errors, the value is PCRE2_UNSET by |
| default. This includes the case of the output buffer being too small, |
| unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which |
| case the value is the minimum length needed, including space for the |
| trailing zero. Note that in order to compute the required length, |
| pcre2_substitute() has to simulate all the matching and copying, |
| instead of giving an error return as soon as the buffer overflows. Note |
| also that the length is in code units, not bytes. |
| |
| In the replacement string, which is interpreted as a UTF string in UTF |
| mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK |
| option is set, a dollar character is an escape character that can spec- |
| ify the insertion of characters from capturing groups or (*MARK) items |
| in the pattern. The following forms are always recognized: |
| |
| $$ insert a dollar character |
| $<n> or ${<n>} insert the contents of group <n> |
| $*MARK or ${*MARK} insert the name of the last (*MARK) encountered |
| |
| Either a group number or a group name can be given for <n>. Curly |
| brackets are required only if the following character would be inter- |
| preted as part of the number or name. The number may be zero to include |
| the entire matched string. For example, if the pattern a(b)c is |
| matched with "=abc=" and the replacement string "+$1$0$1+", the result |
| is "=+babcb+=". |
| |
| The facility for inserting a (*MARK) name can be used to perform simple |
| simultaneous substitutions, as this pcre2test example shows: |
| |
| /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK} |
| apple lemon |
| 2: pear orange |
| |
| As well as the usual options for pcre2_match(), a number of additional |
| options can be set in the options argument. |
| |
| PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject |
| string, replacing every matching substring. If this is not set, only |
| the first matching substring is replaced. If any matched substring has |
| zero length, after the substitution has happened, an attempt to find a |
| non-empty match at the same position is performed. If this is not suc- |
| cessful, the current position is advanced by one character except when |
| CRLF is a valid newline sequence and the next two characters are CR, |
| LF. In this case, the current position is advanced by two characters. |
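| |
| The advancing rule can be sketched as a small C helper. The function |
| below is hypothetical and is not part of PCRE2. It assumes one code |
| unit per character (non-UTF mode) and takes a flag that says whether |
| CRLF is a valid newline sequence under the convention in force. |
| |
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function): given an empty match at
   `offset` for which no non-empty match could be found, return the offset
   at which a global substitution would resume. Assumes non-UTF mode, so
   that one character is one code unit. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, int crlf_is_newline)
{
    assert(offset < length);   /* the caller stops at the end of the subject */

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;     /* step over CR LF as a single unit */

    return offset + 1;         /* otherwise step over one character */
}
```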
| |
| PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output |
| buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- |
| ORY immediately. If this option is set, however, pcre2_substitute() |
| continues to go through the motions of matching and substituting (with- |
| out, of course, writing anything) in order to compute the size of buf- |
| fer that is needed. This value is passed back via the outlengthptr |
| variable, with the result of the function still being |
| PCRE2_ERROR_NOMEMORY. |
| |
| Passing a buffer size of zero is a permitted way of finding out how |
| much memory is needed for a given substitution. However, this does mean |
| that the entire operation is carried out twice. Depending on the appli- |
| cation, it may be more efficient to allocate a large buffer and free |
| the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- |
| FLOW_LENGTH. |
| |
| PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups |
| that do not appear in the pattern to be treated as unset groups. This |
| option should be used with care, because it means that a typo in a |
| group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING |
| error. |
| |
| PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including |
| unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be |
| treated as empty strings when inserted as described above. If this |
| option is not set, an attempt to insert an unset group causes the |
| PCRE2_ERROR_UNSET error. This option does not influence the extended |
| substitution syntax described below. |
| |
| PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the |
| replacement string. Without this option, only the dollar character is |
| special, and only the group insertion forms listed above are valid. |
| When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: |
| |
| Firstly, backslash in a replacement string is interpreted as an escape |
| character. The usual forms such as \n or \x{ddd} can be used to specify |
| particular character codes, and backslash followed by any non-alphanu- |
| meric character quotes that character. Extended quoting can be coded |
| using \Q...\E, exactly as in pattern strings. |
| |
| There are also four escape sequences for forcing the case of inserted |
| letters. The insertion mechanism has three states: no case forcing, |
| force upper case, and force lower case. The escape sequences change the |
| current state: \U and \L change to upper or lower case forcing, respec- |
| tively, and \E (when not terminating a \Q quoted sequence) reverts to |
| no case forcing. The sequences \u and \l force the next character (if |
| it is a letter) to upper or lower case, respectively, and then the |
| state automatically reverts to no case forcing. Case forcing applies to |
| all inserted characters, including those from captured groups and let- |
| ters within \Q...\E quoted sequences. |
| |
| Note that case forcing sequences such as \U...\E do not nest. For exam- |
| ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final |
| \E has no effect. |
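| |
| The state changes can be traced with a toy model in C. This is only a |
| sketch of the behaviour described above, not PCRE2's implementation: |
| it handles ASCII letters and the five case forcing escapes only, and |
| any other character that follows a backslash is copied literally. The |
| names force and apply_case_forcing() are invented for the illustration. |
| |
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Toy model (not PCRE2 code). Copies `in` to `out`, applying the case
   forcing escapes \U \L \E \u \l. `out` must have room for
   strlen(in) + 1 characters. Returns the length of the result. */
static size_t apply_case_forcing(const char *in, char *out)
{
    enum force state = FORCE_NONE;
    int one_shot = 0;              /* set by \u and \l: one character only */
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in == '\\' && in[1] != '\0') {
            in++;                  /* look at the character after the backslash */
            switch (*in) {
            case 'U': state = FORCE_UPPER; one_shot = 0; continue;
            case 'L': state = FORCE_LOWER; one_shot = 0; continue;
            case 'E': state = FORCE_NONE;  one_shot = 0; continue;
            case 'u': state = FORCE_UPPER; one_shot = 1; continue;
            case 'l': state = FORCE_LOWER; one_shot = 1; continue;
            default:  break;       /* any other escape: copy the character */
            }
        }
        unsigned char c = (unsigned char)*in;
        if (state == FORCE_UPPER) c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER) c = (unsigned char)tolower(c);
        out[n++] = (char)c;
        if (one_shot) { state = FORCE_NONE; one_shot = 0; }
    }
    out[n] = '\0';
    return n;
}
```
| |
| In this model there is a single current state rather than a stack of |
| states, which is why the sequences do not nest. |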
| |
| The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more |
| flexibility to group substitution. The syntax is similar to that used |
| by Bash: |
| |
| ${<n>:-<string>} |
| ${<n>:+<string1>:<string2>} |
| |
| As before, <n> may be a group number or a name. The first form speci- |
| fies a default value. If group <n> is set, its value is inserted; if |
| not, <string> is expanded and the result inserted. The second form |
| specifies strings that are expanded and inserted when group <n> is set |
| or unset, respectively. The first form is just a convenient shorthand |
| for |
| |
| ${<n>:+${<n>}:<string>} |
| |
| Backslash can be used to escape colons and closing curly brackets in |
| the replacement strings. A change of the case forcing state within a |
| replacement string remains in force afterwards, as shown in this |
| pcre2test example: |
| |
| /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo |
| body |
| 1: hello |
| somebody |
| 1: HELLO |
| |
| The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended |
| substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause |
| unknown groups in the extended syntax forms to be treated as unset. |
| |
| If successful, pcre2_substitute() returns the number of replacements |
| that were made. This may be zero if no matches were found, and is never |
| greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set. |
| |
| In the event of an error, a negative error code is returned. Except for |
| PCRE2_ERROR_NOMATCH (which is never returned), errors from |
| pcre2_match() are passed straight back. |
| |
| PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- |
| tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. |
| |
| PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- |
| ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) |
| when the simple (non-extended) syntax is used and PCRE2_SUBSTI- |
| TUTE_UNSET_EMPTY is not set. |
| |
| PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big |
| enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size |
| of buffer that is needed is returned via outlengthptr. Note that this |
| does not happen by default. |
| |
| PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in |
| the replacement string, with more particular errors being |
| PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), |
| PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found), |
| PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group |
| substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended |
| before it started, which can happen if \K is used in an assertion). |
| |
| As for all PCRE2 errors, a text message that describes the error can be |
| obtained by calling pcre2_get_error_message(). |
| |
| |
| DUPLICATE SUBPATTERN NAMES |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| When a pattern is compiled with the PCRE2_DUPNAMES option, names for |
| subpatterns are not required to be unique. Duplicate names are always |
| allowed for subpatterns with the same number, created by using the (?| |
| feature. Indeed, if such subpatterns are named, they are required to |
| use the same names. |
| |
| Normally, patterns with duplicate names are such that in any one match, |
| only one of the named subpatterns participates. An example is shown in |
| the pcre2pattern documentation. |
| |
| When duplicates are present, pcre2_substring_copy_byname() and |
| pcre2_substring_get_byname() return the first substring corresponding |
| to the given name that is set. Only if none are set is |
| PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() |
| function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are |
| duplicate names. |
| |
| If you want to get full details of all captured substrings for a given |
| name, you must use the pcre2_substring_nametable_scan() function. The |
| first argument is the compiled pattern, and the second is the name. If |
| the third and fourth arguments are NULL, the function returns a group |
| number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. |
| |
| When the third and fourth arguments are not NULL, they must be pointers |
| to variables that are updated by the function. After it has run, they |
| point to the first and last entries in the name-to-number table for the |
| given name, and the function returns the length of each entry in code |
| units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are |
| no entries for the given name. |
| |
| The format of the name table is described above in the section entitled |
| Information about a pattern. Given all the relevant entries for the |
| name, you can extract each of their numbers, and hence the captured |
| data. |
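| |
| Extracting the group numbers from the entries can be sketched as shown |
| below. The sketch assumes the 8-bit library, where each name table |
| entry starts with the group number in two bytes, most significant |
| first, followed by the zero-terminated name. The function name |
| groups_for_name() is hypothetical. Its first and last arguments are the |
| pointers set by pcre2_substring_nametable_scan(), and entry_size is |
| that function's return value. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function). Walks the name table
   entries from `first` to `last` inclusive and stores up to `max` group
   numbers in `numbers`. Assumes the 8-bit library's entry layout: two
   bytes of group number, most significant first, then the name. Returns
   the number of group numbers stored. */
static size_t groups_for_name(const uint8_t *first, const uint8_t *last,
                              int entry_size, uint32_t *numbers, size_t max)
{
    assert(entry_size > 2 && first <= last);

    size_t entries = (size_t)(last - first) / (size_t)entry_size + 1;
    size_t count = entries < max ? entries : max;

    for (size_t i = 0; i < count; i++) {
        const uint8_t *entry = first + i * (size_t)entry_size;
        numbers[i] = ((uint32_t)entry[0] << 8) | entry[1];
    }
    return count;
}
```
| |
| Each number obtained in this way can then be passed to a function such |
| as pcre2_substring_get_bynumber() to fetch the captured data. |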
| |
| |
| FINDING ALL POSSIBLE MATCHES AT ONE POSITION |
| |
| The traditional matching function uses a similar algorithm to Perl, |
| which stops when it finds the first match at a given point in the sub- |
| ject. If you want to find all possible matches, or the longest possible |
| match at a given position, consider using the alternative matching |
| function (see below) instead. If you cannot use the alternative func- |
| tion, you can kludge it up by making use of the callout facility, which |
| is described in the pcre2callout documentation. |
| |
| What you have to do is to insert a callout right at the end of the pat- |
| tern. When your callout function is called, extract and save the cur- |
| rent matched substring. Then return 1, which forces pcre2_match() to |
| backtrack and try other alternatives. Ultimately, when it runs out of |
| matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. |
| |
| |
| MATCHING A PATTERN: THE ALTERNATIVE FUNCTION |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| The function pcre2_dfa_match() is called to match a subject string |
| against a compiled pattern, using a matching algorithm that scans the |
| subject string just once, and does not backtrack. This has different |
| characteristics to the normal algorithm, and is not compatible with |
| Perl. Some of the features of PCRE2 patterns are not supported. Never- |
| theless, there are times when this kind of matching can be useful. For |
| a discussion of the two matching algorithms, and a list of features |
| that pcre2_dfa_match() does not support, see the pcre2matching documen- |
| tation. |
| |
| The arguments for the pcre2_dfa_match() function are the same as for |
| pcre2_match(), plus two extras. The ovector within the match data block |
| is used in a different way, and this is described below. The other com- |
| mon arguments are used in the same way as for pcre2_match(), so their |
| description is not repeated here. |
| |
| The two additional arguments provide workspace for the function. The |
| workspace vector should contain at least 20 elements. It is used for |
| keeping track of multiple paths through the pattern tree. More |
| workspace is needed for patterns and subjects where there are a lot of |
| potential matches. |
| |
| Here is an example of a simple call to pcre2_dfa_match(): |
| |
| int wspace[20]; |
| pcre2_match_data *md = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_dfa_match( |
| re, /* result of pcre2_compile() */ |
| "some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| md, /* the match data block */ |
| NULL, /* a match context; NULL means use defaults */ |
| wspace, /* working space vector */ |
| 20); /* number of elements (NOT size in bytes) */ |
| |
| Option bits for pcre2_dfa_match() |
| |
| The unused bits of the options argument for pcre2_dfa_match() must be |
| zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, |
| PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, |
| PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of |
| these are exactly the same as for pcre2_match(), so their description |
| is not repeated here. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These have the same general effect as they do for pcre2_match(), but |
| the details are slightly different. When PCRE2_PARTIAL_HARD is set for |
| pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the |
| subject is reached and there is still at least one matching possibility |
| that requires additional characters. This happens even if some complete |
| matches have already been found. When PCRE2_PARTIAL_SOFT is set, the |
| return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL |
| if the end of the subject is reached, there have been no complete |
| matches, but there is still at least one matching possibility. The por- |
| tion of the string that was inspected when the longest partial match |
| was found is set as the first matching string in both cases. There is a |
| more detailed discussion of partial and multi-segment matching, with |
| examples, in the pcre2partial documentation. |
| |
| PCRE2_DFA_SHORTEST |
| |
| Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to |
| stop as soon as it has found one match. Because of the way the alterna- |
| tive algorithm works, this is necessarily the shortest possible match |
| at the first possible matching point in the subject string. |
| |
| PCRE2_DFA_RESTART |
| |
| When pcre2_dfa_match() returns a partial match, it is possible to call |
| it again, with additional subject characters, and have it continue with |
| the same match. The PCRE2_DFA_RESTART option requests this action; when |
| it is set, the workspace and wscount arguments must reference the same |
| vector as before because data about the match so far is left in them |
| after a partial match. There is more discussion of this facility in the |
| pcre2partial documentation. |
| |
| Successful returns from pcre2_dfa_match() |
| |
| When pcre2_dfa_match() succeeds, it may have matched more than one sub- |
| string in the subject. Note, however, that all the matches from one run |
| of the function start at the same point in the subject. The shorter |
| matches are all initial substrings of the longer matches. For example, |
| if the pattern |
| |
| <.*> |
| |
| is matched against the string |
| |
| This is <something> <something else> <something further> no more |
| |
| the three matched strings are |
| |
| <something> <something else> <something further> |
| <something> <something else> |
| <something> |
| |
| On success, the yield of the function is a number greater than zero, |
| which is the number of matched substrings. The offsets of the sub- |
| strings are returned in the ovector, and can be extracted by number in |
| the same way as for pcre2_match(), but the numbers bear no relation to |
| any capturing groups that may exist in the pattern, because DFA match- |
| ing does not support group capture. |
| |
| Calls to the convenience functions that extract substrings by name |
| return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used |
| after a DFA match. The convenience functions that extract substrings by |
| number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some |
| other errors are slightly different: |
| |
| PCRE2_ERROR_UNAVAILABLE |
| |
| The ovector is not big enough to include a slot for the given substring |
| number. |
| |
| PCRE2_ERROR_UNSET |
| |
| There is a slot in the ovector for this substring, but there were |
| insufficient matches to fill it. |
| |
| The matched strings are stored in the ovector in reverse order of |
| length; that is, the longest matching string is first. If there were |
| too many matches to fit into the ovector, the yield of the function is |
| zero, and the vector is filled with the longest matches. |
| |
| NOTE: PCRE2's "auto-possessification" optimization usually applies to |
| character repeats at the end of a pattern (as well as internally). For |
| example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA |
| matching, this means that only one possible match is found. If you |
| really do want multiple matches in such cases, either use an ungreedy |
| repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when |
| compiling. |
| |
| Error returns from pcre2_dfa_match() |
| |
| The pcre2_dfa_match() function returns a negative number when it fails. |
| Many of the errors are the same as for pcre2_match(), as described |
| above. There are in addition the following errors that are specific to |
| pcre2_dfa_match(): |
| |
| PCRE2_ERROR_DFA_UITEM |
| |
| This return is given if pcre2_dfa_match() encounters an item in the |
| pattern that it does not support, for instance, the use of \C in a UTF |
| mode or a back reference. |
| |
| PCRE2_ERROR_DFA_UCOND |
| |
| This return is given if pcre2_dfa_match() encounters a condition item |
| that uses a back reference for the condition, or a test for recursion |
| in a specific group. These are not supported. |
| |
| PCRE2_ERROR_DFA_WSSIZE |
| |
| This return is given if pcre2_dfa_match() runs out of space in the |
| workspace vector. |
| |
| PCRE2_ERROR_DFA_RECURSE |
| |
| When a recursive subpattern is processed, the matching function calls |
| itself recursively, using private memory for the ovector and workspace. |
| This error is given if the internal ovector is not large enough. This |
| should be extremely rare, as a vector of size 1000 is used. |
| |
| PCRE2_ERROR_DFA_BADRESTART |
| |
| When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, |
| some plausibility checks are made on the contents of the workspace, |
| which should contain data about the previous partial match. If any of |
| these checks fail, this error is given. |
| |
| |
| SEE ALSO |
| |
| pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), |
| pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3), |
| pcre2unicode(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 December 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| BUILDING PCRE2 |
| |
| PCRE2 is distributed with a configure script that can be used to build |
| the library in Unix-like environments using the applications known as |
| Autotools. Also in the distribution are files to support building using |
| CMake instead of configure. The text file README contains general |
| information about building with Autotools (some of which is repeated |
| below), and also has some comments about building on various operating |
| systems. There is a lot more information about building PCRE2 without |
| using Autotools (including information about using CMake and building |
| "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should |
| consult this file as well as the README file if you are building in a |
| non-Unix-like environment. |
| |
| |
| PCRE2 BUILD-TIME OPTIONS |
| |
| The rest of this document describes the optional features of PCRE2 that |
| can be selected when the library is compiled. It assumes use of the |
| configure script, where the optional features are selected or dese- |
| lected by providing options to configure before running the make com- |
| mand. However, the same options can be selected in both Unix-like and |
| non-Unix-like environments if you are using CMake instead of configure |
| to build PCRE2. |
| |
| If you are not using Autotools or CMake, option selection can be done |
| by editing the config.h file, or by passing parameter settings to the |
| compiler, as described in NON-AUTOTOOLS-BUILD. |
| |
| The complete list of options for configure (which includes the standard |
| ones such as the selection of the installation directory) can be |
| obtained by running |
| |
| ./configure --help |
| |
| The following sections include descriptions of options whose names |
| begin with --enable or --disable. These settings specify changes to the |
| defaults for the configure command. Because of the way that configure |
| works, --enable and --disable always come in pairs, so the complemen- |
| tary option always exists as well, but as it specifies the default, it |
| is not described. |
| |
| |
| BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES |
| |
| By default, a library called libpcre2-8 is built, containing functions |
| that take string arguments contained in vectors of bytes, interpreted |
| either as single-byte characters, or UTF-8 strings. You can also build |
| two other libraries, called libpcre2-16 and libpcre2-32, which process |
| strings that are contained in vectors of 16-bit and 32-bit code units, |
| respectively. These can be interpreted either as single-unit characters |
| or UTF-16/UTF-32 strings. To build these additional libraries, add one |
| or both of the following to the configure command: |
| |
| --enable-pcre2-16 |
| --enable-pcre2-32 |
| |
| If you do not want the 8-bit library, add |
| |
| --disable-pcre2-8 |
| |
| as well. At least one of the three libraries must be built. Note that |
| the POSIX wrapper is for the 8-bit library only, and that pcre2grep is |
| an 8-bit program. Neither of these is built if you select only the |
| 16-bit or 32-bit libraries. |
| |
| |
| BUILDING SHARED AND STATIC LIBRARIES |
| |
| The Autotools PCRE2 building process uses libtool to build both shared |
| and static libraries by default. You can suppress an unwanted library |
| by adding one of |
| |
| --disable-shared |
| --disable-static |
| |
| to the configure command. |
| |
| |
| UNICODE AND UTF SUPPORT |
| |
| By default, PCRE2 is built with support for Unicode and UTF character |
| strings. To build it without Unicode support, add |
| |
| --disable-unicode |
| |
| to the configure command. This setting applies to all three libraries. |
| It is not possible to build one library with Unicode support, and |
| another without, in the same configuration. |
| |
| Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, |
| UTF-16 or UTF-32. To do that, applications that use the library can set |
| the PCRE2_UTF option when they call pcre2_compile() to compile a pat- |
| tern. Alternatively, patterns may be started with (*UTF) unless the |
| application has locked this out by setting PCRE2_NEVER_UTF. |
| |
| UTF support allows the libraries to process character code points up to |
| 0x10ffff in the strings that they handle. It also provides support for |
| accessing the Unicode properties of such characters, using pattern |
| escapes such as \P, \p, and \X. Only the general category properties |
| such as Lu and Nd are supported. Details are given in the pcre2pattern |
| documentation. |
| |
| Pattern escapes such as \d and \w do not by default make use of Unicode |
| properties. The application can request that they do by setting the |
| PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a |
| pattern may also request this by starting with (*UCP). |
| |
| |
| DISABLING THE USE OF \C |
| |
| The \C escape sequence, which matches a single code unit, even in a UTF |
| mode, can cause unpredictable behaviour because it may leave the cur- |
| rent matching point in the middle of a multi-code-unit character. The |
| application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C |
| option when calling pcre2_compile(). There is also a build-time option |
| |
| --enable-never-backslash-C |
| |
| (note the upper case C) which locks out the use of \C entirely. |
| |
| |
| JUST-IN-TIME COMPILER SUPPORT |
| |
| Just-in-time compiler support is included in the build by specifying |
| |
| --enable-jit |
| |
| This support is available only for certain hardware architectures. If |
| this option is set for an unsupported architecture, a building error |
| occurs. See the pcre2jit documentation for a discussion of JIT usage. |
| When JIT support is enabled, pcre2grep automatically makes use of it, |
| unless you add |
| |
| --disable-pcre2grep-jit |
| |
| to the "configure" command. |
| |
| |
| NEWLINE RECOGNITION |
| |
| By default, PCRE2 interprets the linefeed (LF) character as indicating |
| the end of a line. This is the normal newline character on Unix-like |
| systems. You can compile PCRE2 to use carriage return (CR) instead, by |
| adding |
| |
| --enable-newline-is-cr |
| |
| to the configure command. There is also an --enable-newline-is-lf |
| option, which explicitly specifies linefeed as the newline character. |
| |
| Alternatively, you can specify that line endings are to be indicated by |
| the two-character sequence CRLF (CR immediately followed by LF). If you |
| want this, add |
| |
| --enable-newline-is-crlf |
| |
| to the configure command. There is a fourth option, specified by |
| |
| --enable-newline-is-anycrlf |
| |
| which causes PCRE2 to recognize any of the three sequences CR, LF, or |
| CRLF as indicating a line ending. Finally, a fifth option, specified by |
| |
| --enable-newline-is-any |
| |
| causes PCRE2 to recognize any Unicode newline sequence. The Unicode |
| newline sequences are the three just mentioned, plus the single charac- |
| ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, |
| U+0085), LS (line separator, U+2028), and PS (paragraph separator, |
| U+2029). |
| |
| Whatever default line ending convention is selected when PCRE2 is built |
| can be overridden by applications that use the library. At build time |
| it is conventional to use the standard for your operating system. |
| |
| |
| WHAT \R MATCHES |
| |
| By default, the sequence \R in a pattern matches any Unicode newline |
| sequence, independently of what has been selected as the line ending |
| sequence. If you specify |
| |
| --enable-bsr-anycrlf |
| |
| the default is changed so that \R matches only CR, LF, or CRLF. What- |
| ever is selected when PCRE2 is built can be overridden by applications |
| that use the library. |
| |
| |
| HANDLING VERY LARGE PATTERNS |
| |
| Within a compiled pattern, offset values are used to point from one |
| part to another (for example, from an opening parenthesis to an alter- |
| nation metacharacter). By default, in the 8-bit and 16-bit libraries, |
| two-byte values are used for these offsets, leading to a maximum size |
| for a compiled pattern of around 64K code units. This is sufficient to |
| handle all but the most gigantic patterns. Nevertheless, some people do |
| want to process truly enormous patterns, so it is possible to compile |
| PCRE2 to use three-byte or four-byte offsets by adding a setting such |
| as |
| |
| --with-link-size=3 |
| |
| to the configure command. The value given must be 2, 3, or 4. For the |
| 16-bit library, a value of 3 is rounded up to 4. In these libraries, |
| using longer offsets slows down the operation of PCRE2 because it has |
| to load additional data when handling them. For the 32-bit library the |
| value is always 4 and cannot be overridden; the value of --with-link- |
| size is ignored. |
| |
| |
| AVOIDING EXCESSIVE STACK USAGE |
| |
| When matching with the pcre2_match() function, PCRE2 implements back- |
| tracking by making recursive calls to an internal function called |
| match(). In environments where the size of the stack is limited, this |
| can severely limit PCRE2's operation. (The Unix environment does not |
| usually suffer from this problem, but it may sometimes be necessary to |
| increase the maximum stack size. There is a discussion in the |
| pcre2stack documentation.) An alternative approach to recursion that |
| uses memory from the heap to remember data, instead of using recursive |
| function calls, has been implemented to work round the problem of lim- |
| ited stack size. If you want to build a version of PCRE2 that works |
| this way, add |
| |
| --disable-stack-for-recursion |
| |
| to the configure command. By default, the system functions malloc() and |
| free() are called to manage the heap memory that is required, but cus- |
| tom memory management functions can be called instead. PCRE2 runs |
| noticeably more slowly when built in this way. This option affects only |
| the pcre2_match() function; it is not relevant for pcre2_dfa_match(). |
| |
| |
| LIMITING PCRE2 RESOURCE USAGE |
| |
| Internally, PCRE2 has a function called match(), which it calls repeat- |
| edly (sometimes recursively) when matching a pattern with the |
| pcre2_match() function. By controlling the maximum number of times this |
| function may be called during a single matching operation, a limit can |
| be placed on the resources used by a single call to pcre2_match(). The |
| limit can be changed at run time, as described in the pcre2api documen- |
| tation. The default is 10 million, but this can be changed by adding a |
| setting such as |
| |
| --with-match-limit=500000 |
| |
| to the configure command. This setting has no effect on the |
| pcre2_dfa_match() matching function. |
| |
| In some environments it is desirable to limit the depth of recursive |
| calls of match() more strictly than the total number of calls, in order |
| to restrict the maximum amount of stack (or heap, if --disable-stack- |
| for-recursion is specified) that is used. A second limit controls this; |
| it defaults to the value that is set for --with-match-limit, which |
| imposes no additional constraints. However, you can set a lower limit |
| by adding, for example, |
| |
| --with-match-limit-recursion=10000 |
| |
| to the configure command. This value can also be overridden at run |
| time. |
| |
| |
| CREATING CHARACTER TABLES AT BUILD TIME |
| |
| PCRE2 uses fixed tables for processing characters whose code points are |
| less than 256. By default, PCRE2 is built with a set of tables that are |
| distributed in the file src/pcre2_chartables.c.dist. These tables are |
| for ASCII codes only. If you add |
| |
| --enable-rebuild-chartables |
| |
| to the configure command, the distributed tables are no longer used. |
| Instead, a program called dftables is compiled and run. This outputs |
| the source for a new set of tables, created in the default locale of your |
| C run-time system. (This method of replacing the tables does not work |
| if you are cross compiling, because dftables is run on the local host. |
| If you need to create alternative tables when cross compiling, you will |
| have to do so "by hand".) |
| |
| |
| USING EBCDIC CODE |
| |
| PCRE2 assumes by default that it will run in an environment where the |
| character code is ASCII or Unicode, which is a superset of ASCII. This |
| is the case for most computer operating systems. PCRE2 can, however, be |
| compiled to run in an 8-bit EBCDIC environment by adding |
| |
| --enable-ebcdic --disable-unicode |
| |
| to the configure command. This setting implies --enable-rebuild-charta- |
| bles. You should only use it if you know that you are in an EBCDIC |
| environment (for example, an IBM mainframe operating system). |
| |
| It is not possible to support both EBCDIC and UTF-8 codes in the same |
| version of the library. Consequently, --enable-unicode and --enable- |
| ebcdic are mutually exclusive. |
| |
| The EBCDIC character that corresponds to an ASCII LF is assumed to have |
| the value 0x15 by default. However, in some EBCDIC environments, 0x25 |
| is used. In such an environment you should use |
| |
| --enable-ebcdic-nl25 |
| |
| as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR |
| has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and |
| 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- |
| acter (which, in Unicode, is 0x85). |
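| |
| The LF and NEL assignments above can be summarized in a few lines of C. |
| This is only a sketch of the rule as stated in this section; the type |
| and function names are hypothetical and are not part of PCRE2. |

```c
#include <stdbool.h>

/* Hypothetical summary type, not part of PCRE2. */
typedef struct {
    unsigned char cr;   /* carriage return: 0x0d, as in ASCII */
    unsigned char lf;   /* line feed: 0x15 by default, 0x25 with nl25 */
    unsigned char nel;  /* next line: whichever of 0x15 and 0x25 is not LF */
} ebcdic_newlines;

/* nl25 is true when the build used --enable-ebcdic-nl25. */
ebcdic_newlines ebcdic_newline_codes(bool nl25)
{
    ebcdic_newlines n;
    n.cr  = 0x0d;
    n.lf  = nl25 ? 0x25 : 0x15;
    n.nel = nl25 ? 0x15 : 0x25;
    return n;
}
```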
| |
| The options that select newline behaviour, such as --enable-newline-is- |
| cr, and equivalent run-time options, refer to these character values in |
| an EBCDIC environment. |
| |
| |
| PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT |
| |
| By default, pcre2grep reads all files as plain text. You can build it |
| so that it recognizes files whose names end in .gz or .bz2, and reads |
| them with libz or libbz2, respectively, by adding one or both of |
| |
| --enable-pcre2grep-libz |
| --enable-pcre2grep-libbz2 |
| |
| to the configure command. These options naturally require that the rel- |
| evant libraries are installed on your system. Configuration will fail |
| if they are not. |
| |
| |
| PCRE2GREP BUFFER SIZE |
| |
| pcre2grep uses an internal buffer to hold a "window" on the file it is |
| scanning, in order to be able to output "before" and "after" lines when |
| it finds a match. The size of the buffer is controlled by a parameter |
| whose default value is 20K. The buffer itself is three times this size, |
| but because of the way it is used for holding "before" lines, the long- |
| est line that is guaranteed to be processable is the parameter size. |
| You can change the default parameter value by adding, for example, |
| |
| --with-pcre2grep-bufsize=50K |
| |
| to the configure command. The caller of pcre2grep can override this |
| value by using --buffer-size on the command line. |
| |
| |
| PCRE2TEST OPTION FOR LIBREADLINE SUPPORT |
| |
| If you add one of |
| |
| --enable-pcre2test-libreadline |
| --enable-pcre2test-libedit |
| |
| to the configure command, pcre2test is linked with the libreadline |
| or libedit library, respectively, and when its input is from a terminal, |
| it reads it using the readline() function. This provides line-editing |
| and history facilities. Note that libreadline is GPL-licensed, so if |
| you distribute a binary of pcre2test linked in this way, there may be |
| licensing issues. These can be avoided by linking instead with libedit, |
| which has a BSD licence. |
| |
| Setting --enable-pcre2test-libreadline causes the -lreadline option to |
| be added to the pcre2test build. In many operating environments with a |
| system-installed readline library this is sufficient. However, in some |
| environments (e.g. if an unmodified distribution version of readline is |
| in use), some extra configuration may be necessary. The INSTALL file |
| for libreadline says this: |
| |
| "Readline uses the termcap functions, but does not link with |
| the termcap or curses library itself, allowing applications |
| which link with readline the to choose an appropriate library." |
| |
| If your environment has not been set up so that an appropriate library |
| is automatically included, you may need to add something like |
| |
| LIBS="-lncurses" |
| |
| immediately before the configure command. |
| |
| |
| INCLUDING DEBUGGING CODE |
| |
| If you add |
| |
| --enable-debug |
| |
| to the configure command, additional debugging code is included in the |
| build. This feature is intended for use by the PCRE2 maintainers. |
| |
| |
| DEBUGGING WITH VALGRIND SUPPORT |
| |
| If you add |
| |
| --enable-valgrind |
| |
| to the configure command, PCRE2 will use valgrind annotations to mark |
| certain memory regions as unaddressable. This allows it to detect |
| invalid memory accesses, and is mostly useful for debugging PCRE2 |
| itself. |
| |
| |
| CODE COVERAGE REPORTING |
| |
| If your C compiler is gcc, you can build a version of PCRE2 that can |
| generate a code coverage report for its test suite. To enable this, you |
| must install lcov version 1.6 or above. Then specify |
| |
| --enable-coverage |
| |
| to the configure command and build PCRE2 in the usual way. |
| |
| Note that using ccache (a caching C compiler) is incompatible with code |
| coverage reporting. If you have configured ccache to run automatically |
| on your system, you must set the environment variable |
| |
| CCACHE_DISABLE=1 |
| |
| before running make to build PCRE2, so that ccache is not used. |
| |
| When --enable-coverage is used, the following additional targets are |
| added to the Makefile: |
| |
| make coverage |
| |
| This creates a fresh coverage report for the PCRE2 test suite. It is |
| equivalent to running "make coverage-reset", "make coverage-baseline", |
| "make check", and then "make coverage-report". |
| |
| make coverage-reset |
| |
| This zeroes the coverage counters, but does nothing else. |
| |
| make coverage-baseline |
| |
| This captures baseline coverage information. |
| |
| make coverage-report |
| |
| This creates the coverage report. |
| |
| make coverage-clean-report |
| |
| This removes the generated coverage report without cleaning the cover- |
| age data itself. |
| |
| make coverage-clean-data |
| |
| This removes the captured coverage data without removing the coverage |
| files created at compile time (*.gcno). |
| |
| make coverage-clean |
| |
| This cleans all coverage data including the generated coverage report. |
| For more information about code coverage, see the gcov and lcov docu- |
| mentation. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3), pcre2-config(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SYNOPSIS |
| |
| #include <pcre2.h> |
| |
| int (*pcre2_callout)(pcre2_callout_block *, void *); |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| |
| DESCRIPTION |
| |
| PCRE2 provides a feature called "callout", which is a means of tempo- |
| rarily passing control to the caller of PCRE2 in the middle of pattern |
| matching. The caller of PCRE2 provides an external function by putting |
| its entry point in a match context (see pcre2_set_callout() in the |
| pcre2api documentation). |
| |
| Within a regular expression, (?C<arg>) indicates a point at which the |
| external function is to be called. Different callout points can be |
| identified by putting a number less than 256 after the letter C. The |
| default value is zero. Alternatively, the argument may be a delimited |
| string. The starting delimiter must be one of ` ' " ^ % # $ { and the |
| ending delimiter is the same as the start, except for {, where the end- |
| ing delimiter is }. If the ending delimiter is needed within the |
| string, it must be doubled. For example, this pattern has two callout |
| points: |
| |
| (?C1)abc(?C"some ""arbitrary"" text")def |
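| |
| The quoting rule for string arguments can be sketched as a small |
| decoder. This is an illustration of the rule as described above, not |
| the code PCRE2 uses; decode_callout_arg is a hypothetical name, and |
| plain char strings (8-bit code units) are assumed. |

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Decodes a delimited callout
   argument (starting at its opening delimiter) into out, turning each
   doubled ending delimiter into a single one. Returns the decoded length,
   or -1 if the argument is malformed or does not fit in out. */
long decode_callout_arg(const char *arg, char *out, size_t outsize)
{
    static const char starters[] = "`'\"^%#${";
    if (arg[0] == '\0' || strchr(starters, arg[0]) == NULL) return -1;

    char end = (arg[0] == '{') ? '}' : arg[0];
    size_t n = 0;

    for (const char *p = arg + 1; *p != '\0'; p++) {
        if (*p == end) {
            if (p[1] != end) {          /* lone ending delimiter: done */
                if (n >= outsize) return -1;
                out[n] = '\0';
                return (long)n;
            }
            p++;                        /* doubled: keep one copy */
        }
        if (n + 1 >= outsize) return -1;
        out[n++] = *p;
    }
    return -1;                          /* no ending delimiter found */
}
```

| Applied to the string argument in the pattern above, the decoder yields |
| the text with single quotation marks around the word "arbitrary". |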
| |
| If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, |
| PCRE2 automatically inserts callouts, all with number 255, before each |
| item in the pattern. For example, if PCRE2_AUTO_CALLOUT is used with |
| the pattern |
| |
| A(\d{2}|--) |
| |
| it is processed as if it were |
| |
| (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) |
| |
| Notice that there is a callout before and after each parenthesis and |
| alternation bar. If the pattern contains a conditional group whose con- |
| dition is an assertion, an automatic callout is inserted immediately |
| before the condition. Such a callout may also be inserted explicitly, |
| for example: |
| |
| (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de) |
| |
| This applies only to assertion conditions (because they are themselves |
| independent groups). |
| |
| Callouts can be useful for tracking the progress of pattern matching. |
| The pcre2test program has a pattern qualifier (/auto_callout) that sets |
| automatic callouts. When any callouts are present, the output from |
| pcre2test indicates how the pattern is being matched. This is useful |
| information when you are trying to optimize the performance of a par- |
| ticular pattern. |
| |
| |
| MISSING CALLOUTS |
| |
| You should be aware that, because of optimizations in the way PCRE2 |
| compiles and matches patterns, callouts sometimes do not happen exactly |
| as you might expect. |
| |
| Auto-possessification |
| |
| At compile time, PCRE2 "auto-possessifies" repeated items when it knows |
| that what follows cannot be part of the repeat. For example, a+[bc] is |
| compiled as if it were a++[bc]. The pcre2test output when this pattern |
| is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied |
| to the string "aaaa" is: |
| |
| --->aaaa |
| +0 ^ a+ |
| +2 ^ ^ [bc] |
| No match |
| |
| This indicates that when matching [bc] fails, there is no backtracking |
| into a+ and therefore the callouts that would be taken for the back- |
| tracks do not occur. You can disable the auto-possessify feature by |
| passing PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat- |
| tern with (*NO_AUTO_POSSESS). In this case, the output changes to this: |
| |
| --->aaaa |
| +0 ^ a+ |
| +2 ^ ^ [bc] |
| +2 ^ ^ [bc] |
| +2 ^ ^ [bc] |
| +2 ^^ [bc] |
| No match |
| |
| This time, when matching [bc] fails, the matcher backtracks into a+ and |
| tries again, repeatedly, until a+ itself fails. |
| |
| Automatic .* anchoring |
| |
| By default, an optimization is applied when .* is the first significant |
| item in a pattern. If PCRE2_DOTALL is set, so that the dot can match |
| any character, the pattern is automatically anchored. If PCRE2_DOTALL |
| is not set, a match can start only after an internal newline or at the |
| beginning of the subject, and pcre2_compile() remembers this. This |
| optimization is disabled, however, if .* is in an atomic group or if |
| there is a back reference to the capturing group in which it appears. |
| It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How- |
| ever, the presence of callouts does not affect it. |
| |
| For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT |
| and applied to the string "aa", the pcre2test output is: |
| |
| --->aa |
| +0 ^ .* |
| +2 ^ ^ \d |
| +2 ^^ \d |
| +2 ^ \d |
| No match |
| |
| This shows that all match attempts start at the beginning of the sub- |
| ject. In other words, the pattern is anchored. You can disable this |
| optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or |
| starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out- |
| put changes to: |
| |
| --->aa |
| +0 ^ .* |
| +2 ^ ^ \d |
| +2 ^^ \d |
| +2 ^ \d |
| +0 ^ .* |
| +2 ^^ \d |
| +2 ^ \d |
| No match |
| |
| This shows more match attempts, starting at the second subject charac- |
| ter. Another optimization, described in the next section, means that |
| there is no subsequent attempt to match with an empty subject. |
| |
| If a pattern has more than one top-level branch, automatic anchoring |
| occurs if all branches are anchorable. |
| |
| Other optimizations |
| |
| Other optimizations that provide fast "no match" results also affect |
| callouts. For example, if the pattern is |
| |
| ab(?C4)cd |
| |
| PCRE2 knows that any matching string must contain the letter "d". If |
| the subject string is "abyz", the lack of "d" means that matching |
| doesn't ever start, and the callout is never reached. However, with |
| "abyd", though the result is still no match, the callout is obeyed. |
| |
| PCRE2 also knows the minimum length of a matching string, and will |
| immediately give a "no match" return without actually running a match |
| if the subject is not long enough, or, for unanchored patterns, if it |
| has been scanned far enough. |
| |
| You can disable these optimizations by passing the PCRE2_NO_START_OPTI- |
| MIZE option to pcre2_compile(), or by starting the pattern with |
| (*NO_START_OPT). This slows down the matching process, but does ensure |
| that callouts such as the example above are obeyed. |
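| |
| The effect of these pre-checks on the ab(?C4)cd example can be pictured |
| with a toy model. It is deliberately simplified and is not how PCRE2 |
| implements the optimizations; worth_attempting and its parameters are |
| hypothetical. |

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the start-of-match pre-checks, not PCRE2's implementation.
   required is a code unit that every match must contain; min_length is
   the length of the shortest possible match. Returns 1 if a match attempt
   (and therefore any callout) could take place, or 0 if "no match" can be
   reported without running the matcher. */
int worth_attempting(const char *subject, char required, size_t min_length)
{
    if (strlen(subject) < min_length) return 0;       /* too short */
    if (strchr(subject, required) == NULL) return 0;  /* required unit absent */
    return 1;
}
```

| With required set to 'd' and min_length set to 4, the subject "abyz" |
| is rejected before matching starts, whereas "abyd" passes the checks, |
| so the matcher runs and the callout is reached. |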
| |
| |
| THE CALLOUT INTERFACE |
| |
| During matching, when PCRE2 reaches a callout point, if an external |
| function is set in the match context, it is called. This applies to |
| both normal and DFA matching. The first argument to the callout func- |
| tion is a pointer to a pcre2_callout block. The second argument is the |
| void * callout data that was supplied when the callout was set up by |
| calling pcre2_set_callout() (see the pcre2api documentation). The call- |
| out block structure contains the following fields: |
| |
| uint32_t version; |
| uint32_t callout_number; |
| uint32_t capture_top; |
| uint32_t capture_last; |
| PCRE2_SIZE *offset_vector; |
| PCRE2_SPTR mark; |
| PCRE2_SPTR subject; |
| PCRE2_SIZE subject_length; |
| PCRE2_SIZE start_match; |
| PCRE2_SIZE current_position; |
| PCRE2_SIZE pattern_position; |
| PCRE2_SIZE next_item_length; |
| PCRE2_SIZE callout_string_offset; |
| PCRE2_SIZE callout_string_length; |
| PCRE2_SPTR callout_string; |
| |
| The version field contains the version number of the block format. The |
| current version is 1; the three callout string fields were added for |
| this version. If you are writing an application that might use an ear- |
| lier release of PCRE2, you should check the version number before |
| accessing any of these fields. The version number will increase in |
| future if more fields are added, but the intention is never to remove |
| any of the existing fields. |
| |
| Fields for numerical callouts |
| |
| For a numerical callout, callout_string is NULL, and callout_number |
| contains the number of the callout, in the range 0-255. This is the |
| number that follows (?C for manual callouts; it is 255 for automati- |
| cally generated callouts. |
| |
| Fields for string callouts |
| |
| For callouts with string arguments, callout_number is always zero, and |
| callout_string points to the string that is contained within the com- |
| piled pattern. Its length is given by callout_string_length. Duplicated |
| ending delimiters that were present in the original pattern string have |
| been turned into single characters, but there is no other processing of |
| the callout string argument. An additional code unit containing binary |
| zero is present after the string, but is not included in the length. |
| The delimiter that was used to start the string is also stored within |
| the pattern, immediately before the string itself. You can access this |
| delimiter as callout_string[-1] if you need it. |
| |
| The callout_string_offset field is the code unit offset to the start of |
| the callout argument string within the original pattern string. This is |
| provided for the benefit of applications such as script languages that |
| might need to report errors in the callout string within the pattern. |
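| |
| A callout function might tell the two kinds of callout apart along the |
| following lines. To keep the sketch free of any dependency on pcre2.h, |
| it uses a stand-in structure holding only the relevant fields; the |
| structure and function names are hypothetical, and 8-bit code units |
| are assumed. |

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the relevant fields of the callout block. The field names
   follow the list above; the type name is an assumption of this sketch. */
typedef struct {
    uint32_t    callout_number;
    size_t      callout_string_length;
    const char *callout_string;      /* NULL for a numerical callout */
} callout_fields;

/* Hypothetical helper: writes a one-line description into buf and returns
   the value returned by snprintf(). */
int describe_callout(const callout_fields *cb, char *buf, size_t bufsize)
{
    if (cb->callout_string == NULL)
        return snprintf(buf, bufsize, "callout %u",
                        (unsigned)cb->callout_number);

    /* The starting delimiter is stored just before the string itself. */
    char delim = cb->callout_string[-1];
    return snprintf(buf, bufsize, "callout %c%.*s", delim,
                    (int)cb->callout_string_length, cb->callout_string);
}
```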
| |
| Fields for all callouts |
| |
| The remaining fields in the callout block are the same for both kinds |
| of callout. |
| |
| The offset_vector field is a pointer to the vector of capturing offsets |
| (the "ovector") that was passed to the matching function in the match |
| data block. When pcre2_match() is used, the contents can be inspected |
| in order to extract substrings that have been matched so far, in the |
| same way as for extracting substrings after a match has completed. For |
| the DFA matching function, this field is not useful. |
| |
| The subject and subject_length fields contain copies of the values that |
| were passed to the matching function. |
| |
| The start_match field normally contains the offset within the subject |
| at which the current match attempt started. However, if the escape |
| sequence \K has been encountered, this value is changed to reflect the |
| modified starting point. If the pattern is not anchored, the callout |
| function may be called several times from the same point in the pattern |
| for different starting points in the subject. |
| |
| The current_position field contains the offset within the subject of |
| the current match pointer. |
| |
| When pcre2_match() is used, the capture_top field contains one more |
| than the number of the highest numbered captured substring so far. If |
| no substrings have been captured, the value of capture_top is one. This |
| is always the case when the DFA functions are used, because they do not |
| support captured substrings. |
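| |
| As a sketch of how offset_vector and capture_top might be used together |
| in a callout from pcre2_match(), the following helper copies one |
| captured substring. It assumes the usual ovector layout, in which group |
| n occupies elements 2n and 2n+1. size_t stands in for PCRE2_SIZE, |
| OFFSET_UNSET stands in for PCRE2_UNSET, and the function name is |
| hypothetical. |

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET: the all-bits-set value found in the ovector
   for a group that has not captured anything. */
#define OFFSET_UNSET (~(size_t)0)

/* Hypothetical helper. subject, ovector, and capture_top correspond to the
   subject, offset_vector, and capture_top fields of the callout block.
   Copies captured substring n into out and returns its length, or -1 if
   group n is not set or does not fit in out. */
long copy_capture(const char *subject, const size_t *ovector,
                  unsigned capture_top, unsigned n,
                  char *out, size_t outsize)
{
    if (n == 0 || n >= capture_top) return -1;   /* nothing captured yet */
    size_t start = ovector[2 * n];
    size_t end   = ovector[2 * n + 1];
    if (start == OFFSET_UNSET || end == OFFSET_UNSET || end < start)
        return -1;
    size_t len = end - start;
    if (len + 1 > outsize) return -1;
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```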
| |
| The capture_last field contains the number of the most recently cap- |
| tured substring. However, when a recursion exits, the value reverts to |
| what it was outside the recursion, as do the values of all captured |
| substrings. If no substrings have been captured, the value of cap- |
| ture_last is 0. This is always the case for the DFA matching functions. |
| |
| The pattern_position field contains the offset in the pattern string to |
| the next item to be matched. |
| |
| The next_item_length field contains the length of the next item to be |
| matched in the pattern string. When the callout immediately precedes an |
| alternation bar, a closing parenthesis, or the end of the pattern, the |
| length is zero. When the callout precedes an opening parenthesis, the |
| length is that of the entire subpattern. |
| |
| The pattern_position and next_item_length fields are intended to help |
| in distinguishing between different automatic callouts, which all have |
| the same callout number. However, they are set for all callouts, and |
| are used by pcre2test to show the next item to be matched when display- |
| ing callout information. |
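| |
| An application that has kept its own copy of the pattern string could |
| use these two fields to recover the text of the next item, roughly as |
| pcre2test does when it displays callouts. The helper below is a minimal |
| sketch with a hypothetical name, assuming 8-bit code units. |

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Copies the pattern item that a callout precedes
   into out, given the pattern_position and next_item_length values from
   the callout block. Returns the item length, or -1 if the range lies
   outside the pattern or does not fit in out. */
long copy_next_item(const char *pattern, size_t pattern_position,
                    size_t next_item_length, char *out, size_t outsize)
{
    size_t plen = strlen(pattern);
    if (pattern_position > plen ||
        next_item_length > plen - pattern_position)
        return -1;
    if (next_item_length + 1 > outsize) return -1;
    memcpy(out, pattern + pattern_position, next_item_length);
    out[next_item_length] = '\0';
    return (long)next_item_length;
}
```

| A zero length, as reported before an alternation bar, a closing |
| parenthesis, or the end of the pattern, produces an empty string. |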
| |
| In callouts from pcre2_match() the mark field contains a pointer to the |
| zero-terminated name of the most recently passed (*MARK), (*PRUNE), or |
| (*THEN) item in the match, or NULL if no such items have been passed. |
| Instances of (*PRUNE) or (*THEN) without a name do not obliterate a |
| previous (*MARK). In callouts from the DFA matching function this field |
| always contains NULL. |
| |
| |
| RETURN VALUES FROM CALLOUTS |
| |
| The external callout function returns an integer to PCRE2. If the value |
| is zero, matching proceeds as normal. If the value is greater than |
| zero, matching fails at the current point, but the testing of other |
| matching possibilities goes ahead, just as if a lookahead assertion had |
| failed. If the value is less than zero, the match is abandoned, and the |
| matching function returns the negative value. |
| |
| Negative values should normally be chosen from the set of |
| PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a |
| standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is |
| reserved for use by callout functions; it will never be used by PCRE2 |
| itself. |
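| |
| The three kinds of return value can be illustrated with a small |
| decision function that a callout might delegate to. Everything named |
| here is hypothetical. STANDIN_ERROR_CALLOUT is a placeholder whose value |
| is assumed; real code would return PCRE2_ERROR_CALLOUT from pcre2.h. |

```c
#include <stddef.h>

/* Placeholder for PCRE2_ERROR_CALLOUT; the numeric value is an assumption
   and any negative value would serve for this sketch. */
#define STANDIN_ERROR_CALLOUT (-37)

/* Hypothetical per-match state, passed to the callout as its user data. */
typedef struct {
    unsigned long calls;         /* callouts seen so far */
    unsigned long budget;        /* give up after this many callouts */
    size_t        max_position;  /* reject paths that go beyond this offset */
} callout_policy;

/* Returns the value that the callout function would hand back to PCRE2. */
int callout_verdict(callout_policy *p, size_t current_position)
{
    if (++p->calls > p->budget)
        return STANDIN_ERROR_CALLOUT;   /* negative: abandon the match */
    if (current_position > p->max_position)
        return 1;                       /* positive: fail here, backtrack */
    return 0;                           /* zero: carry on as normal */
}
```

| A callout function would call this with its user data pointer and the |
| current_position field, and return the result unchanged. |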
| |
| |
| CALLOUT ENUMERATION |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| A script language that supports the use of string arguments in callouts |
| might like to scan all the callouts in a pattern before running the |
| match. This can be done by calling pcre2_callout_enumerate(). The first |
| argument is a pointer to a compiled pattern, the second points to a |
| callback function, and the third is arbitrary user data. The callback |
| function is called for every callout in the pattern in the order in |
| which they appear. Its first argument is a pointer to a callout enumer- |
| ation block, and its second argument is the user_data value that was |
| passed to pcre2_callout_enumerate(). The data block contains the fol- |
| lowing fields: |
| |
| version Block version number |
| pattern_position Offset to next item in pattern |
| next_item_length Length of next item in pattern |
| callout_number Number for numbered callouts |
| callout_string_offset Offset to string within pattern |
| callout_string_length Length of callout string |
| callout_string Points to callout string or is NULL |
| |
| The version number is currently 0. It will increase if new fields are |
| ever added to the block. The remaining fields are the same as their |
| namesakes in the pcre2_callout block that is used for callouts during |
| matching, as described above. |
| |
| Note that the value of pattern_position is unique for each callout. |
| However, if a callout occurs inside a group that is quantified with a |
| non-zero minimum or a fixed maximum, the group is replicated inside the |
| compiled pattern. For example, a pattern such as /(a){2}/ is compiled |
| as if it were /(a)(a)/. This means that the callout will be enumerated |
| more than once, but with the same value for pattern_position in each |
| case. |
| |
| The callback function should normally return zero. If it returns a non- |
| zero value, scanning the pattern stops, and that value is returned from |
| pcre2_callout_enumerate(). |
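| |
| Because a replicated group causes the same pattern_position to be |
| reported more than once, a callback that wants to handle each callout |
| in the pattern text only once has to remember what it has seen. The |
| following is one possible sketch; the type, the function, and the fixed |
| table size are all hypothetical. |

```c
#include <stddef.h>

#define MAX_SEEN 64   /* arbitrary capacity for this sketch */

typedef struct {
    size_t positions[MAX_SEEN];
    size_t count;
} seen_positions;

/* Hypothetical helper for an enumeration callback. Records
   pattern_position and returns 1 the first time it is seen, 0 for a
   repeat caused by a replicated group, or -1 if the table is full. */
int note_position(seen_positions *seen, size_t pattern_position)
{
    for (size_t i = 0; i < seen->count; i++)
        if (seen->positions[i] == pattern_position) return 0;
    if (seen->count == MAX_SEEN) return -1;
    seen->positions[seen->count++] = pattern_position;
    return 1;
}
```

| A callback could return a non-zero value when the helper reports -1, |
| which stops the scan as described above. |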
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 23 March 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| DIFFERENCES BETWEEN PCRE2 AND PERL |
| |
| This document describes the differences in the ways that PCRE2 and Perl |
| handle regular expressions. The differences described here are with |
| respect to Perl versions 5.10 and above. |
| |
| 1. PCRE2 has only a subset of Perl's Unicode support. Details of what |
| it does have are given in the pcre2unicode page. |
| |
| 2. PCRE2 allows repeat quantifiers only on parenthesized assertions, |
| but they do not mean what you might think. For example, (?!a){3} does |
| not assert that the next three characters are not "a". It just asserts |
| that the next character is not "a" three times (in principle: PCRE2 |
| optimizes this to run the assertion just once). Perl allows repeat |
| quantifiers on other assertions such as \b, but these do not seem to |
| have any use. |
| |
| 3. Capturing subpatterns that occur inside negative lookahead asser- |
| tions are counted, but their entries in the offsets vector are never |
| set. Perl sometimes (but not always) sets its numerical variables from |
| inside negative assertions. |
| |
| 4. The following Perl escape sequences are not supported: \l, \u, \L, |
| \U, and \N when followed by a character name or Unicode value. (\N on |
| its own, matching a non-newline character, is supported.) In fact these |
| are implemented by Perl's general string-handling and are not part of |
| its pattern matching engine. If any of these are encountered by PCRE2, |
| an error is generated by default. However, if the PCRE2_ALT_BSUX option |
| is set, \U and \u are interpreted as ECMAScript interprets them. |
| |
| 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 |
| is built with Unicode support. The properties that can be tested with |
| \p and \P are limited to the general category properties such as Lu and |
| Nd, script names such as Greek or Han, and the derived properties Any |
| and L&. PCRE2 does support the Cs (surrogate) property, which Perl does |
| not; the Perl documentation says "Because Perl hides the need for the |
| user to understand the internal representation of Unicode characters, |
| there is no need to implement the somewhat messy concept of surro- |
| gates." |
| |
| 6. PCRE2 does support the \Q...\E escape for quoting substrings. Char- |
| acters in between are treated as literals. This is slightly different |
| from Perl in that $ and @ are also handled as literals inside the |
| quotes. In Perl, they cause variable interpolation (but of course PCRE2 |
| does not have variables). Note the following examples: |
| |
| Pattern PCRE2 matches Perl matches |
| |
| \Qabc$xyz\E abc$xyz abc followed by the |
| contents of $xyz |
| \Qabc\$xyz\E abc\$xyz abc\$xyz |
| \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| |
| The \Q...\E sequence is recognized both inside and outside character |
| classes. |
| |
| 7. Fairly obviously, PCRE2 does not support the (?{code}) and |
| (??{code}) constructions. However, there is support for recursive pat- |
| terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also, |
| the PCRE2 "callout" feature allows an external function to be called |
| during pattern matching. See the pcre2callout documentation for |
| details. |
| |
| 8. Subroutine calls (whether recursive or not) are treated as atomic |
| groups. Atomic recursion is like Python, but unlike Perl. Captured |
| values that are set outside a subroutine call can be referenced from |
| inside in PCRE2, but not in Perl. There is a discussion that explains |
| these differences in more detail in the section on recursion differ- |
| ences from Perl in the pcre2pattern page. |
| |
| 9. If any of the backtracking control verbs are used in a subpattern |
| that is called as a subroutine (whether or not recursively), their |
| effect is confined to that subpattern; it does not extend to the sur- |
| rounding pattern. This is not always the case in Perl. In particular, |
| if (*THEN) is present in a group that is called as a subroutine, its |
| action is limited to that group, even if the group does not contain any |
| | characters. Note that such subpatterns are processed as anchored at |
| the point where they are tested. |
| |
| 10. If a pattern contains more than one backtracking control verb, the |
| first one that is backtracked onto acts. For example, in the pattern |
| A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure |
| in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases |
| it is the same as PCRE2, but there are examples where it differs. |
| |
| 11. Most backtracking verbs in assertions have their normal actions. |
| They are not confined to the assertion. |
| |
| 12. There are some differences that are concerned with the settings of |
| captured strings when part of a pattern is repeated. For example, |
| matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
| unset, but in PCRE2 it is set to "b". |
| |
| 13. PCRE2's handling of duplicate subpattern numbers and duplicate sub- |
| pattern names is not as general as Perl's. This is a consequence of the |
| fact that PCRE2 works internally just with numbers, using an external |
| table to translate between numbers and names. In particular, a pattern |
| such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have |
| the same number but different names, is not supported, and causes an |
| error at compile time. If it were allowed, it would not be possible to |
| distinguish which parentheses matched, because both names map to cap- |
| turing subpattern number 1. To avoid this confusing situation, an error |
| is given at compile time. |
| |
| 14. Perl recognizes comments in some places that PCRE2 does not, for |
| example, between the ( and ? at the start of a subpattern. If the /x |
| modifier is set, Perl allows white space between ( and ? (though cur- |
| rent Perls warn that this is deprecated) but PCRE2 never does, even if |
| the PCRE2_EXTENDED option is set. |
| |
| 15. Perl, when in warning mode, gives warnings for character classes |
| such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- |
| als. PCRE2 has no warning features, so it gives an error in these cases |
| because they are almost certainly user mistakes. |
| |
| 16. In PCRE2, the upper/lower case character properties Lu and Ll are |
| not affected when case-independent matching is specified. For example, |
| \p{Lu} always matches an upper case letter. I think Perl has changed in |
| this respect; in the release at the time of writing (5.16), \p{Lu} and |
| \p{Ll} match all letters, regardless of case, when case independence is |
| specified. |
| |
| 17. PCRE2 provides some extensions to the Perl regular expression |
| facilities. Perl 5.10 includes new features that are not in earlier |
| versions of Perl, some of which (such as named parentheses) have been |
| in PCRE2 for some time. This list is with respect to Perl 5.10: |
| |
| (a) Although lookbehind assertions in PCRE2 must match fixed length |
| strings, each alternative branch of a lookbehind assertion can match a |
| different length of string. Perl requires them all to have the same |
| length. |
| |
| (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the |
| $ meta-character matches only at the very end of the string. |
| |
| (c) A backslash followed by a letter with no special meaning is |
| faulted. (Perl can be made to issue a warning.) |
| |
| (d) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti- |
| fiers is inverted, that is, by default they are not greedy, but if fol- |
| lowed by a question mark they are. |
| |
| (e) PCRE2_ANCHORED can be used at matching time to force a pattern to |
| be tried only at the first matching position in the subject string. |
| |
| (f) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, |
| PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no Perl |
| equivalents. |
| |
| (g) The \R escape sequence can be restricted to match only CR, LF, or |
| CRLF by the PCRE2_BSR_ANYCRLF option. |
| |
| (h) The callout facility is PCRE2-specific. |
| |
| (i) The partial matching facility is PCRE2-specific. |
| |
| (j) The alternative matching function (pcre2_dfa_match()) matches in a |
| different way and is not Perl-compatible. |
| |
| (k) PCRE2 recognizes some special sequences such as (*CR) at the start |
| of a pattern that set overall options that cannot be changed within the |
| pattern. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 15 March 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2JIT(3) Library Functions Manual PCRE2JIT(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 JUST-IN-TIME COMPILER SUPPORT |
| |
| Just-in-time compiling is a heavyweight optimization that can greatly |
| speed up pattern matching. However, it comes at the cost of extra pro- |
| cessing before the match is performed, so it is of most benefit when |
| the same pattern is going to be matched many times. This does not nec- |
| essarily mean many calls of a matching function; if the pattern is not |
| anchored, matching attempts may take place many times at various posi- |
| tions in the subject, even for a single call. Therefore, if the subject |
| string is very long, it may still pay to use JIT even for one-off |
| matches. JIT support is available for all of the 8-bit, 16-bit and |
| 32-bit PCRE2 libraries. |
| |
| JIT support applies only to the traditional Perl-compatible matching |
| function. It does not apply when the DFA matching function is being |
| used. The code for this support was written by Zoltan Herczeg. |
| |
| |
| AVAILABILITY OF JIT SUPPORT |
| |
| JIT support is an optional feature of PCRE2. The "configure" option |
| --enable-jit (or equivalent CMake option) must be set when PCRE2 is |
| built if you want to use JIT. The support is limited to the following |
| hardware platforms: |
| |
| ARM 32-bit (v5, v7, and Thumb2) |
| ARM 64-bit |
| Intel x86 32-bit and 64-bit |
| MIPS 32-bit and 64-bit |
| Power PC 32-bit and 64-bit |
| SPARC 32-bit |
| |
| If --enable-jit is set on an unsupported platform, compilation fails. |
| |
| A program can tell if JIT support is available by calling pcre2_con- |
| fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is |
| available, and 0 otherwise. However, a simple program does not need to |
| check this in order to use JIT. The API is implemented in a way that |
| falls back to the interpretive code if JIT is not available. For pro- |
| grams that need the best possible performance, there is also a "fast |
| path" API that is JIT-specific. |
| |
| |
| SIMPLE USE OF JIT |
| |
| To make use of the JIT support in the simplest way, all you have to do |
| is to call pcre2_jit_compile() after successfully compiling a pattern |
| with pcre2_compile(). This function has two arguments: the first is the |
| compiled pattern pointer that was returned by pcre2_compile(), and the |
| second is zero or more of the following option bits: PCRE2_JIT_COM- |
| PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. |
| |
| If JIT support is not available, a call to pcre2_jit_compile() does |
| nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled |
| pattern is passed to the JIT compiler, which turns it into machine code |
| that executes much faster than the normal interpretive code, but yields |
| exactly the same results. The returned value from pcre2_jit_compile() |
| is zero on success, or a negative error code. |
| |
| There is a limit to the size of pattern that JIT supports, imposed by |
| the size of machine stack that it uses. The exact rules are not docu- |
| mented because they may change at any time, in particular, when new |
| optimizations are introduced. If a pattern is too big, a call to |
| pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY. |
| |
| PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com- |
| plete matches. If you want to run partial matches using the PCRE2_PAR- |
| TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should |
| set one or both of the other options as well as, or instead of, |
| PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code |
| for each of the three modes (normal, soft partial, hard partial). When |
| pcre2_match() is called, the appropriate code is run if it is avail- |
| able. Otherwise, the pattern is matched using interpretive code. |
| |
| You can call pcre2_jit_compile() multiple times for the same compiled |
| pattern. It does nothing if it has previously compiled code for any of |
| the option bits. For example, you can call it once with PCRE2_JIT_COM- |
| PLETE and (perhaps later, when you find you need partial matching) |
| again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it |
| will ignore PCRE2_JIT_COMPLETE and just compile code for partial match- |
| ing. If pcre2_jit_compile() is called with no option bits set, it imme- |
| diately returns zero. This is an alternative way of testing whether JIT |
| is available. |
| |
| At present, it is not possible to free JIT compiled code except when |
| the entire compiled pattern is freed by calling pcre2_code_free(). |
| |
| In some circumstances you may need to call additional functions. These |
| are described in the section entitled "Controlling the JIT stack" |
| below. |
| |
| There are some pcre2_match() options that are not supported by JIT, and |
| there are also some pattern items that JIT cannot handle. Details are |
| given below. In both cases, matching automatically falls back to the |
| interpretive code. If you want to know whether JIT was actually used |
| for a particular match, you should arrange for a JIT callback function |
| to be set up as described in the section entitled "Controlling the JIT |
| stack" below, even if you do not need to supply a non-default JIT |
| stack. Such a callback function is called whenever JIT code is about to |
| be obeyed. If the match-time options are not right for JIT execution, |
| the callback function is not obeyed. |
| |
| If the JIT compiler finds an unsupported item, no JIT data is gener- |
| ated. You can find out if JIT matching is available after compiling a |
| pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE |
| option. A non-zero result means that JIT compilation was successful. A |
| result of 0 means that JIT support is not available, or the pattern was |
| not processed by pcre2_jit_compile(), or the JIT compiler was not able |
| to handle the pattern. |
| |
| |
| UNSUPPORTED OPTIONS AND PATTERN ITEMS |
| |
| The pcre2_match() options that are supported for JIT matching are |
| PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The |
| PCRE2_ANCHORED option is not supported at match time. |
| |
| The only unsupported pattern items are \C (match a single data unit) |
| when running in a UTF mode, and a callout immediately before an asser- |
| tion condition in a conditional group. |
| |
| |
| RETURN VALUES FROM JIT MATCHING |
| |
| When a pattern is matched using JIT matching, the return values are the |
| same as those given by the interpretive pcre2_match() code, with the |
| addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means |
| that the memory used for the JIT stack was insufficient. See "Control- |
| ling the JIT stack" below for a discussion of JIT stack usage. |
| |
| The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if |
| searching a very large pattern tree goes on for too long, as it is in |
| the same circumstance when JIT is not used, but the details of exactly |
| what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error |
| code is never returned when JIT matching is used. |
| |
| |
| CONTROLLING THE JIT STACK |
| |
| When the compiled JIT code runs, it needs a block of memory to use as a |
| stack. By default, it uses 32K on the machine stack. However, some |
| large or complicated patterns need more than this. The error |
| PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack. |
| Three functions are provided for managing blocks of memory for use as |
| JIT stacks. There is further discussion about the use of JIT stacks in |
| the section entitled "JIT stack FAQ" below. |
| |
| The pcre2_jit_stack_create() function creates a JIT stack. Its argu- |
| ments are a starting size, a maximum size, and a general context (for |
| memory allocation functions, or NULL for standard memory allocation). |
| It returns a pointer to an opaque structure of type pcre2_jit_stack, or |
| NULL if there is an error. The pcre2_jit_stack_free() function is used |
| to free a stack that is no longer needed. (For the technically minded: |
| the address space is allocated by mmap or VirtualAlloc.) |
| |
| JIT uses far less memory for recursion than the interpretive code, and |
| a maximum stack size of 512K to 1M should be more than enough for any |
| pattern. |
| |
| The pcre2_jit_stack_assign() function specifies which stack JIT code |
| should use. Its arguments are as follows: |
| |
| pcre2_match_context *mcontext |
| pcre2_jit_callback callback |
| void *data |
| |
| The first argument is a pointer to a match context. When this is subse- |
| quently passed to a matching function, its information determines which |
| JIT stack is used. There are three cases for the values of the other |
| two arguments: |
| |
| (1) If callback is NULL and data is NULL, an internal 32K block |
| on the machine stack is used. This is the default when a match |
| context is created. |
| |
| (2) If callback is NULL and data is not NULL, data must be |
| a pointer to a valid JIT stack, the result of calling |
| pcre2_jit_stack_create(). |
| |
| (3) If callback is not NULL, it must point to a function that is |
| called with data as an argument at the start of matching, in |
| order to set up a JIT stack. If the return from the callback |
| function is NULL, the internal 32K stack is used; otherwise the |
| return value must be a valid JIT stack, the result of calling |
| pcre2_jit_stack_create(). |
| |
| A callback function is obeyed whenever JIT code is about to be run; it |
| is not obeyed when pcre2_match() is called with options that are incom- |
| patible for JIT matching. A callback function can therefore be used to |
| determine whether a match operation was executed by JIT or by the |
| interpreter. |
| |
| You may safely use the same JIT stack for more than one pattern (either |
| by assigning directly or by callback), as long as the patterns are |
| matched sequentially in the same thread. Currently, the only way to set |
| up non-sequential matches in one thread is to use callouts: if a call- |
| out function starts another match, that match must use a different JIT |
| stack to the one used for currently suspended match(es). |
| |
| In a multithread application, if you do not specify a JIT stack, or if |
| you assign or pass back NULL from a callback, that is thread-safe, |
| because each thread has its own machine stack. However, if you assign |
| or pass back a non-NULL JIT stack, this must be a different stack for |
| each thread so that the application is thread-safe. |
| |
| Strictly speaking, even more is allowed. You can assign the same non- |
| NULL stack to a match context that is used by any number of patterns, |
| as long as they are not used for matching by multiple threads at the |
| same time. For example, you could use the same stack in all compiled |
| patterns, with a global mutex in the callback to wait until the stack |
| is available for use. However, this is an inefficient solution, and not |
| recommended. |
| |
| This is a suggestion for how a multithreaded program that needs to set |
| up non-default JIT stacks might operate: |
| |
| During thread initialization |
| thread_local_var = pcre2_jit_stack_create(...) |
| |
| During thread exit |
| pcre2_jit_stack_free(thread_local_var) |
| |
| Use a one-line callback function |
| return thread_local_var |
| |
| All the functions described in this section do nothing if JIT is not |
| available. |
| |
| |
| JIT STACK FAQ |
| |
| (1) Why do we need JIT stacks? |
| |
| PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack |
| where the local data of the current node is pushed before checking its |
| child nodes. Allocating real machine stack is difficult on some |
| platforms. For example, on PowerPC the stack chain needs to be updated |
| every time the stack is extended. Although this is possible, the |
| overhead of updating it reduces performance. So we do the recursion in |
| memory. |
| |
| (2) Why don't we simply allocate blocks of memory with malloc()? |
| |
| Modern operating systems have a nice feature: they can reserve an |
| address space instead of allocating memory. We can safely allocate |
| memory pages inside this address space, so the stack can grow without |
| moving memory data (this is important because of pointers). Thus we can |
| reserve 1M of address space, and use only a single memory page (usually |
| 4K) if that is enough. However, we can still grow up to 1M at any time |
| if needed. |
| |
| (3) Who "owns" a JIT stack? |
| |
| The owner of the stack is the user program, not the JIT-compiled |
| pattern or anything else. The user program must ensure that while a |
| stack is being used by pcre2_match() (that is, it is assigned to a |
| match context that is passed to a match currently running), that stack |
| is not used by any other thread (to avoid overwriting the same memory |
| area). The best practice for multithreaded programs is to allocate a |
| stack for each thread, and return this stack through the JIT callback |
| function. |
| |
| (4) When should a JIT stack be freed? |
| |
| You can free a JIT stack at any time, as long as it will not be used by |
| pcre2_match() again. When you assign the stack to a match context, only |
| a pointer is set. There is no reference counting or any other magic. |
| You can free compiled patterns, contexts, and stacks in any order, at |
| any time. Just do not call pcre2_match() with a match context pointing |
| to an already freed stack, as that will cause a segmentation fault. |
| (Also, do not free a stack currently used by pcre2_match() in another |
| thread.) You can also replace the stack in a context at any time when |
| it is not in use. You should free the previous stack before assigning a |
| replacement. |
| |
| (5) Should I allocate/free a stack every time before/after calling |
| pcre2_match()? |
| |
| No, because this is too costly in terms of resources. However, you |
| could implement some clever scheme that releases the stack if it has |
| not been used for, say, two minutes. The JIT callback can help to |
| achieve this without keeping a list of patterns. |
| |
| (6) OK, the stack is for long term memory allocation. But what happens |
| if a pattern causes stack overflow with a stack of 1M? Is that 1M kept |
| until the stack is freed? |
| |
| Especially on embedded systems, it might be a good idea to release |
| memory sometimes without freeing the stack. There is no API for this at |
| the moment. Probably a function call that returns the memory currently |
| allocated for any stack, and another that allows releasing memory |
| (shrinking the stack), would be a good idea if someone needs this. |
| |
| (7) This is too much of a headache. Isn't there any better solution for |
| JIT stack handling? |
| |
| No, thanks to Windows. If POSIX threads were used everywhere, we could |
| throw out this complicated API. |
| |
| |
| FREEING JIT SPECULATIVE MEMORY |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| The JIT executable allocator does not free memory as soon as it becomes |
| possible to do so. It expects new allocations, and keeps some free |
| memory around to improve allocation speed. However, in low memory |
| conditions, it might be better to free all possible memory. You can |
| cause this to happen by calling pcre2_jit_free_unused_memory(). Its |
| argument is a general context, for custom memory management, or NULL |
| for standard memory management. |
| |
| |
| EXAMPLE CODE |
| |
| This is a single-threaded example that specifies a JIT stack without |
| using a callback. A real program should include error checking after |
| all the function calls. |
| |
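| /* pattern, subject, and length are assumed to be set up elsewhere */ |
| int rc; |
| int errornumber; |
| PCRE2_SIZE erroffset; |
| pcre2_code *re; |
| pcre2_match_data *match_data; |
| pcre2_match_context *mcontext; |
| pcre2_jit_stack *jit_stack; |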
| |
| re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0, |
| &errornumber, &erroffset, NULL); |
| rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE); |
| mcontext = pcre2_match_context_create(NULL); |
| jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL); |
| pcre2_jit_stack_assign(mcontext, NULL, jit_stack); |
| match_data = pcre2_match_data_create(re, 10); |
| rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext); |
| /* Process result */ |
| |
| pcre2_code_free(re); |
| pcre2_match_data_free(match_data); |
| pcre2_match_context_free(mcontext); |
| pcre2_jit_stack_free(jit_stack); |
| |
| |
| JIT FAST PATH API |
| |
| Because the API described above falls back to interpreted matching when |
| JIT is not available, it is convenient for programs that are written |
| for general use in many environments. However, calling JIT via |
| pcre2_match() does have a performance impact. Programs that are written |
| for use where JIT is known to be available, and which need the best |
| possible performance, can instead use a "fast path" API to call JIT |
| matching directly instead of calling pcre2_match() (obviously only for |
| patterns that have been successfully processed by pcre2_jit_compile()). |
| |
| The fast path function is called pcre2_jit_match(), and it takes |
| exactly the same arguments as pcre2_match(). The return values are also |
| the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or |
| complete) is requested that was not compiled. Unsupported option bits |
| (for example, PCRE2_ANCHORED) are ignored. |
| |
| When you call pcre2_match(), as well as testing for invalid options, a |
| number of other sanity checks are performed on the arguments. For exam- |
| ple, if the subject pointer is NULL, an immediate error is given. Also, |
| unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for |
| validity. In the interests of speed, these checks do not happen on the |
| JIT fast path, and if invalid data is passed, the result is undefined. |
| |
| Bypassing the sanity checks and the pcre2_match() wrapping can give |
| speedups of more than 10%. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3) |
| |
| |
| AUTHOR |
| |
| Philip Hazel (FAQ by Zoltan Herczeg) |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 14 November 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SIZE AND OTHER LIMITATIONS |
| |
| There are some size limitations in PCRE2 but it is hoped that they will |
| never in practice be relevant. |
| |
| The maximum size of a compiled pattern is approximately 64K code units |
| for the 8-bit and 16-bit libraries if PCRE2 is compiled with the |
| default internal linkage size, which is 2 bytes for these libraries. If |
| you want to process regular expressions that are truly enormous, you |
| can compile PCRE2 with an internal linkage size of 3 or 4 (when build- |
| ing the 16-bit library, 3 is rounded up to 4). See the README file in |
| the source distribution and the pcre2build documentation for details. |
| In these cases the limit is substantially larger. However, the speed |
| of execution is slower. In the 32-bit library, the internal linkage |
| size is always 4. |
| |
| The maximum length of a source pattern string is essentially unlimited; |
| it is the largest number a PCRE2_SIZE variable can hold. However, the |
| program that calls pcre2_compile() can specify a smaller limit. |
| |
| The maximum length (in code units) of a subject string is one less than |
| the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an |
| unsigned integer type, usually defined as size_t. Its maximum value |
| (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero- |
| terminated strings and unset offsets. |
| |
| Note that when using the traditional matching function, PCRE2 uses |
| recursion to handle subpatterns and indefinite repetition. This means |
| that the available stack space may limit the size of a subject string |
| that can be processed by certain patterns. For a discussion of stack |
| issues, see the pcre2stack documentation. |
| |
| All values in repeating quantifiers must be less than 65536. |
| |
| The maximum length of a lookbehind assertion is 65535 characters. |
| |
| There is no limit to the number of parenthesized subpatterns, but there |
| can be no more than 65535 capturing subpatterns. There is, however, a |
| limit to the depth of nesting of parenthesized subpatterns of all |
| kinds. This is imposed in order to limit the amount of system stack |
| used at compile time. The limit can be specified when PCRE2 is built; |
| the default is 250. |
| |
| There is a limit to the number of forward references to subsequent sub- |
| patterns of around 200,000. Repeated forward references with fixed |
| upper limits, for example, (?2){0,100} when subpattern number 2 is to |
| the right, are included in the count. There is no limit to the number |
| of backward references. |
| |
| The maximum length of a name for a named subpattern is 32 code units, |
| and the maximum number of named subpatterns is 10000. |
| |
| The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or |
| (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and |
| 32-bit libraries. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 05 November 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 MATCHING ALGORITHMS |
| |
| This document describes the two different algorithms that are available |
| in PCRE2 for matching a compiled regular expression against a given |
| subject string. The "standard" algorithm is the one provided by the |
| pcre2_match() function. This works in the same way as Perl's matching |
| function, and provides a Perl-compatible matching operation. The |
| just-in-time (JIT) optimization that is described in the pcre2jit |
| documentation is compatible with this function. |
| |
| An alternative algorithm is provided by the pcre2_dfa_match() function; |
| it operates in a different way, and is not Perl-compatible. This alter- |
| native has advantages and disadvantages compared with the standard |
| algorithm, and these are described below. |
| |
| When there is only one possible way in which a given subject string can |
| match a pattern, the two algorithms give the same answer. A difference |
| arises, however, when there are multiple possibilities. For example, if |
| the pattern |
| |
| ^<.*> |
| |
| is matched against the string |
| |
| <something> <something else> <something further> |
| |
| there are three possible answers. The standard algorithm finds only one |
| of them, whereas the alternative algorithm finds all three. |
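| |
| To make the three answers concrete, here is a hand-written C sketch |
| for this one pattern only. It does not use PCRE2 and is not how either |
| algorithm is implemented; the function name all_match_ends() is |
| invented. It records the end offset of every string that ^<.*> could |
| match, longest first, assuming that dot does not match a newline. |
| |
```c
#include <stddef.h>

/* Illustration for the single pattern ^<.*> (not PCRE2 code).
   Stores the end offset of each possible match in ends[], longest
   first, writing at most max entries. Returns the total number of
   possible matches. */
size_t all_match_ends(const char *subject, size_t *ends, size_t max)
{
    size_t count = 0;
    size_t len = 0;

    if (subject[0] != '<')
        return 0;                    /* ^< must match at offset 0 */

    /* .* cannot cross a newline, so only the first line is scanned. */
    while (subject[len] != '\0' && subject[len] != '\n')
        len++;

    /* Every '>' after the opening '<' ends one possible match.
       Walking backwards reports the longer matches first. */
    for (size_t j = len; j-- > 1; ) {
        if (subject[j] == '>') {
            if (count < max)
                ends[count] = j + 1;
            count++;
        }
    }
    return count;
}
```
| |
| For the subject shown above this yields the end offsets 48, 28, and |
| 11. The standard algorithm with the greedy .* stops at the first of |
| these (the longest), and with an ungreedy .*? it would stop at the |
| last (the shortest). |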
| |
| |
| REGULAR EXPRESSIONS AS TREES |
| |
| The set of strings that are matched by a regular expression can be rep- |
| resented as a tree structure. An unlimited repetition in the pattern |
| makes the tree of infinite size, but it is still a tree. Matching the |
| pattern to a given subject string (from a given starting point) can be |
| thought of as a search of the tree. There are two ways to search a |
| tree: depth-first and breadth-first, and these correspond to the two |
| matching algorithms provided by PCRE2. |
| |
| |
| THE STANDARD MATCHING ALGORITHM |
| |
| In the terminology of Jeffrey Friedl's book "Mastering Regular Expres- |
| sions", the standard algorithm is an "NFA algorithm". It conducts a |
| depth-first search of the pattern tree. That is, it proceeds along a |
| single path through the tree, checking that the subject matches what is |
| required. When there is a mismatch, the algorithm tries any alterna- |
| tives at the current point, and if they all fail, it backs up to the |
| previous branch point in the tree, and tries the next alternative |
| branch at that level. This often involves backing up (moving to the |
| left) in the subject string as well. The order in which repetition |
| branches are tried is controlled by the greedy or ungreedy nature of |
| the quantifier. |
| |
| If a leaf node is reached, a matching string has been found, and at |
| that point the algorithm stops. Thus, if there is more than one possi- |
| ble match, this algorithm returns the first one that it finds. Whether |
| this is the shortest, the longest, or some intermediate length depends |
| on the way the greedy and ungreedy repetition quantifiers are specified |
| in the pattern. |
| |
| Because it ends up with a single path through the tree, it is rela- |
| tively straightforward for this algorithm to keep track of the sub- |
| strings that are matched by portions of the pattern in parentheses. |
| This provides support for capturing parentheses and back references. |
| |
| |
| THE ALTERNATIVE MATCHING ALGORITHM |
| |
| This algorithm conducts a breadth-first search of the tree. Starting |
| from the first matching point in the subject, it scans the subject |
| string from left to right, once, character by character, and as it does |
| this, it remembers all the paths through the tree that represent valid |
| matches. In Friedl's terminology, this is a kind of "DFA algorithm", |
| though it is not implemented as a traditional finite state machine (it |
| keeps multiple states active simultaneously). |
| |
| Although the general principle of this matching algorithm is that it |
| scans the subject string only once, without backtracking, there is one |
| exception: when a lookaround assertion is encountered, the characters |
| following or preceding the current point have to be independently |
| inspected. |
| |
| The scan continues until either the end of the subject is reached, or |
| there are no more unterminated paths. At this point, terminated paths |
| represent the different matching possibilities (if there are none, the |
| match has failed). Thus, if there is more than one possible match, |
| this algorithm finds all of them, and in particular, it finds the long- |
| est. The matches are returned in decreasing order of length. There is |
| an option to stop the algorithm after the first match (which is neces- |
| sarily the shortest) is found. |
| |
| Note that all the matches that are found start at the same point in the |
| subject. If the pattern |
| |
| cat(er(pillar)?)? |
| |
| is matched against the string "the caterpillar catchment", the result |
| is the three strings "caterpillar", "cater", and "cat" that start at |
| the fifth character of the subject. The algorithm does not automati- |
| cally move on to find matches that start at later positions. |
| |
| PCRE2's "auto-possessification" optimization usually applies to charac- |
| ter repeats at the end of a pattern (as well as internally). For exam- |
| ple, the pattern "a\d+" is compiled as if it were "a\d++" because there |
| is no point even considering the possibility of backtracking into the |
| repeated digits. For DFA matching, this means that only one possible |
| match is found. If you really do want multiple matches in such cases, |
| either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS- |
| SESS option when compiling. |
| |
| There are a number of features of PCRE2 regular expressions that are |
| not supported by the alternative matching algorithm. They are as fol- |
| lows: |
| |
| 1. Because the algorithm finds all possible matches, the greedy or |
| ungreedy nature of repetition quantifiers is not relevant (though it |
| may affect auto-possessification, as just described). During matching, |
| greedy and ungreedy quantifiers are treated in exactly the same way. |
| However, possessive quantifiers can make a difference when what follows |
| could also match what is quantified, for example in a pattern like |
| this: |
| |
| ^a++\w! |
| |
| This pattern matches "aaab!" but not "aaa!", which would be matched by |
| a non-possessive quantifier. Similarly, if an atomic group is present, |
| it is matched as if it were a standalone pattern at the current point, |
| and the longest match is then "locked in" for the rest of the overall |
| pattern. |
| |
| 2. When dealing with multiple paths through the tree simultaneously, it |
| is not straightforward to keep track of captured substrings for the |
| different matching possibilities, and PCRE2's implementation of this |
| algorithm does not attempt to do this. This means that no captured sub- |
| strings are available. |
| |
| 3. Because no substrings are captured, back references within the pat- |
| tern are not supported, and cause errors if encountered. |
| |
| 4. For the same reason, conditional expressions that use a backrefer- |
| ence as the condition or test for a specific group recursion are not |
| supported. |
| |
| 5. Because many paths through the tree may be active, the \K escape |
| sequence, which resets the start of the match when encountered (but may |
| be on some paths and not on others), is not supported. It causes an |
| error if encountered. |
| |
| 6. Callouts are supported, but the value of the capture_top field is |
| always 1, and the value of the capture_last field is always 0. |
| |
| 7. The \C escape sequence, which (in the standard algorithm) always |
| matches a single code unit, even in a UTF mode, is not supported in |
| these modes, because the alternative algorithm moves through the sub- |
| ject string one character (not code unit) at a time, for all active |
| paths through the tree. |
| |
| 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
| are not supported. (*FAIL) is supported, and behaves like a failing |
| negative assertion. |
| |
| |
| ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| |
| Using the alternative matching algorithm provides the following advan- |
| tages: |
| |
| 1. All possible matches (at a single point in the subject) are automat- |
| ically found, and in particular, the longest match is found. To find |
| more than one match using the standard algorithm, you have to do kludgy |
| things with callouts. |
| |
| 2. Because the alternative algorithm scans the subject string just |
| once, and never needs to backtrack (except for lookbehinds), it is pos- |
| sible to pass very long subject strings to the matching function in |
| several pieces, checking for partial matching each time. Although it is |
| also possible to do multi-segment matching using the standard algo- |
| rithm, by retaining partially matched substrings, it is more compli- |
| cated. The pcre2partial documentation gives details of partial matching |
| and discusses multi-segment matching. |
| |
| |
| DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| |
| The alternative algorithm suffers from a number of disadvantages: |
| |
| 1. It is substantially slower than the standard algorithm. This is |
| partly because it has to search for all possible matches, but is also |
| because it is less susceptible to optimization. |
| |
| 2. Capturing parentheses and back references are not supported. |
| |
| 3. Although atomic groups are supported, their use does not provide the |
| performance advantage that it does for the standard algorithm. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 29 September 2014 |
| Copyright (c) 1997-2014 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions |
| |
| PARTIAL MATCHING IN PCRE2 |
| |
| In normal use of PCRE2, if the subject string that is passed to a |
| matching function matches as far as it goes, but is too short to match |
| the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum- |
| stances where it might be helpful to distinguish this case from other |
| cases in which there is no match. |
| |
| Consider, for example, an application where a human is required to type |
| in data for a field with specific formatting requirements. An example |
| might be a date in the form ddmmmyy, defined by this pattern: |
| |
| ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$ |
| |
| If the application sees the user's keystrokes one by one, and can check |
| that what has been typed so far is potentially valid, it is able to |
| raise an error as soon as a mistake is made, by beeping and not |
| reflecting the character that has been typed, for example. This immedi- |
| ate feedback is likely to be a better user interface than a check that |
| is delayed until the entire string has been entered. Partial matching |
| can also be useful when the subject string is very long and is not all |
| available at once. |
| |
| PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and |
| PCRE2_PARTIAL_HARD options, which can be set when calling a matching |
| function. The difference between the two options is whether or not a |
| partial match is preferred to an alternative complete match, though the |
| details differ between the two types of matching function. If both |
| options are set, PCRE2_PARTIAL_HARD takes precedence. |
| |
| If you want to use partial matching with just-in-time optimized code, |
| you must call pcre2_jit_compile() with one or both of these options: |
| |
| PCRE2_JIT_PARTIAL_SOFT |
| PCRE2_JIT_PARTIAL_HARD |
| |
| PCRE2_JIT_COMPLETE should also be set if you are going to run non-par- |
| tial matches on the same pattern. If the appropriate JIT mode has not |
| been compiled, interpretive matching code is used. |
| |
| Setting a partial matching option disables two of PCRE2's standard |
| optimizations. PCRE2 remembers the last literal code unit in a pattern, |
| and abandons matching immediately if it is not present in the subject |
| string. This optimization cannot be used for a subject string that |
| might match only partially. PCRE2 also knows the minimum length of a |
| matching string, and does not bother to run the matching function on |
| shorter strings. This optimization is also disabled for partial match- |
| ing. |
| |
| |
| PARTIAL MATCHING USING pcre2_match() |
| |
| A partial match occurs during a call to pcre2_match() when the end of |
| the subject string is reached successfully, but matching cannot con- |
| tinue because more characters are needed. However, at least one charac- |
| ter in the subject must have been inspected. This character need not |
| form part of the final matched string; lookbehind assertions and the \K |
| escape sequence provide ways of inspecting characters before the start |
| of a matched string. The requirement for inspecting at least one char- |
| acter exists because an empty string can always be matched; without |
| such a restriction there would always be a partial match of an empty |
| string at the end of the subject. |
| |
| When a partial match is returned, the first two elements in the ovector |
| point to the portion of the subject that was matched, but the values in |
| the rest of the ovector are undefined. The appearance of \K in the pat- |
| tern has no effect for a partial match. Consider this pattern: |
| |
| /abc\K123/ |
| |
| If it is matched against "456abc123xyz" the result is a complete match, |
| and the ovector defines the matched string as "123", because \K resets |
| the "start of match" point. However, if a partial match is requested |
| and the subject string is "456abc12", a partial match is found for the |
| string "abc12", because all these characters are needed for a subse- |
| quent re-match with additional characters. |
| |
| What happens when a partial match is identified depends on which of the |
| two partial matching options is set. |
| |
| PCRE2_PARTIAL_SOFT WITH pcre2_match() |
| |
| If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial |
| match, the partial match is remembered, but matching continues as nor- |
| mal, and other alternatives in the pattern are tried. If no complete |
| match can be found, PCRE2_ERROR_PARTIAL is returned instead of |
| PCRE2_ERROR_NOMATCH. |
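| |
| As an illustration of the keystroke-checking scenario described earlier, |
| the following C fragment is a minimal sketch of how an application might |
| interpret the result of a pcre2_match() call made with |
| PCRE2_PARTIAL_SOFT. The enum and the function classify_match_result() |
| are hypothetical names invented for this sketch; the two error values |
| are those defined in pcre2.h and are repeated only so that the fragment |
| compiles on its own. |
| |

```c
/* Values as defined in pcre2.h; repeated here only so that this sketch
   compiles without the library headers. */
#ifndef PCRE2_ERROR_NOMATCH
#define PCRE2_ERROR_NOMATCH (-1)
#endif
#ifndef PCRE2_ERROR_PARTIAL
#define PCRE2_ERROR_PARTIAL (-2)
#endif

/* Hypothetical application-level verdicts for an input field. */
typedef enum {
  INPUT_COMPLETE,    /* the text typed so far matches the whole pattern */
  INPUT_INCOMPLETE,  /* valid so far, but more characters are needed */
  INPUT_INVALID,     /* cannot become a match; reject the keystroke */
  INPUT_ERROR        /* some other error from the matching function */
} input_state;

/* Map the return code of a matching call made with PCRE2_PARTIAL_SOFT
   onto a verdict for what the user has typed so far. */
static input_state classify_match_result(int rc)
{
  if (rc >= 0) return INPUT_COMPLETE;
  if (rc == PCRE2_ERROR_PARTIAL) return INPUT_INCOMPLETE;
  if (rc == PCRE2_ERROR_NOMATCH) return INPUT_INVALID;
  return INPUT_ERROR;
}
```

| |
| An application of this kind would probably refuse the keystroke on |
| INPUT_INVALID and accept the field only on INPUT_COMPLETE. |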
| |
| This option is "soft" because it prefers a complete match over a par- |
| tial match. All the various matching items in a pattern behave as if |
| the subject string is potentially complete. For example, \z, \Z, and $ |
| match at the end of the subject, as normal, and for \b and \B the end |
| of the subject is treated as a non-alphanumeric. |
| |
| If there is more than one partial match, the first one that was found |
| provides the data that is returned. Consider this pattern: |
| |
| /123\w+X|dogY/ |
| |
| If this is matched against the subject string "abc123dog", both alter- |
| natives fail to match, but the end of the subject is reached during |
| matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 |
| and 9, identifying "123dog" as the first partial match that was found. |
| (In this example, there are two partial matches, because "dog" on its |
| own partially matches the second alternative.) |
| |
| PCRE2_PARTIAL_HARD WITH pcre2_match() |
| |
| If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is |
| returned as soon as a partial match is found, without continuing to |
| search for possible complete matches. This option is "hard" because it |
| prefers an earlier partial match over a later complete match. For this |
| reason, the assumption is made that the end of the supplied subject |
| string may not be the true end of the available data, and so, if \z, |
| \Z, \b, \B, or $ are encountered at the end of the subject, the result |
| is PCRE2_ERROR_PARTIAL, provided that at least one character in the |
| subject has been inspected. |
| |
| Comparing hard and soft partial matching |
| |
| The difference between the two partial matching options can be illus- |
| trated by a pattern such as: |
| |
| /dog(sbody)?/ |
| |
| This matches either "dog" or "dogsbody", greedily (that is, it prefers |
| the longer string if possible). If it is matched against the string |
| "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". |
| However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR- |
| TIAL. On the other hand, if the pattern is made ungreedy the result is |
| different: |
| |
| /dog(sbody)??/ |
| |
| In this case the result is always a complete match because that is |
| found first, and matching never continues after finding a complete |
| match. It might be easier to follow this explanation by thinking of the |
| two patterns like this: |
| |
| /dog(sbody)?/ is the same as /dogsbody|dog/ |
| /dog(sbody)??/ is the same as /dog|dogsbody/ |
| |
| The second pattern will never match "dogsbody", because it will always |
| find the shorter match first. |
| |
| |
| PARTIAL MATCHING USING pcre2_dfa_match() |
| |
| The DFA functions move along the subject string character by character, |
| without backtracking, searching for all possible matches simultane- |
| ously. If the end of the subject is reached before the end of the pat- |
| tern, there is the possibility of a partial match, again provided that |
| at least one character has been inspected. |
| |
| When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if |
| there have been no complete matches. Otherwise, the complete matches |
| are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match |
| takes precedence over any complete matches. The portion of the string |
| that was matched when the longest partial match was found is set as the |
| first matching string. |
| |
| Because the DFA functions always search for all possible matches, and |
| there is no difference between greedy and ungreedy repetition, their |
| behaviour is different from the standard functions when PCRE2_PAR- |
| TIAL_HARD is set. Consider the string "dog" matched against the |
| ungreedy pattern shown above: |
| |
| /dog(sbody)??/ |
| |
| Whereas the standard function stops as soon as it finds the complete |
| match for "dog", the DFA function also finds the partial match for |
| "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set. |
| |
| |
| PARTIAL MATCHING AND WORD BOUNDARIES |
| |
| If a pattern ends with one of the sequences \b or \B, which test for |
| word boundaries, partial matching with PCRE2_PARTIAL_SOFT can give |
| counter-intuitive results. Consider this pattern: |
| |
| /\bcat\b/ |
| |
| This matches "cat", provided there is a word boundary at either end. If |
| the subject string is "the cat", the comparison of the final "t" with a |
| following character cannot take place, so a partial match is found. |
| However, normal matching carries on, and \b matches at the end of the |
| subject when the last character is a letter, so a complete match is |
| found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using |
| PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because |
| then the partial match takes precedence. |
| |
| |
| EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST |
| |
| If the partial_soft (or ps) modifier is present on a pcre2test data |
| line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a |
| run of pcre2test that uses the date example quoted above: |
| |
| re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| data> 25jun04\=ps |
| 0: 25jun04 |
| 1: jun |
| data> 25dec3\=ps |
| Partial match: 25dec3 |
| data> 3ju\=ps |
| Partial match: 3ju |
| data> 3juj\=ps |
| No match |
| data> j\=ps |
| No match |
| |
| The first data string is matched completely, so pcre2test shows the |
| matched substrings. The remaining four strings do not match the com- |
| plete pattern, but the first two are partial matches. Similar output is |
| obtained if DFA matching is used. |
| |
| If the partial_hard (or ph) modifier is present on a pcre2test data |
| line, the PCRE2_PARTIAL_HARD option is set for the match. |
| |
| |
| MULTI-SEGMENT MATCHING WITH pcre2_dfa_match() |
| |
| When a partial match has been found using a DFA matching function, it |
| is possible to continue the match by providing additional subject data |
| and calling the function again with the same compiled regular expres- |
| sion, this time setting the PCRE2_DFA_RESTART option. You must pass the |
| same working space as before, because this is where details of the pre- |
| vious partial match are stored. Here is an example using pcre2test: |
| |
| re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| data> 23ja\=dfa,ps |
| Partial match: 23ja |
| data> n05\=dfa,dfa_restart |
| 0: n05 |
| |
| The first call has "23ja" as the subject, and requests partial match- |
| ing; the second call has "n05" as the subject for the continued |
| (restarted) match. Notice that when the match is complete, only the |
| last part is shown; PCRE2 does not retain the previously partially- |
| matched string. It is up to the calling program to do that if it needs |
| to. |
| |
| That means that, for an unanchored pattern, if a continued match fails, |
| it is not possible to try again at a new starting point. All this |
| facility is capable of doing is continuing with the previous match |
| attempt. In the previous example, if the second set of data is "ug23" |
| the result is no match, even though there would be a match for "aug23" |
| if the entire string were given at once. Depending on the application, |
| this may or may not be what you want. The only way to allow for start- |
| ing again at the next character is to retain the matched part of the |
| subject and try a new complete match. |
| |
| You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with |
| PCRE2_DFA_RESTART to continue partial matching over multiple segments. |
| This facility can be used to pass very long subject strings to the DFA |
| matching functions. |
| |
| |
| MULTI-SEGMENT MATCHING WITH pcre2_match() |
| |
| Unlike the DFA function, it is not possible to restart the previous |
| match with a new segment of data when using pcre2_match(). Instead, new |
| data must be added to the previous subject string, and the entire match |
| re-run, starting from the point where the partial match occurred. Ear- |
| lier data can be discarded. |
| |
| It is best to use PCRE2_PARTIAL_HARD in this situation, because it does |
| not treat the end of a segment as the end of the subject when matching |
| \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches |
| dates: |
| |
| re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
| data> The date is 23ja\=ph |
| Partial match: 23ja |
| |
| At this stage, an application could discard the text preceding "23ja", |
| add on text from the next segment, and call the matching function |
| again. Unlike the DFA matching function, the entire matching string |
| must always be available, and the complete matching process occurs for |
| each call, so more memory and more processing time are needed. |
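| |
| The "discard and add on" step might be sketched in C as follows. The |
| function carry_over() and its parameters are hypothetical and are not |
| part of PCRE2; the sketch assumes that the application keeps the subject |
| in a single fixed-size buffer. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: discard everything in buf before offset keep_from,
   slide the retained tail to the front, and append the next segment.
   On success *len is updated and 0 is returned. If the result would not
   fit in cap code units, -1 is returned and nothing is changed. */
static int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
                      const char *segment, size_t seg_len)
{
  assert(buf != NULL && len != NULL && segment != NULL);
  assert(keep_from <= *len && *len <= cap);

  size_t kept = *len - keep_from;
  if (seg_len > cap - kept) return -1;   /* not enough room */

  memmove(buf, buf + keep_from, kept);   /* source and target may overlap */
  memcpy(buf + kept, segment, seg_len);
  *len = kept + seg_len;
  return 0;
}
```

| |
| For the partial match shown above, keep_from would be the first ovector |
| value (12 for "The date is 23ja"). When the pattern contains lookbehind |
| assertions, more text has to be kept, as described in the next section. |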
| |
| |
| ISSUES WITH MULTI-SEGMENT MATCHING |
| |
| Certain types of pattern may give problems with multi-segment matching, |
| whichever matching function is used. |
| |
| 1. If the pattern contains a test for the beginning of a line, you need |
| to pass the PCRE2_NOTBOL option when the subject string for any call |
| does not start at the beginning of a line. There is also a PCRE2_NOTEOL |
| option, but in practice when doing multi-segment matching you should be |
| using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL. |
| |
| 2. If a pattern contains a lookbehind assertion, characters that pre- |
| cede the start of the partial match may have been inspected during the |
| matching process. When using pcre2_match(), sufficient characters must |
| be retained for the next match attempt. You can ensure that enough |
| characters are retained by doing the following: |
| |
| Before doing any matching, find the length of the longest lookbehind in |
| the pattern by calling pcre2_pattern_info() with the |
| PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in |
| characters, not code units. After a partial match, moving back from the |
| ovector[0] offset in the subject by the number of characters given for |
| the maximum lookbehind gets you to the earliest character that must be |
| retained. In a non-UTF or a 32-bit situation, moving back is just a |
| subtraction, but in UTF-8 or UTF-16 you have to count characters while |
| moving back through the code units. |
| |
| Characters before the point you have now reached can be discarded, and |
| after the next segment has been added to what is retained, you should |
| run the next match with the startoffset argument set so that the match |
| begins at the same point as before. |
| |
| For example, if the pattern "(?<=123)abc" is partially matched against |
| the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi- |
| mum lookbehind count is 3, so all characters before offset 2 can be |
| discarded. The value of startoffset for the next match should be 3. |
| When pcre2test displays a partial match, it indicates the lookbehind |
| characters with '<' characters: |
| |
| re> "(?<=123)abc" |
| data> xx123ab\=ph |
| Partial match: 123ab |
| <<< |
| |
| 3. Because a partial match must always contain at least one character, |
| what might be considered a partial match of an empty string actually |
| gives a "no match" result. For example: |
| |
| re> /c(?<=abc)x/ |
| data> ab\=ps |
| No match |
| |
| If the next segment begins "cx", a match should be found, but this will |
| only happen if characters from the previous segment are retained. For |
| this reason, a "no match" result should be interpreted as "partial |
| match of an empty string" when the pattern contains lookbehinds. |
| |
| 4. Matching a subject string that is split into multiple segments may |
| not always produce exactly the same result as matching over one single |
| long string, especially when PCRE2_PARTIAL_SOFT is used. The section |
| "Partial Matching and Word Boundaries" above describes an issue that |
| arises if the pattern ends with \b or \B. Another kind of difference |
| may occur when there are multiple matching possibilities, because (for |
| PCRE2_PARTIAL_SOFT) a partial match result is given only when there are |
| no completed matches. This means that as soon as the shortest match has |
| been found, continuation to a new subject segment is no longer possi- |
| ble. Consider this pcre2test example: |
| |
| re> /dog(sbody)?/ |
| data> dogsb\=ps |
| 0: dog |
| data> do\=ps,dfa |
| Partial match: do |
| data> gsb\=ps,dfa,dfa_restart |
| 0: g |
| data> dogsbody\=dfa |
| 0: dogsbody |
| 1: dog |
| |
| The first data line passes the string "dogsb" to a standard matching |
| function, setting the PCRE2_PARTIAL_SOFT option. Although the string is |
| a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, |
| because the shorter string "dog" is a complete match. Similarly, when |
| the subject is presented to a DFA matching function in several parts |
| ("do" and "gsb" being the first two) the match stops when "dog" has |
| been found, and it is not possible to continue. On the other hand, if |
| "dogsbody" is presented as a single string, a DFA matching function |
| finds both matches. |
| |
| Because of these problems, it is best to use PCRE2_PARTIAL_HARD when |
| matching multi-segment data. The example above then behaves differ- |
| ently: |
| |
| re> /dog(sbody)?/ |
| data> dogsb\=ph |
| Partial match: dogsb |
| data> do\=ph,dfa |
| Partial match: do |
| data> gsb\=ph,dfa,dfa_restart |
| Partial match: gsb |
| |
| 5. Patterns that contain alternatives at the top level which do not all |
| start with the same pattern item may not work as expected when |
| PCRE2_DFA_RESTART is used. For example, consider this pattern: |
| |
| 1234|3789 |
| |
| If the first part of the subject is "ABC123", a partial match of the |
| first alternative is found at offset 3. There is no partial match for |
| the second alternative, because such a match does not start at the same |
| point in the subject string. Attempting to continue with the string |
| "7890" does not yield a match because only those alternatives that |
| match at one point in the subject are remembered. The problem arises |
| because the start of the second alternative matches within the first |
| alternative. There is no problem with anchored patterns or patterns |
| such as: |
| |
| 1234|ABCD |
| |
| where no string can be a partial match for both alternatives. This is |
| not a problem if a standard matching function is used, because the |
| entire match has to be rerun each time: |
| |
| re> /1234|3789/ |
| data> ABC123\=ph |
| Partial match: 123 |
| data> 1237890 |
| 0: 3789 |
| |
| Of course, instead of using PCRE2_DFA_RESTART, the same technique of |
| re-running the entire match can also be used with the DFA matching |
| function. Another possibility is to work with two buffers. If a partial |
| match at offset n in the first buffer is followed by "no match" when |
| PCRE2_DFA_RESTART is used on the second buffer, you can then try a new |
| match starting at offset n+1 in the first buffer. |
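| |
| Returning to issue 2 above, the step of moving back from ovector[0] by |
| the maximum lookbehind count in a UTF-8 subject might be sketched in C |
| like this. The function retain_from_utf8() is a hypothetical helper, not |
| a PCRE2 function, and it assumes that the subject is valid UTF-8 and |
| that the starting offset lies on a character boundary. |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for UTF-8 subjects: starting at code-unit offset
   start (for example ovector[0] after a partial match), step back over at
   most max_lookbehind characters and return the offset of the earliest
   code unit that must be retained. */
static size_t retain_from_utf8(const unsigned char *subject, size_t start,
                               unsigned int max_lookbehind)
{
  assert(subject != NULL);

  size_t pos = start;
  while (max_lookbehind > 0 && pos > 0) {
    pos--;                                   /* last code unit of a character */
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
      pos--;                                 /* skip continuation bytes 10xxxxxx */
    max_lookbehind--;
  }
  return pos;
}
```

| |
| For the "(?<=123)abc" example, retain_from_utf8() applied to "xx123ab" |
| with a start of 5 and a count of 3 gives 2. Subtracting that from the |
| start gives 3, the startoffset for the next match. |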
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 22 December 2014 |
| Copyright (c) 1997-2014 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 REGULAR EXPRESSION DETAILS |
| |
| The syntax and semantics of the regular expressions that are supported |
| by PCRE2 are described in detail below. There is a quick-reference syn- |
| tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax |
| and semantics as closely as it can. PCRE2 also supports some alterna- |
| tive regular expression syntax (which does not conflict with the Perl |
| syntax) in order to provide some compatibility with regular expressions |
| in Python, .NET, and Oniguruma. |
| |
| Perl's regular expressions are described in its own documentation, and |
| regular expressions in general are covered in a number of books, some |
| of which have copious examples. Jeffrey Friedl's "Mastering Regular |
| Expressions", published by O'Reilly, covers regular expressions in |
| great detail. This description of PCRE2's regular expressions is |
| intended as reference material. |
| |
| This document discusses the patterns that are supported by PCRE2 when |
| its main matching function, pcre2_match(), is used. PCRE2 also has an |
| alternative matching function, pcre2_dfa_match(), which matches using a |
| different algorithm that is not Perl-compatible. Some of the features |
| discussed below are not available when DFA matching is used. The advan- |
| tages and disadvantages of the alternative function, and how it differs |
| from the normal function, are discussed in the pcre2matching page. |
| |
| |
| SPECIAL START-OF-PATTERN ITEMS |
| |
| A number of options that can be passed to pcre2_compile() can also be |
| set by special items at the start of a pattern. These are not Perl-com- |
| patible, but are provided to make these options accessible to pattern |
| writers who are not able to change the program that processes the pat- |
| tern. Any number of these items may appear, but they must all be |
| together right at the start of the pattern string, and the letters must |
| be in upper case. |
| |
| UTF support |
| |
| In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either |
| as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 |
| can be specified for the 32-bit library, in which case it constrains |
| the character values to valid Unicode code points. To process UTF |
| strings, PCRE2 must be built to include Unicode support (which is the |
| default). When using UTF strings you must either call the compiling |
| function with the PCRE2_UTF option, or the pattern must start with the |
| special sequence (*UTF), which is equivalent to setting the relevant |
| option. How setting a UTF mode affects pattern matching is mentioned in |
| several places below. There is also a summary of features in the |
| pcre2unicode page. |
| |
| Some applications that allow their users to supply patterns may wish to |
| restrict them to non-UTF data for security reasons. If the |
| PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not |
| allowed, and its appearance in a pattern causes an error. |
| |
| Unicode property support |
| |
| Another special sequence that may appear at the start of a pattern is |
| (*UCP). This has the same effect as setting the PCRE2_UCP option: it |
| causes sequences such as \d and \w to use Unicode properties to deter- |
| mine character types, instead of recognizing only characters with codes |
| less than 256 via a lookup table. |
| |
| Some applications that allow their users to supply patterns may wish to |
| restrict them for security reasons. If the PCRE2_NEVER_UCP option is |
| passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in |
| a pattern causes an error. |
| |
| Locking out empty string matching |
| |
| Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same |
| effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option |
| to whichever matching function is subsequently called to match the pat- |
| tern. These options lock out the matching of empty strings, either |
| entirely, or only at the start of the subject. |
| |
| Disabling auto-possessification |
| |
| If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as |
| setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making |
| quantifiers possessive when what follows cannot match the repeated |
| item. For example, by default a+b is treated as a++b. For more details, |
| see the pcre2api documentation. |
| |
| Disabling start-up optimizations |
| |
| If a pattern starts with (*NO_START_OPT), it has the same effect as |
| setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti- |
| mizations for quickly reaching "no match" results. For more details, |
| see the pcre2api documentation. |
| |
| Disabling automatic anchoring |
| |
| If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect |
| as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza- |
| tions that apply to patterns whose top-level branches all start with .* |
| (match any number of arbitrary characters). For more details, see the |
| pcre2api documentation. |
| |
| Disabling JIT compilation |
| |
| If a pattern that starts with (*NO_JIT) is successfully compiled, an |
| attempt by the application to apply the JIT optimization by calling |
| pcre2_jit_compile() is ignored. |
| |
| Setting match and recursion limits |
| |
| The caller of pcre2_match() can set a limit on the number of times the |
| internal match() function is called and on the maximum depth of recur- |
| sive calls. These facilities are provided to catch runaway matches that |
| are provoked by patterns with huge matching trees (a typical example is |
| a pattern with nested unlimited repeats) and to avoid running out of |
| system stack by too much recursion. When one of these limits is |
| reached, pcre2_match() gives an error return. The limits can also be |
| set by items at the start of the pattern of the form |
| |
| (*LIMIT_MATCH=d) |
| (*LIMIT_RECURSION=d) |
| |
| where d is any number of decimal digits. However, the value of the set- |
| ting must be less than the value set (or defaulted) by the caller of |
| pcre2_match() for it to have any effect. In other words, the pattern |
| writer can lower the limits set by the programmer, but not raise them. |
| If there is more than one setting of one of these limits, the lower |
| value is used. |
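| |
| The rule can be modelled by a few lines of C. This is only a sketch of |
| the behaviour described above, not PCRE2's own code; effective_limit() |
| and its parameters are hypothetical names. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the rule: a (*LIMIT_...=d) item can lower the limit chosen by
   the caller but never raise it, and when several items are present the
   lowest value applies. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_settings,
                                size_t count)
{
  uint32_t limit = caller_limit;
  for (size_t i = 0; i < count; i++) {
    if (pattern_settings[i] < limit)
      limit = pattern_settings[i];
  }
  return limit;
}
```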
| |
| Newline conventions |
| |
| PCRE2 supports five different conventions for indicating line breaks in |
| strings: a single CR (carriage return) character, a single LF (line- |
| feed) character, the two-character sequence CRLF, any of the three pre- |
| ceding, or any Unicode newline sequence. The pcre2api page has further |
| discussion about newlines, and shows how to set the newline convention |
| when calling pcre2_compile(). |
| |
| It is also possible to specify a newline convention by starting a pat- |
| tern string with one of the following five sequences: |
| |
| (*CR) carriage return |
| (*LF) linefeed |
| (*CRLF) carriage return, followed by linefeed |
| (*ANYCRLF) any of the three above |
| (*ANY) all Unicode newline sequences |
| |
| These override the default and the options given to the compiling func- |
| tion. For example, on a Unix system where LF is the default newline |
| sequence, the pattern |
| |
| (*CR)a.b |
| |
| changes the convention to CR. That pattern matches "a\nb" because LF is |
| no longer a newline. If more than one of these settings is present, the |
| last one is used. |
| |
| The newline convention affects where the circumflex and dollar asser- |
| tions are true. It also affects the interpretation of the dot metachar- |
| acter when PCRE2_DOTALL is not set, and the behaviour of \N. However, |
| it does not affect what the \R escape sequence matches. By default, |
| this is any Unicode newline sequence, for Perl compatibility. However, |
| this can be changed; see the description of \R in the section entitled |
| "Newline sequences" below. A change of \R setting can be combined with |
| a change of newline convention. |
| |
| Specifying what \R matches |
| |
| It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| the complete set of Unicode line endings) by setting the option |
| PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by |
| starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- |
| CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. |
| |
| |
| EBCDIC CHARACTER CODES |
| |
| PCRE2 can be compiled to run in an environment that uses EBCDIC as its |
| character code rather than ASCII or Unicode (typically a mainframe sys- |
| tem). In the sections below, character code values are ASCII or Uni- |
| code; in an EBCDIC environment these characters may have different code |
| values, and there are no code points greater than 255. |
| |
| |
| CHARACTERS AND METACHARACTERS |
| |
| A regular expression is a pattern that is matched against a subject |
| string from left to right. Most characters stand for themselves in a |
| pattern, and match the corresponding characters in the subject. As a |
| trivial example, the pattern |
| |
| The quick brown fox |
| |
| matches a portion of a subject string that is identical to itself. When |
| caseless matching is specified (the PCRE2_CASELESS option), letters are |
| matched independently of case. |
| |
| The power of regular expressions comes from the ability to include |
| alternatives and repetitions in the pattern. These are encoded in the |
| pattern by the use of metacharacters, which do not stand for themselves |
| but instead are interpreted in some special way. |
| |
| There are two different sets of metacharacters: those that are recog- |
| nized anywhere in the pattern except within square brackets, and those |
| that are recognized within square brackets. Outside square brackets, |
| the metacharacters are as follows: |
| |
| \ general escape character with several uses |
| ^ assert start of string (or line, in multiline mode) |
| $ assert end of string (or line, in multiline mode) |
| . match any character except newline (by default) |
| [ start character class definition |
| | start of alternative branch |
| ( start subpattern |
| ) end subpattern |
| ? extends the meaning of ( |
| also 0 or 1 quantifier |
| also quantifier minimizer |
| * 0 or more quantifier |
| + 1 or more quantifier |
| also "possessive quantifier" |
| { start min/max quantifier |
| |
| Part of a pattern that is in square brackets is called a "character |
| class". In a character class the only metacharacters are: |
| |
| \ general escape character |
| ^ negate the class, but only if the first character |
| - indicates character range |
| [ POSIX character class (only if followed by POSIX |
| syntax) |
| ] terminates the character class |
| |
| The following sections describe the use of each of the metacharacters. |
| |
| |
| BACKSLASH |
| |
| The backslash character has several uses. Firstly, if it is followed by |
| a character that is not a number or a letter, it takes away any special |
| meaning that character may have. This use of backslash as an escape |
| character applies both inside and outside character classes. |
| |
| For example, if you want to match a * character, you write \* in the |
| pattern. This escaping action applies whether or not the following |
| character would otherwise be interpreted as a metacharacter, so it is |
| always safe to precede a non-alphanumeric with backslash to specify |
| that it stands for itself. In particular, if you want to match a back- |
| slash, you write \\. |
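| |
| A program that needs to embed arbitrary literal text in a pattern can |
| rely on this rule. The following C sketch puts a backslash before every |
| ASCII non-alphanumeric and copies everything else unchanged. The |
| function escape_literal() is a hypothetical helper, not part of the |
| PCRE2 API, and it assumes an ASCII-based, NUL-terminated 8-bit string. |
| |

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy text into out, putting a backslash before
   every ASCII character that is not a letter or a digit. Code units above
   127 are copied unchanged. Returns 0 on success, or -1 if out (including
   room for the terminating NUL) is too small. */
static int escape_literal(const char *text, char *out, size_t out_size)
{
  assert(text != NULL && out != NULL);

  size_t n = 0;
  for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; p++) {
    int needs_escape = *p < 128 && !isalnum(*p);
    size_t needed = needs_escape ? 2 : 1;
    if (n + needed >= out_size) return -1;   /* keep room for the NUL */
    if (needs_escape) out[n++] = '\\';
    out[n++] = (char)*p;
  }
  if (n >= out_size) return -1;
  out[n] = '\0';
  return 0;
}
```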
| |
| In a UTF mode, only ASCII numbers and letters have any special meaning |
| after a backslash. All other characters (in particular, those whose |
| codepoints are greater than 127) are treated as literals. |
| |
| If a pattern is compiled with the PCRE2_EXTENDED option, most white |
| space in the pattern (other than in a character class), and characters |
| between a # outside a character class and the next newline, inclusive, |
| are ignored. An escaping backslash can be used to include a white space |
| or # character as part of the pattern. |
| |
| If you want to remove the special meaning from a sequence of charac- |
| ters, you can do so by putting them between \Q and \E. This is differ- |
| ent from Perl in that $ and @ are handled as literals in \Q...\E |
| sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola- |
| tion. Note the following examples: |
| |
| Pattern PCRE2 matches Perl matches |
| |
| \Qabc$xyz\E abc$xyz abc followed by the |
| contents of $xyz |
| \Qabc\$xyz\E abc\$xyz abc\$xyz |
| \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| |
| The \Q...\E sequence is recognized both inside and outside character |
| classes. An isolated \E that is not preceded by \Q is ignored. If \Q |
| is not followed by \E later in the pattern, the literal interpretation |
| continues to the end of the pattern (that is, \E is assumed at the |
| end). If the isolated \Q is inside a character class, this causes an |
| error, because the character class is not terminated. |
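| |
| A C sketch of wrapping literal text in \Q...\E follows. The handling of |
| text that itself contains \E is an assumption derived from the rules |
| above: the quoted run is closed, a literal backslash and E are written |
| as \\E, and quoting is reopened with \Q. The function quote_literal() |
| and its helper are hypothetical names, not PCRE2 functions. |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append len code units of s to out at *n, always leaving room for a
   terminating NUL. Requires *n < out_size on entry. */
static int append(char *out, size_t out_size, size_t *n,
                  const char *s, size_t len)
{
  if (len >= out_size - *n) return -1;
  memcpy(out + *n, s, len);
  *n += len;
  return 0;
}

/* Hypothetical helper: write \Q<text>\E into out. Each \E inside text is
   emitted as \E\\E\Q so that it is matched literally. Returns 0 on
   success, or -1 if out is too small. */
static int quote_literal(const char *text, char *out, size_t out_size)
{
  assert(text != NULL && out != NULL);

  size_t n = 0;
  if (out_size == 0) return -1;
  if (append(out, out_size, &n, "\\Q", 2) != 0) return -1;

  for (;;) {
    const char *end = strstr(text, "\\E");
    if (end == NULL) {
      if (append(out, out_size, &n, text, strlen(text)) != 0) return -1;
      break;
    }
    if (append(out, out_size, &n, text, (size_t)(end - text)) != 0) return -1;
    if (append(out, out_size, &n, "\\E\\\\E\\Q", 7) != 0) return -1;
    text = end + 2;                          /* continue after the \E */
  }

  if (append(out, out_size, &n, "\\E", 2) != 0) return -1;
  out[n] = '\0';
  return 0;
}
```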
| |
| Non-printing characters |
| |
| A second use of backslash provides a way of encoding non-printing char- |
| acters in patterns in a visible manner. There is no restriction on the |
| appearance of non-printing characters in a pattern, but when a pattern |
| is being prepared by text editing, it is often easier to use one of the |
| following escape sequences than the binary character it represents. In |
| an ASCII or Unicode environment, these escapes are as follows: |
| |
| \a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is any printable ASCII character |
| \e escape (hex 1B) |
| \f form feed (hex 0C) |
| \n linefeed (hex 0A) |
| \r carriage return (hex 0D) |
| \t tab (hex 09) |
| \0dd character with octal code 0dd |
| \ddd character with octal code ddd, or back reference |
| \o{ddd..} character with octal code ddd.. |
| \xhh character with hex code hh |
| \x{hhh..} character with hex code hhh.. (default mode) |
| \uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) |
| |
| The precise effect of \cx on ASCII characters is as follows: if x is a |
| lower case letter, it is converted to upper case. Then bit 6 of the |
| character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A |
| (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes |
| hex 7B (; is 3B). If the code unit following \c has a value less than |
| 32 or greater than 126, a compile-time error occurs. This locks out |
| non-printable ASCII characters in all modes. |
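| |
| The arithmetic described above can be sketched in C as follows. This is |
| only a model of the documented rule for an ASCII environment; the |
| function name control_escape_value() is hypothetical and is not part of |
| the PCRE2 API. |
| |
```c
/* Hypothetical model of \cx in an ASCII environment. Returns the code
   value generated, or -1 where the text says a compile-time error
   occurs (code unit below 32 or above 126). */
int control_escape_value(int x)
{
    if (x < 32 || x > 126)
        return -1;
    if (x >= 'a' && x <= 'z')
        x -= 32;          /* convert a lower case letter to upper case */
    return x ^ 0x40;      /* invert bit 6 (hex 40) */
}
```
| For instance, control_escape_value('A') gives 1 and |
| control_escape_value(';') gives hex 7B, in line with the examples in |
| the text. |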
| |
| When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen- |
| erate the appropriate EBCDIC code values. The \c escape is processed as |
| specified for Perl in the perlebcdic document. The only characters that |
| are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. |
| Any other character provokes a compile-time error. The sequence \c@ |
| encodes character code 0; after \c the letters (in either case) encode |
| characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters |
| 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 |
| (hex 5F). |
| |
| Thus, apart from \c?, these escapes generate the same character code |
| values as they do in an ASCII environment, though the meanings of the |
| values mostly differ. For example, \cG always generates code value 7, |
| which is BEL in ASCII but DEL in EBCDIC. |
| |
| The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, |
| but because 127 is not a control character in EBCDIC, Perl makes it |
| generate the APC character. Unfortunately, there are several variants |
| of EBCDIC. In most of them the APC character has the value 255 (hex |
| FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If |
| certain other characters have POSIX-BC values, PCRE2 makes \c? generate |
| 95; otherwise it generates 255. |
| |
| After \0 up to two further octal digits are read. If there are fewer |
| than two digits, just those that are present are used. Thus the |
| sequence \0\x\015 specifies two binary zeros followed by a CR character |
| (code value 13). Make sure you supply two digits after the initial zero |
| if the pattern character that follows is itself an octal digit. |
| |
| The escape \o must be followed by a sequence of octal digits, enclosed |
| in braces. An error occurs if this is not the case. This escape is a |
| recent addition to Perl; it provides a way of specifying character code |
| points as octal numbers greater than 0777, and it also allows octal |
| numbers and back references to be unambiguously specified. |
| |
| For greater clarity and unambiguity, it is best to avoid following \ by |
| a digit greater than zero. Instead, use \o{} or \x{} to specify charac- |
| ter numbers, and \g{} to specify back references. The following para- |
| graphs describe the old, ambiguous syntax. |
| |
| The handling of a backslash followed by a digit other than 0 is compli- |
| cated, and Perl has changed over time, causing PCRE2 also to change. |
| |
| Outside a character class, PCRE2 reads the digit and any following dig- |
| its as a decimal number. If the number is less than 10, begins with the |
| digit 8 or 9, or if there are at least that many previous capturing |
| left parentheses in the expression, the entire sequence is taken as a |
| back reference. A description of how this works is given later, follow- |
| ing the discussion of parenthesized subpatterns. Otherwise, up to |
| three octal digits are read to form a character code. |
| |
| Inside a character class, PCRE2 handles \8 and \9 as the literal char- |
| acters "8" and "9", and otherwise reads up to three octal digits fol- |
| lowing the backslash, using them to generate a data character. Any sub- |
| sequent digits stand for themselves. For example, outside a character |
| class: |
| |
| \040   is another way of writing an ASCII space |
| \40    is the same, provided there are fewer than 40 |
|        previous capturing subpatterns |
| \7     is always a back reference |
| \11    might be a back reference, or another way of |
|        writing a tab |
| \011   is always a tab |
| \0113  is a tab followed by the character "3" |
| \113   might be a back reference, otherwise the |
|        character with octal code 113 |
| \377   might be a back reference, otherwise |
|        the value 255 (decimal) |
| \81    is always a back reference |
| |
| Note that octal values of 100 or greater that are specified using this |
| syntax must not be introduced by a leading zero, because no more than |
| three octal digits are ever read. |
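| |
| The decision described above for a backslash followed by a non-zero |
| digit, outside a character class, can be modelled by the C sketch |
| below. It follows the wording of this section rather than PCRE2's own |
| source; the type digit_escape and the function classify_digit_escape() |
| are hypothetical names, and the \0dd form is not covered. |
| |
```c
#include <stddef.h>

typedef struct {
    int is_backref;    /* 1: back reference, 0: character code */
    unsigned value;    /* group number, or the character code */
    size_t consumed;   /* number of digits used after the backslash */
} digit_escape;

/* Hypothetical model. `digits` points just past the backslash and must
   start with a digit 1-9; `groups` is the number of capturing left
   parentheses that precede the escape in the pattern. */
digit_escape classify_digit_escape(const char *digits, unsigned groups)
{
    digit_escape r = {0, 0, 0};
    unsigned long number = 0;
    size_t n = 0;

    /* Read the digit and any following digits as a decimal number. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        if (number < 100000)   /* stop accumulating to avoid overflow */
            number = number * 10 + (unsigned long)(digits[n] - '0');
        n++;
    }

    if (number < 10 || digits[0] == '8' || digits[0] == '9' ||
        number <= groups) {
        r.is_backref = 1;
        r.value = (unsigned)number;
        r.consumed = n;
        return r;
    }

    /* Otherwise, up to three octal digits form a character code. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned)(digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```
| In this model, the digits "40" with five previous groups yield the |
| character code 32 (an ASCII space), whereas the same digits with forty |
| previous groups yield a back reference to group 40. |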
| |
| By default, after \x that is not followed by {, from zero to two hexa- |
| decimal digits are read (letters can be in upper or lower case). Any |
| number of hexadecimal digits may appear between \x{ and }. If a charac- |
| ter other than a hexadecimal digit appears between \x{ and }, or if |
| there is no terminating }, an error occurs. |
| |
| If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as |
| just described only when it is followed by two hexadecimal digits. |
| Otherwise, it matches a literal "x" character. In this mode, support |
| for code points greater than 255 is provided by \u, which must be |
| followed by four hexadecimal digits; otherwise it matches a literal "u" |
| character. |
| |
| Characters whose value is less than 256 can be defined by either of the |
| two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif- |
| ference in the way they are handled. For example, \xdc is exactly the |
| same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode). |
| |
| Constraints on character values |
| |
| Characters that are specified using octal or hexadecimal numbers are |
| limited to certain values, as follows: |
| |
| 8-bit non-UTF mode    less than 0x100 |
| 8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint |
| 16-bit non-UTF mode   less than 0x10000 |
| 16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint |
| 32-bit non-UTF mode   less than 0x100000000 |
| 32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint |
| |
| Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- |
| called "surrogate" codepoints), and 0xffef. |
| |
| Escape sequences in character classes |
| |
| All the sequences that define a single character value can be used both |
| inside and outside character classes. In addition, inside a character |
| class, \b is interpreted as the backspace character (hex 08). |
| |
| \N is not allowed in a character class. \B, \R, and \X are not special |
| inside a character class. Like other unrecognized alphabetic escape |
| sequences, they cause an error. Outside a character class, these |
| sequences have different meanings. |
| |
| Unsupported escape sequences |
| |
| In Perl, the sequences \l, \L, \u, and \U are recognized by its string |
| handler and used to modify the case of following characters. By |
| default, PCRE2 does not support these escape sequences. However, if the |
| PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be |
| used to define a character by code point, as described in the previous |
| section. |
| |
| Absolute and relative back references |
| |
| The sequence \g followed by an unsigned or a negative number, option- |
| ally enclosed in braces, is an absolute or relative back reference. A |
| named back reference can be coded as \g{name}. Back references are dis- |
| cussed later, following the discussion of parenthesized subpatterns. |
| |
| Absolute and relative subroutine calls |
| |
| For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| name or a number enclosed either in angle brackets or single quotes, is |
| an alternative syntax for referencing a subpattern as a "subroutine". |
| Details are discussed later. Note that \g{...} (Perl syntax) and |
| \g<...> (Oniguruma syntax) are not synonymous. The former is a back |
| reference; the latter is a subroutine call. |
| |
| Generic character types |
| |
| Another use of backslash is for specifying generic character types: |
| |
| \d any decimal digit |
| \D any character that is not a decimal digit |
| \h any horizontal white space character |
| \H any character that is not a horizontal white space character |
| \s any white space character |
| \S any character that is not a white space character |
| \v any vertical white space character |
| \V any character that is not a vertical white space character |
| \w any "word" character |
| \W any "non-word" character |
| |
| There is also the single sequence \N, which matches a non-newline char- |
| acter. This is the same as the "." metacharacter when PCRE2_DOTALL is |
| not set. Perl also uses \N to match characters by name; PCRE2 does not |
| support this. |
| |
| Each pair of lower and upper case escape sequences partitions the com- |
| plete set of characters into two disjoint sets. Any given character |
| matches one, and only one, of each pair. The sequences can appear both |
| inside and outside character classes. They each match one character of |
| the appropriate type. If the current matching point is at the end of |
| the subject string, all of them fail, because there is no character to |
| match. |
| |
| The default \s characters are HT (9), LF (10), VT (11), FF (12), CR |
| (13), and space (32), which are defined as white space in the "C" |
| locale. This list may vary if locale-specific matching is taking place. |
| For example, in some locales the "non-breaking space" character (\xA0) |
| is recognized as white space, and in others the VT character is not. |
| |
| A "word" character is an underscore or any character that is a letter |
| or digit. By default, the definition of letters and digits is con- |
| trolled by PCRE2's low-valued character tables, and may vary if locale- |
| specific matching is taking place (see "Locale support" in the pcre2api |
| page). For example, in a French locale such as "fr_FR" in Unix-like |
| systems, or "french" in Windows, some character codes greater than 127 |
| are used for accented letters, and these are then matched by \w. The |
| use of locales with Unicode is discouraged. |
| |
| By default, characters whose code points are greater than 127 never |
| match \d, \s, or \w, and always match \D, \S, and \W, although this may |
| be different for characters in the range 128-255 when locale-specific |
| matching is happening. These escape sequences retain their original |
| meanings from before Unicode support was available, mainly for effi- |
| ciency reasons. If the PCRE2_UCP option is set, the behaviour is |
| changed so that Unicode properties are used to determine character |
| types, as follows: |
| |
| \d any character that matches \p{Nd} (decimal digit) |
| \s any character that matches \p{Z} or \h or \v |
| \w any character that matches \p{L} or \p{N}, plus underscore |
| |
| The upper case escapes match the inverse sets of characters. Note that |
| \d matches only decimal digits, whereas \w matches any Unicode digit, |
| as well as any Unicode letter, and underscore. Note also that PCRE2_UCP |
| affects \b and \B, because they are defined in terms of \w and \W. |
| Matching these sequences is noticeably slower when PCRE2_UCP is set. |
| |
| The sequences \h, \H, \v, and \V, in contrast to the other sequences, |
| which match only ASCII characters by default, always match a specific |
| list of code points, whether or not PCRE2_UCP is set. The horizontal |
| space characters are: |
| |
| U+0009 Horizontal tab (HT) |
| U+0020 Space |
| U+00A0 Non-break space |
| U+1680 Ogham space mark |
| U+180E Mongolian vowel separator |
| U+2000 En quad |
| U+2001 Em quad |
| U+2002 En space |
| U+2003 Em space |
| U+2004 Three-per-em space |
| U+2005 Four-per-em space |
| U+2006 Six-per-em space |
| U+2007 Figure space |
| U+2008 Punctuation space |
| U+2009 Thin space |
| U+200A Hair space |
| U+202F Narrow no-break space |
| U+205F Medium mathematical space |
| U+3000 Ideographic space |
| |
| The vertical space characters are: |
| |
| U+000A Linefeed (LF) |
| U+000B Vertical tab (VT) |
| U+000C Form feed (FF) |
| U+000D Carriage return (CR) |
| U+0085 Next line (NEL) |
| U+2028 Line separator |
| U+2029 Paragraph separator |
| |
| In 8-bit, non-UTF-8 mode, only the characters with code points less |
| than 256 are relevant. |
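| |
| The two fixed lists above translate directly into code point tests. |
| The C sketch below is an illustration only; the function names |
| is_hspace() and is_vspace() are hypothetical and do not come from the |
| PCRE2 API. |
| |
```c
#include <stdint.h>

/* Hypothetical model of \h: 1 if c is in the horizontal space list. */
int is_hspace(uint32_t c)
{
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        /* U+2000 (en quad) to U+200A (hair space) are contiguous. */
        return c >= 0x2000 && c <= 0x200A;
    }
}

/* Hypothetical model of \v: 1 if c is in the vertical space list. */
int is_vspace(uint32_t c)
{
    return (c >= 0x000A && c <= 0x000D) || c == 0x0085 ||
           c == 0x2028 || c == 0x2029;
}
```
| The upper case forms \H and \V correspond to the negation of these |
| tests. |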
| |
| Newline sequences |
| |
| Outside a character class, by default, the escape sequence \R matches |
| any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent |
| to the following: |
| |
| (?>\r\n|\n|\x0b|\f|\r|\x85) |
| |
| This is an example of an "atomic group", details of which are given |
| below. This particular group matches either the two-character sequence |
| CR followed by LF, or one of the single characters LF (linefeed, |
| U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- |
| riage return, U+000D), or NEL (next line, U+0085). Because this is an |
| atomic group, the two-character sequence is treated as a single unit |
| that cannot be split. |
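| |
| The effect of that atomic group in 8-bit non-UTF-8 mode can be modelled |
| by the C sketch below, which reports how many code units \R would |
| consume at a given point. The function name newline_seq_length() is |
| hypothetical; this is not how PCRE2 itself is implemented. |
| |
```c
#include <stddef.h>

/* Hypothetical model of \R in 8-bit non-UTF-8 mode. Returns the length
   (1 or 2) of the newline sequence at the start of s, or 0 if there is
   none. CR followed by LF is always taken as a single unit. */
size_t newline_seq_length(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    if (s[0] == '\r')
        return (len >= 2 && s[1] == '\n') ? 2 : 1;
    if (s[0] == '\n' || s[0] == 0x0B || s[0] == '\f' || s[0] == 0x85)
        return 1;
    return 0;
}
```
| Because the CR case never returns 1 when an LF follows, the model |
| mirrors the rule that the two-character sequence cannot be split. |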
| |
| In other modes, two additional characters whose codepoints are greater |
| than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| rator, U+2029). Unicode support is not needed for these characters to |
| be recognized. |
| |
| It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| the complete set of Unicode line endings) by setting the option |
| PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back- |
| slash R".) This can be made the default when PCRE2 is built; if this is |
| the case, the other behaviour can be requested via the PCRE2_BSR_UNI- |
| CODE option. It is also possible to specify these settings by starting |
| a pattern string with one of the following sequences: |
| |
| (*BSR_ANYCRLF) CR, LF, or CRLF only |
| (*BSR_UNICODE) any Unicode newline sequence |
| |
| These override the default and the options given to the compiling func- |
| tion. Note that these special settings, which are not Perl-compatible, |
| are recognized only at the very start of a pattern, and that they must |
| be in upper case. If more than one of them is present, the last one is |
| used. They can be combined with a change of newline convention; for |
| example, a pattern can start with: |
| |
| (*ANY)(*BSR_ANYCRLF) |
| |
| They can also be combined with the (*UTF) or (*UCP) special sequences. |
| Inside a character class, \R is treated as an unrecognized escape |
| sequence, and causes an error. |
| |
| Unicode character properties |
| |
| When PCRE2 is built with Unicode support (the default), three addi- |
| tional escape sequences that match characters with specific properties |
| are available. In 8-bit non-UTF-8 mode, these sequences are of course |
| limited to testing characters whose codepoints are less than 256, but |
| they do work in this mode. The extra escape sequences are: |
| |
| \p{xx} a character with the xx property |
| \P{xx} a character without the xx property |
| \X a Unicode extended grapheme cluster |
| |
| The property names represented by xx above are limited to the Unicode |
| script names, the general category properties, "Any", which matches any |
| character (including newline), and some special PCRE2 properties |
| (described in the next section). Other Perl properties such as "InMu- |
| sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not |
| match any characters, so always causes a match failure. |
| |
| Sets of Unicode characters are defined as belonging to certain scripts. |
| A character from one of these sets can be matched using a script name. |
| For example: |
| |
| \p{Greek} |
| \P{Han} |
| |
| Those that are not part of an identified script are lumped together as |
| "Common". The current list of scripts is: |
| |
| Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese, |
| Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, |
| Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, |
| Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, |
| Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Geor- |
| gian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, |
| Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, |
| Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan- |
| nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, |
| Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha- |
| jani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui, |
| Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, |
| Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, |
| Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, |
| Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, |
| Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, |
| Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala, |
| Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
| Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, |
| Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi. |
| |
| Each character has exactly one Unicode general category property, spec- |
| ified by a two-letter abbreviation. For compatibility with Perl, nega- |
| tion can be specified by including a circumflex between the opening |
| brace and the property name. For example, \p{^Lu} is the same as |
| \P{Lu}. |
| |
| If only one letter is specified with \p or \P, it includes all the gen- |
| eral category properties that start with that letter. In this case, in |
| the absence of negation, the curly brackets in the escape sequence are |
| optional; these two examples have the same effect: |
| |
| \p{L} |
| \pL |
| |
| The following general category property codes are supported: |
| |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| |
| L Letter |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| |
| The special property L& is also supported: it matches a character that |
| has the Lu, Ll, or Lt property, in other words, a letter that is not |
| classified as a modifier or "other". |
| |
| The Cs (Surrogate) property applies only to characters in the range |
| U+D800 to U+DFFF. Such characters are not valid in Unicode strings and |
| so cannot be tested by PCRE2, unless UTF validity checking has been |
| turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api |
| page). Perl does not support the Cs property. |
| |
| The long synonyms for property names that Perl supports (such as |
| \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix |
| any of these properties with "Is". |
| |
| No character that is in the Unicode table has the Cn (unassigned) prop- |
| erty. Instead, this property is assumed for any code point that is not |
| in the Unicode table. |
| |
| Specifying caseless matching does not affect these escape sequences. |
| For example, \p{Lu} always matches only upper case letters. This is |
| different from the behaviour of current versions of Perl. |
| |
| Matching characters by Unicode property is not fast, because PCRE2 has |
| to do a multistage table lookup in order to find a character's prop- |
| erty. That is why the traditional escape sequences such as \d and \w do |
| not use Unicode properties in PCRE2 by default, though you can make |
| them do so by setting the PCRE2_UCP option or by starting the pattern |
| with (*UCP). |
| |
| Extended grapheme clusters |
| |
| The \X escape matches any number of Unicode characters that form an |
| "extended grapheme cluster", and treats the sequence as an atomic group |
| (see below). Unicode supports various kinds of composite character by |
| giving each character a grapheme breaking property, and having rules |
| that use these properties to define the boundaries of extended grapheme |
| clusters. \X always matches at least one character. Then it decides |
| whether to add additional characters according to the following rules |
| for ending a cluster: |
| |
| 1. End at the end of the subject string. |
| |
| 2. Do not end between CR and LF; otherwise end after any control char- |
| acter. |
| |
| 3. Do not break Hangul (a Korean script) syllable sequences. Hangul |
| characters are of five types: L, V, T, LV, and LVT. An L character may |
| be followed by an L, V, LV, or LVT character; an LV or V character may |
| be followed by a V or T character; an LVT or T character may be followed |
| only by a T character. |
| |
| 4. Do not end before extending characters or spacing marks. Characters |
| with the "mark" property always have the "extend" grapheme breaking |
| property. |
| |
| 5. Do not end after prepend characters. |
| |
| 6. Otherwise, end the cluster. |
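| |
| Rule 3 above amounts to a small lookup on the types of two adjacent |
| Hangul characters. The C sketch below models only that rule; the names |
| hangul_type and hangul_no_break() are hypothetical, and assigning a |
| type to a real code point (which needs the Unicode tables) is left out. |
| |
```c
/* Hypothetical names for the five Hangul syllable types in rule 3. */
typedef enum {
    HANGUL_L, HANGUL_V, HANGUL_T, HANGUL_LV, HANGUL_LVT
} hangul_type;

/* Returns 1 if, under rule 3, a cluster must not end between a
   character of type prev and a following character of type next. */
int hangul_no_break(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HANGUL_L:
        return next == HANGUL_L || next == HANGUL_V ||
               next == HANGUL_LV || next == HANGUL_LVT;
    case HANGUL_LV:
    case HANGUL_V:
        return next == HANGUL_V || next == HANGUL_T;
    case HANGUL_LVT:
    case HANGUL_T:
        return next == HANGUL_T;
    }
    return 0;
}
```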
| |
| PCRE2's additional properties |
| |
| As well as the standard Unicode properties described above, PCRE2 sup- |
| ports four more that make it possible to convert traditional escape |
| sequences such as \w and \s to use Unicode properties. PCRE2 uses these |
| non-standard, non-Perl properties internally when PCRE2_UCP is set. |
| However, they may also be used explicitly. These properties are: |
| |
| Xan Any alphanumeric character |
| Xps Any POSIX space character |
| Xsp Any Perl space character |
| Xwd Any Perl "word" character |
| |
| Xan matches characters that have either the L (letter) or the N (num- |
| ber) property. Xps matches the characters tab, linefeed, vertical tab, |
| form feed, or carriage return, and any other character that has the Z |
| (separator) property. Xsp is the same as Xps; in PCRE1 it used to |
| exclude vertical tab, for Perl compatibility, but Perl changed. Xwd |
| matches the same characters as Xan, plus underscore. |
| |
| There is another non-standard property, Xuc, which matches any charac- |
| ter that can be represented by a Universal Character Name in C++ and |
| other programming languages. These are the characters $, @, ` (grave |
| accent), and all characters with Unicode code points greater than or |
| equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that |
| most base (ASCII) characters are excluded. (Universal Character Names |
| are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. |
| Note that the Xuc property does not match these sequences but the char- |
| acters that they represent.) |
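| |
| The set matched by Xuc, as described above, can be written as a simple |
| code point test. The C sketch below is illustrative; the function name |
| is_xuc() is hypothetical, and the upper limit of U+10FFFF is an |
| assumption added here because that is the largest Unicode code point. |
| |
```c
#include <stdint.h>

/* Hypothetical model of the Xuc property. */
int is_xuc(uint32_t c)
{
    if (c == '$' || c == '@' || c == '`')
        return 1;
    if (c >= 0xD800 && c <= 0xDFFF)   /* surrogates are excluded */
        return 0;
    /* Assumption: nothing above U+10FFFF counts as a character. */
    return c >= 0x00A0 && c <= 0x10FFFF;
}
```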
| |
| Resetting the match start |
| |
| The escape sequence \K causes any previously matched characters not to |
| be included in the final matched sequence. For example, the pattern: |
| |
| foo\Kbar |
| |
| matches "foobar", but reports that it has matched "bar". This feature |
| is similar to a lookbehind assertion (described below). However, in |
| this case, the part of the subject before the real match does not have |
| to be of fixed length, as lookbehind assertions do. The use of \K does |
| not interfere with the setting of captured substrings. For example, |
| when the pattern |
| |
| (foo)\Kbar |
| |
| matches "foobar", the first substring is still set to "foo". |
| |
| Perl documents that the use of \K within assertions is "not well |
| defined". In PCRE2, \K is acted upon when it occurs inside positive |
| assertions, but is ignored in negative assertions. Note that when a |
| pattern such as (?=ab\K) matches, the reported start of the match can |
| be greater than the end of the match. |
| |
| Simple assertions |
| |
| The final use of backslash is for certain simple assertions. An asser- |
| tion specifies a condition that has to be met at a particular point in |
| a match, without consuming any characters from the subject string. The |
| use of subpatterns for more complicated assertions is described below. |
| The backslashed assertions are: |
| |
| \b matches at a word boundary |
| \B matches when not at a word boundary |
| \A matches at the start of the subject |
| \Z matches at the end of the subject |
| also matches before a newline at the end of the subject |
| \z matches only at the end of the subject |
| \G matches at the first matching position in the subject |
| |
| Inside a character class, \b has a different meaning; it matches the |
| backspace character. If any other of these assertions appears in a |
| character class, an "invalid escape sequence" error is generated. |
| |
| A word boundary is a position in the subject string where the current |
| character and the previous character do not both match \w or \W (i.e. |
| one matches \w and the other matches \W), or the start or end of the |
| string if the first or last character matches \w, respectively. In a |
| UTF mode, the meanings of \w and \W can be changed by setting the |
| PCRE2_UCP option. When this is done, it also affects \b and \B. Neither |
| PCRE2 nor Perl has a separate "start of word" or "end of word" metase- |
| quence. However, whatever follows \b normally determines which it is. |
| For example, the fragment \ba matches "a" at the start of a word. |
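| |
| The definition of a word boundary given above can be expressed as a |
| comparison of the characters on either side of a position. The C sketch |
| below assumes the default (non-UCP, "C" locale) meaning of \w, that is, |
| ASCII letters, digits, and underscore; the function names are |
| hypothetical and are not part of the PCRE2 API. |
| |
```c
#include <ctype.h>
#include <stddef.h>

/* Assumed default \w: ASCII letter, digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return c == '_' || (c < 128 && isalnum(c));
}

/* Hypothetical model of \b at offset pos (0 <= pos <= len) in s.
   A missing neighbour at the start or end counts as a non-word
   character, so the ends are boundaries only next to a word character. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    int after = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```
| The assertion \B corresponds to the negation of at_word_boundary(). |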
| |
| The \A, \Z, and \z assertions differ from the traditional circumflex |
| and dollar (described in the next section) in that they only ever match |
| at the very start and end of the subject string, whatever options are |
| set. Thus, they are independent of multiline mode. These three asser- |
| tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options, |
| which affect only the behaviour of the circumflex and dollar metachar- |
| acters. However, if the startoffset argument of pcre2_match() is non- |
| zero, indicating that matching is to start at a point other than the |
| beginning of the subject, \A can never match. The difference between |
| \Z and \z is that \Z matches before a newline at the end of the string |
| as well as at the very end, whereas \z matches only at the end. |
| |
| The \G assertion is true only when the current matching position is at |
| the start point of the match, as specified by the startoffset argument |
| of pcre2_match(). It differs from \A when the value of startoffset is |
| non-zero. By calling pcre2_match() multiple times with appropriate |
| arguments, you can mimic Perl's /g option, and it is in this kind of |
| implementation that \G is useful. |
| |
| Note, however, that PCRE2's interpretation of \G, as the start of the |
| current match, is subtly different from Perl's, which defines it as the |
| end of the previous match. In Perl, these can be different when the |
| previously matched string was empty. Because PCRE2 does just one match |
| at a time, it cannot reproduce this behaviour. |
| |
| If all the alternatives of a pattern begin with \G, the expression is |
| anchored to the starting match position, and the "anchored" flag is set |
| in the compiled regular expression. |
| |
| |
| CIRCUMFLEX AND DOLLAR |
| |
| The circumflex and dollar metacharacters are zero-width assertions. |
| That is, they test for a particular condition being true without con- |
| suming any characters from the subject string. These two metacharacters |
| are concerned with matching the starts and ends of lines. If the new- |
| line convention is set so that only the two-character sequence CRLF is |
| recognized as a newline, isolated CR and LF characters are treated as |
| ordinary data characters, and are not recognized as newlines. |
| |
| Outside a character class, in the default matching mode, the circumflex |
| character is an assertion that is true only if the current matching |
| point is at the start of the subject string. If the startoffset argu- |
| ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum- |
| flex can never match if the PCRE2_MULTILINE option is unset. Inside a |
| character class, circumflex has an entirely different meaning (see |
| below). |
| |
| Circumflex need not be the first character of the pattern if a number |
| of alternatives are involved, but it should be the first thing in each |
| alternative in which it appears if the pattern is ever to match that |
| branch. If all possible alternatives start with a circumflex, that is, |
| if the pattern is constrained to match only at the start of the sub- |
| ject, it is said to be an "anchored" pattern. (There are also other |
| constructs that can cause a pattern to be anchored.) |
| |
| The dollar character is an assertion that is true only if the current |
| matching point is at the end of the subject string, or immediately |
| before a newline at the end of the string (by default), unless |
| PCRE2_NOTEOL is set. Note, however, that it does not actually match the |
| newline. Dollar need not be the last character of the pattern if a num- |
| ber of alternatives are involved, but it should be the last item in any |
| branch in which it appears. Dollar has no special meaning in a charac- |
| ter class. |
| |
| The meaning of dollar can be changed so that it matches only at the |
| very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at |
| compile time. This does not affect the \Z assertion. |
| |
| The meanings of the circumflex and dollar metacharacters are changed if |
| the PCRE2_MULTILINE option is set. When this is the case, a dollar |
| character matches before any newlines in the string, as well as at the |
| very end, and a circumflex matches immediately after internal newlines |
| as well as at the start of the subject string. It does not match after |
| a newline that ends the string, for compatibility with Perl. However, |
| this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option. |
| |
| For example, the pattern /^abc$/ matches the subject string "def\nabc" |
| (where \n represents a newline) in multiline mode, but not otherwise. |
| Consequently, patterns that are anchored in single line mode because |
| all branches start with ^ are not anchored in multiline mode, and a |
| match for circumflex is possible when the startoffset argument of |
| pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored |
| if PCRE2_MULTILINE is set. |
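| |
| The default and multiline behaviour of circumflex and dollar can be |
| modelled by the C sketch below. It makes several simplifying |
| assumptions: the newline convention is a single LF, startoffset is |
| zero, and none of PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_DOLLAR_ENDONLY, or |
| PCRE2_ALT_CIRCUMFLEX is set. The function names are hypothetical. |
| |
```c
#include <stddef.h>

/* Hypothetical model of ^ at offset pos (0 <= pos <= len) in s. */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
    if (pos == 0)
        return 1;
    /* In multiline mode, also after an internal newline, but not after
       a newline that ends the string. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Hypothetical model of $ at offset pos (0 <= pos <= len) in s. */
int dollar_matches(const char *s, size_t len, size_t pos, int multiline)
{
    if (pos == len)
        return 1;
    if (multiline)
        return s[pos] == '\n';              /* before any newline */
    return pos + 1 == len && s[pos] == '\n'; /* before a final newline */
}
```
| For the subject "def\nabc", circumflex_matches() is true at offset 4 |
| only when multiline is non-zero, which is why /^abc$/ matches that |
| subject in multiline mode but not otherwise. |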
| |
| When the newline convention (see "Newline conventions" below) recog- |
| nizes the two-character sequence CRLF as a newline, this is preferred, |
| even if the single characters CR and LF are also recognized as new- |
| lines. For example, if the newline convention is "any", a multiline |
| mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather |
| than after CR, even though CR on its own is a valid newline. (It also |
| matches at the very start of the string, of course.) |
| |
| Note that the sequences \A, \Z, and \z can be used to match the start |
| and end of the subject in both modes, and if all branches of a pattern |
| start with \A it is always anchored, whether or not PCRE2_MULTILINE is |
| set. |
| |
| |
| FULL STOP (PERIOD, DOT) AND \N |
| |
| Outside a character class, a dot in the pattern matches any one charac- |
| ter in the subject string except (by default) a character that signi- |
| fies the end of a line. |
| |
| When a line ending is defined as a single character, dot never matches |
| that character; when the two-character sequence CRLF is used, dot does |
| not match CR if it is immediately followed by LF, but otherwise it |
| matches all characters (including isolated CRs and LFs). When any Uni- |
| code line endings are being recognized, dot does not match CR or LF or |
| any of the other line ending characters. |
| |
| The behaviour of dot with regard to newlines can be changed. If the |
| PCRE2_DOTALL option is set, a dot matches any one character, without |
| exception. If the two-character sequence CRLF is present in the sub- |
| ject string, it takes two dots to match it. |
| |
| The handling of dot is entirely independent of the handling of circum- |
| flex and dollar, the only relationship being that they both involve |
| newlines. Dot has no special meaning in a character class. |
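| |
| The dot rules can be tried out with a short sketch. Python's re module |
| is used here only as an approximation of PCRE2 (an assumption); it |
| recognizes LF alone as a line ending, and its re.DOTALL flag plays the |
| role of PCRE2_DOTALL: |

```python
import re

# By default a dot does not match the newline character.
no_dotall = re.fullmatch(r"a.c", "a\nc")

# With "dotall" set, a dot matches any one character, newline included.
with_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so one dot is not enough even in dotall mode.
crlf_one_dot = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)
```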
| |
| The escape sequence \N behaves like a dot, except that it is not |
| affected by the PCRE2_DOTALL option. In other words, it matches any |
| character except one that signifies the end of a line. Perl also uses |
| \N to match characters by name; PCRE2 does not support this. |
| |
| |
| MATCHING A SINGLE CODE UNIT |
| |
| Outside a character class, the escape sequence \C matches any one code |
| unit, whether or not a UTF mode is set. In the 8-bit library, one code |
| unit is one byte; in the 16-bit library it is a 16-bit unit; in the |
| 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches |
| line-ending characters. The feature is provided in Perl in order to |
| match individual bytes in UTF-8 mode, but it is unclear how it can use- |
| fully be used. |
| |
| Because \C breaks up characters into individual code units, matching |
| one unit with \C in UTF-8 or UTF-16 mode means that the rest of the |
| string may start with a malformed UTF character. This has undefined |
| results, because PCRE2 assumes that it is matching character by charac- |
| ter in a valid UTF string (by default it checks the subject string's |
| validity at the start of processing unless the PCRE2_NO_UTF_CHECK |
| option is used). |
| |
| An application can lock out the use of \C by setting the |
| PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also |
| possible to build PCRE2 with the use of \C permanently disabled. |
| |
| PCRE2 does not allow \C to appear in lookbehind assertions (described |
| below) in a UTF mode, because this would make it impossible to calcu- |
| late the length of the lookbehind. Neither the alternative matching |
| function pcre2_dfa_match() nor the JIT optimizer supports \C in a UTF |
| mode. The former gives a match-time error; the latter fails to optimize |
| and so the match is always run using the interpreter. |
| |
| In general, the \C escape sequence is best avoided. However, one way of |
| using it that avoids the problem of malformed UTF characters is to use |
| a lookahead to check the length of the next character, as in this pat- |
| tern, which could be used with a UTF-8 string (ignore white space and |
| line breaks): |
| |
| (?| (?=[\x00-\x7f])(\C) | |
| (?=[\x80-\x{7ff}])(\C)(\C) | |
| (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | |
| (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) |
| |
| In this example, a group that starts with (?| resets the capturing |
| parentheses numbers in each alternative (see "Duplicate Subpattern Num- |
| bers" below). The assertions at the start of each branch check the next |
| UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes, |
| respectively. The character's individual bytes are then captured by the |
| appropriate number of \C groups. |
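| |
| The four code point ranges tested by the lookaheads correspond to the |
| UTF-8 encoded lengths. The sketch below is written in Python and does |
| not use \C at all (Python's re module has no such escape); the helper |
| name utf8_length is hypothetical and exists only to mirror the ranges: |

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character uses."""
    cp = ord(ch)
    if cp <= 0x7F:        # matches the [\x00-\x7f] branch
        return 1
    if cp <= 0x7FF:       # matches the [\x80-\x{7ff}] branch
        return 2
    if cp <= 0xFFFF:      # matches the [\x{800}-\x{ffff}] branch
        return 3
    return 4              # matches the [\x{10000}-\x{1fffff}] branch

# One sample character from each range.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]
lengths = [utf8_length(ch) for ch in samples]

# Cross-check against the real encoder.
encoded = [len(ch.encode("utf-8")) for ch in samples]
```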
| |
| |
| SQUARE BRACKETS AND CHARACTER CLASSES |
| |
| An opening square bracket introduces a character class, terminated by a |
| closing square bracket. A closing square bracket on its own is not spe- |
| cial by default. If a closing square bracket is required as a member |
| of the class, it should be the first data character in the class (after |
| an initial circumflex, if present) or escaped with a backslash. This |
| means that, by default, an empty class cannot be defined. However, if |
| the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at |
| the start does end the (empty) class. |
| |
| A character class matches a single character in the subject. A matched |
| character must be in the set of characters defined by the class, unless |
| the first character in the class definition is a circumflex, in which |
| case the subject character must not be in the set defined by the class. |
| If a circumflex is actually required as a member of the class, ensure |
| it is not the first character, or escape it with a backslash. |
| |
| For example, the character class [aeiou] matches any lower case vowel, |
| while [^aeiou] matches any character that is not a lower case vowel. |
| Note that a circumflex is just a convenient notation for specifying the |
| characters that are in the class by enumerating those that are not. A |
| class that starts with a circumflex is not an assertion; it still con- |
| sumes a character from the subject string, and therefore it fails if |
| the current pointer is at the end of the string. |
| |
| When caseless matching is set, any letters in a class represent both |
| their upper case and lower case versions, so for example, a caseless |
| [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not |
| match "A", whereas a caseful version would. |
| |
| Characters that might indicate line breaks are never treated in any |
| special way when matching character classes, whatever line-ending |
| sequence is in use, and whatever setting of the PCRE2_DOTALL and |
| PCRE2_MULTILINE options is used. A class such as [^a] always matches |
| one of these characters. |
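| |
| The points made so far about character classes can be checked with a |
| short sketch. It assumes Python's re module as a stand-in for PCRE2, |
| with re.IGNORECASE in place of caseless matching: |

```python
import re

vowel = re.fullmatch(r"[aeiou]", "e")
non_vowel = re.fullmatch(r"[^aeiou]", "x")

# A negated class still consumes a character, so it fails at end of string.
at_end = re.search(r"a[^b]", "a")

# Caseless matching: "A" counts as a vowel, so the negated class rejects it.
caseless_pos = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_neg = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_neg = re.fullmatch(r"[^aeiou]", "A")

# Newlines are not special inside a class: [^a] matches one.
newline = re.fullmatch(r"[^a]", "\n")
```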
| |
| The minus (hyphen) character can be used to specify a range of charac- |
| ters in a character class. For example, [d-m] matches any letter |
| between d and m, inclusive. If a minus character is required in a |
| class, it must be escaped with a backslash or appear in a position |
| where it cannot be interpreted as indicating a range, typically as the |
| first or last character in the class, or immediately after a range. For |
| example, [b-d-z] matches letters in the range b to d, a hyphen charac- |
| ter, or z. |
| |
| It is not possible to have the literal character "]" as the end charac- |
| ter of a range. A pattern such as [W-]46] is interpreted as a class of |
| two characters ("W" and "-") followed by a literal string "46]", so it |
| would match "W46]" or "-46]". However, if the "]" is escaped with a |
| backslash it is interpreted as the end of range, so [W-\]46] is inter- |
| preted as a class containing a range followed by two other characters. |
| The octal or hexadecimal representation of "]" can also be used to end |
| a range. |
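| |
| A sketch of the hyphen and "]" rules, assuming Python's re module, |
| which happens to parse these three classes the same way. The helper |
| name members is hypothetical: |

```python
import re

def members(pattern: str, candidates: str) -> str:
    """Return the candidate characters that the class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))

# A hyphen immediately after a range is a literal hyphen.
range_hyphen_z = members(r"[b-d-z]", "abcde-yz")

# [W-] is a two-character class; "46]" is a literal string after it.
w_dash = [s for s in ("W46]", "-46]", "X46]") if re.fullmatch(r"[W-]46]", s)]

# An escaped "]" can end a range: W to ], followed by "4" and "6".
escaped_end = members(r"[W-\]46]", "VWXZ]^46-")
```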
| |
| An error is generated if a POSIX character class (see below) or an |
| escape sequence other than one that defines a single character appears |
| at a point where a range ending character is expected. For example, |
| [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not. |
| |
| Ranges normally include all code points between the start and end char- |
| acters, inclusive. They can also be used for code points specified |
| numerically, for example [\000-\037]. Ranges can include any characters |
| that are valid for the current mode. |
| |
| There is a special case in EBCDIC environments for ranges whose end |
| points are both specified as literal letters in the same case. For com- |
| patibility with Perl, EBCDIC code points within the range that are not |
| letters are omitted. For example, [h-k] matches only four characters, |
| even though the codes for h and k are 0x88 and 0x92, a range of 11 code |
| points. However, if the range is specified numerically, for example, |
| [\x88-\x92] or [h-\x92], all code points are included. |
| |
| If a range that includes letters is used when caseless matching is set, |
| it matches the letters in either case. For example, [W-c] is equivalent |
| to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if |
| character tables for a French locale are in use, [\xc8-\xcb] matches |
| accented E characters in both cases. |
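| |
| The [W-c] equivalence can be checked with a small sketch. It assumes |
| Python's re module with re.IGNORECASE as an approximation of PCRE2 |
| caseless matching; the locale-dependent French example is not covered: |

```python
import re

equivalent = "][\\^_`wxyzabc"   # the characters listed in the text
caseless = re.compile(r"[W-c]", re.IGNORECASE)

# Every listed character matches, and so do the upper case letters.
all_listed = all(caseless.fullmatch(ch) for ch in equivalent)
upper_too = all(caseless.fullmatch(ch) for ch in "WXYZABC")

# Letters just outside the range do not match in either case.
outside = [ch for ch in "vVdD" if caseless.fullmatch(ch)]
```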
| |
| The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, |
| \w, and \W may appear in a character class, and add the characters that |
| they match to the class. For example, [\dABCDEF] matches any hexadeci- |
| mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of |
| \d, \s, \w and their upper case partners, just as it does when they |
| appear outside a character class, as described in the section entitled |
| "Generic character types" above. The escape sequence \b has a different |
| meaning inside a character class; it matches the backspace character. |
| The sequences \B, \N, \R, and \X are not special inside a character |
| class. Like any other unrecognized escape sequences, they cause an |
| error. |
| |
| A circumflex can conveniently be used with the upper case character |
| types to specify a more restricted set of characters than the matching |
| lower case type. For example, the class [^\W_] matches any letter or |
| digit, but not underscore, whereas [\w] includes underscore. A positive |
| character class should be read as "something OR something OR ..." and a |
| negative class as "NOT something AND NOT something AND NOT ...". |
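| |
| A brief sketch of escapes inside classes, assuming Python's re module. |
| The re.ASCII flag is used so that \d and \w stay restricted to ASCII, |
| which roughly mirrors PCRE2 when PCRE2_UCP is not set: |

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Inside a class, \b means the backspace character, not a word boundary.
backspace = re.compile(r"[\b]")

hex_hits = [ch for ch in "09AFGa" if hex_digit.fullmatch(ch)]
word_hits = [ch for ch in "a5_-" if letter_or_digit.fullmatch(ch)]
is_backspace = backspace.fullmatch("\x08") is not None
```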
| |
| The only metacharacters that are recognized in character classes are |
| backslash, hyphen (only where it can be interpreted as specifying a |
| range), circumflex (only at the start), opening square bracket (only |
| when it can be interpreted as introducing a POSIX class name, or for a |
| special compatibility feature - see the next two sections), and the |
| terminating closing square bracket. However, escaping other non- |
| alphanumeric characters does no harm. |
| |
| |
| POSIX CHARACTER CLASSES |
| |
| Perl supports the POSIX notation for character classes. This uses names |
| enclosed by [: and :] within the enclosing square brackets. PCRE2 also |
| supports this notation. For example, |
| |
| [01[:alpha:]%] |
| |
| matches "0", "1", any alphabetic character, or "%". The supported class |
| names are: |
| |
| alnum letters and digits |
| alpha letters |
| ascii character codes 0 - 127 |
| blank space or tab only |
| cntrl control characters |
| digit decimal digits (same as \d) |
| graph printing characters, excluding space |
| lower lower case letters |
| print printing characters, including space |
| punct printing characters, excluding letters and digits and space |
| space white space (the same as \s from PCRE 8.34) |
| upper upper case letters |
| word "word" characters (same as \w) |
| xdigit hexadecimal digits |
| |
| The default "space" characters are HT (9), LF (10), VT (11), FF (12), |
| CR (13), and space (32). If locale-specific matching is taking place, |
| the list of space characters may be different; there may be fewer or |
| more of them. "Space" and \s match the same set of characters. |
| |
| The name "word" is a Perl extension, and "blank" is a GNU extension |
| from Perl 5.8. Another Perl extension is negation, which is indicated |
| by a ^ character after the colon. For example, |
| |
| [12[:^digit:]] |
| |
| matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the |
| POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
| these are not supported, and an error is given if they are encountered. |
| |
| By default, characters with values greater than 127 do not match any of |
| the POSIX character classes, although this may be different for charac- |
| ters in the range 128-255 when locale-specific matching is happening. |
| However, if the PCRE2_UCP option is passed to pcre2_compile(), some of |
| the classes are changed so that Unicode character properties are used. |
| This is achieved by replacing certain POSIX classes with other |
| sequences, as follows: |
| |
| [:alnum:] becomes \p{Xan} |
| [:alpha:] becomes \p{L} |
| [:blank:] becomes \h |
| [:cntrl:] becomes \p{Cc} |
| [:digit:] becomes \p{Nd} |
| [:lower:] becomes \p{Ll} |
| [:space:] becomes \p{Xps} |
| [:upper:] becomes \p{Lu} |
| [:word:] becomes \p{Xwd} |
| |
| Negated versions, such as [:^alpha:] use \P instead of \p. Three other |
| POSIX classes are handled specially in UCP mode: |
| |
| [:graph:] This matches characters that have glyphs that mark the page |
| when printed. In Unicode property terms, it matches all char- |
| acters with the L, M, N, P, S, or Cf properties, except for: |
| |
| U+061C Arabic Letter Mark |
| U+180E Mongolian Vowel Separator |
| U+2066 - U+2069 Various "isolate"s |
| |
| |
| [:print:] This matches the same characters as [:graph:] plus space |
| characters that are not controls, that is, characters with |
| the Zs property. |
| |
| [:punct:] This matches all characters that have the Unicode P (punctua- |
| tion) property, plus those characters with code points less |
| than 256 that have the S (Symbol) property. |
| |
| The other POSIX classes are unchanged, and match only characters with |
| code points less than 256. |
| |
| |
| COMPATIBILITY FEATURE FOR WORD BOUNDARIES |
| |
| In the POSIX.2 compliant library that was included in 4.4BSD Unix, the |
| ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" |
| and "end of word". PCRE2 treats these items as follows: |
| |
| [[:<:]] is converted to \b(?=\w) |
| [[:>:]] is converted to \b(?<=\w) |
| |
| Only these exact character sequences are recognized. A sequence such as |
| [a[:<:]b] provokes an error for an unrecognized POSIX class name. This |
| support is not compatible with Perl. It is provided to help migrations |
| from other environments, and is best not used in any new patterns. Note |
| that \b matches at the start and the end of a word (see "Simple asser- |
| tions" above), and in a Perl-style pattern the preceding or following |
| character normally shows which is wanted, without the need for the |
| assertions that are used above in order to give exactly the POSIX be- |
| haviour. |
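| |
| The two replacement sequences can be tried directly. This sketch |
| assumes Python's re module, which does not recognize [[:<:]] or |
| [[:>:]] itself but does support \b and both kinds of assertion: |

```python
import re

text = "hi there"

# \b(?=\w): a word boundary followed by a word character (start of word).
starts = [m.start() for m in re.finditer(r"\b(?=\w)", text)]

# \b(?<=\w): a word boundary preceded by a word character (end of word).
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", text)]
```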
| |
| |
| VERTICAL BAR |
| |
| Vertical bar characters are used to separate alternative patterns. For |
| example, the pattern |
| |
| gilbert|sullivan |
| |
| matches either "gilbert" or "sullivan". Any number of alternatives may |
| appear, and an empty alternative is permitted (matching the empty |
| string). The matching process tries each alternative in turn, from left |
| to right, and the first one that succeeds is used. If the alternatives |
| are within a subpattern (defined below), "succeeds" means matching the |
| rest of the main pattern as well as the alternative in the subpattern. |
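| |
| A sketch of the left-to-right rule, assuming Python's re module, which |
| tries alternatives in the same order: |

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern: here the
# first alternative is abandoned because $ cannot match after "gil".
inside_group = re.match(r"(gil|gilbert)$", "gilbert").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|", "")
```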
| |
| |
| INTERNAL OPTION SETTING |
| |
| The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and |
| PCRE2_EXTENDED options (which are Perl-compatible) can be changed from |
| within the pattern by a sequence of Perl option letters enclosed |
| between "(?" and ")". The option letters are |
| |
| i for PCRE2_CASELESS |
| m for PCRE2_MULTILINE |
| s for PCRE2_DOTALL |
| x for PCRE2_EXTENDED |
| |
| For example, (?im) sets caseless, multiline matching. It is also possi- |
| ble to unset these options by preceding the letter with a hyphen, and a |
| combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE- |
| LESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and |
| PCRE2_EXTENDED, is also permitted. If a letter appears both before and |
| after the hyphen, the option is unset. An empty options setting "(?)" |
| is allowed. Needless to say, it has no effect. |
| |
| The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be |
| changed in the same way as the Perl-compatible options by using the |
| characters J and U respectively. |
| |
| When one of these option changes occurs at top level (that is, not |
| inside subpattern parentheses), the change applies to the remainder of |
| the pattern that follows. If the change is placed right at the start of |
| a pattern, PCRE2 extracts it into the global options (and it will |
| therefore show up in data extracted by the pcre2_pattern_info() func- |
| tion). |
| |
| An option change within a subpattern (see below for a description of |
| subpatterns) affects only that part of the subpattern that follows it, |
| so |
| |
| (a(?i)b)c |
| |
| matches abc and aBc and no other strings (assuming PCRE2_CASELESS is |
| not used). By this means, options can be made to have different set- |
| tings in different parts of the pattern. Any changes made in one alter- |
| native do carry on into subsequent branches within the same subpattern. |
| For example, |
| |
| (a(?i)b|c) |
| |
| matches "ab", "aB", "c", and "C", even though when matching "C" the |
| first branch is abandoned before the option setting. This is because |
| the effects of option settings happen at compile time. There would be |
| some very weird behaviour otherwise. |
| |
| As a convenient shorthand, if any option settings are required at the |
| start of a non-capturing subpattern (see the next section), the option |
| letters may appear between the "?" and the ":". Thus the two patterns |
| |
| (?i:saturday|sunday) |
| (?:(?i)saturday|sunday) |
| |
| match exactly the same set of strings. |
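| |
| A sketch of scoped option settings, assuming Python's re module. |
| Recent versions of that module do not accept an unscoped (?i) in the |
| middle of a pattern, so only the (?i:...) form is shown; a(?i:b)c is |
| used in place of (a(?i)b)c: |

```python
import re

scoped = re.compile(r"(?i:saturday|sunday)")
partly_caseless = re.compile(r"a(?i:b)c")

# The option applies to every branch inside the group.
days = [s for s in ("Saturday", "SUNDAY", "monday") if scoped.fullmatch(s)]

# Only the "b" is matched caselessly.
abc = [s for s in ("abc", "aBc", "Abc", "abC") if partly_caseless.fullmatch(s)]
```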
| |
| Note: There are other PCRE2-specific options that can be set by the |
| application when the compiling function is called. The pattern can con- |
| tain special leading sequences such as (*CRLF) to override what the |
| application has set or what has been defaulted. Details are given in |
| the section entitled "Newline sequences" above. There are also the |
| (*UTF) and (*UCP) leading sequences that can be used to set UTF and |
| Unicode property modes; they are equivalent to setting the PCRE2_UTF |
| and PCRE2_UCP options, respectively. However, the application can set |
| the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use |
| of the (*UTF) and (*UCP) sequences. |
| |
| |
| SUBPATTERNS |
| |
| Subpatterns are delimited by parentheses (round brackets), which can be |
| nested. Turning part of a pattern into a subpattern does two things: |
| |
| 1. It localizes a set of alternatives. For example, the pattern |
| |
| cat(aract|erpillar|) |
| |
| matches "cataract", "caterpillar", or "cat". Without the parentheses, |
| it would match "cataract", "erpillar" or an empty string. |
| |
| 2. It sets up the subpattern as a capturing subpattern. This means |
| that, when the whole pattern matches, the portion of the subject string |
| that matched the subpattern is passed back to the caller, separately |
| from the portion that matched the whole pattern. (This applies only to |
| the traditional matching function; the DFA matching function does not |
| support capturing.) |
| |
| Opening parentheses are counted from left to right (starting from 1) to |
| obtain numbers for the capturing subpatterns. For example, if the |
| string "the red king" is matched against the pattern |
| |
| the ((red|white) (king|queen)) |
| |
| the captured substrings are "red king", "red", and "king", and are num- |
| bered 1, 2, and 3, respectively. |
| |
| The fact that plain parentheses fulfil two functions is not always |
| helpful. There are often times when a grouping subpattern is required |
| without a capturing requirement. If an opening parenthesis is followed |
| by a question mark and a colon, the subpattern does not do any captur- |
| ing, and is not counted when computing the number of any subsequent |
| capturing subpatterns. For example, if the string "the white queen" is |
| matched against the pattern |
| |
| the ((?:red|white) (king|queen)) |
| |
| the captured substrings are "white queen" and "queen", and are numbered |
| 1 and 2. The maximum number of capturing subpatterns is 65535. |
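| |
| The numbering rules above can be reproduced with a short sketch. It |
| assumes Python's re module, which numbers capturing parentheses in the |
| same left-to-right way: |

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# Parentheses also localize the alternatives, including the empty one.
localized = [s for s in ("cataract", "caterpillar", "cat")
             if re.fullmatch(r"cat(aract|erpillar|)", s)]
```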
| |
| As a convenient shorthand, if any option settings are required at the |
| start of a non-capturing subpattern, the option letters may appear |
| between the "?" and the ":". Thus the two patterns |
| |
| (?i:saturday|sunday) |
| (?:(?i)saturday|sunday) |
| |
| match exactly the same set of strings. Because alternative branches are |
| tried from left to right, and options are not reset until the end of |
| the subpattern is reached, an option setting in one branch does affect |
| subsequent branches, so the above patterns match "SUNDAY" as well as |
| "Saturday". |
| |
| |
| DUPLICATE SUBPATTERN NUMBERS |
| |
| Perl 5.10 introduced a feature whereby each alternative in a subpattern |
| uses the same numbers for its capturing parentheses. Such a subpattern |
| starts with (?| and is itself a non-capturing subpattern. For example, |
| consider this pattern: |
| |
| (?|(Sat)ur|(Sun))day |
| |
| Because the two alternatives are inside a (?| group, both sets of cap- |
| turing parentheses are numbered one. Thus, when the pattern matches, |
| you can look at captured substring number one, whichever alternative |
| matched. This construct is useful when you want to capture part, but |
| not all, of one of a number of alternatives. Inside a (?| group, paren- |
| theses are numbered as usual, but the number is reset at the start of |
| each branch. The numbers of any capturing parentheses that follow the |
| subpattern start after the highest number used in any branch. The fol- |
| lowing example is taken from the Perl documentation. The numbers under- |
| neath show in which buffer the captured content will be stored. |
| |
| # before ---------------branch-reset----------- after |
| / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| # 1 2 2 3 2 3 4 |
| |
| A back reference to a numbered subpattern uses the most recent value |
| that is set for that number by any subpattern. The following pattern |
| matches "abcabc" or "defdef": |
| |
| /(?|(abc)|(def))\1/ |
| |
| In contrast, a subroutine call to a numbered subpattern always refers |
| to the first one in the pattern with the given number. The following |
| pattern matches "abcabc" or "defabc": |
| |
| /(?|(abc)|(def))(?1)/ |
| |
| A relative reference such as (?-1) is no different: it is just a conve- |
| nient way of computing an absolute group number. |
| |
| If a condition test for a subpattern's having matched refers to a non- |
| unique number, the test is true if any of the subpatterns of that num- |
| ber have matched. |
| |
| An alternative approach to using this "branch reset" feature is to use |
| duplicate named subpatterns, as described in the next section. |
| |
| |
| NAMED SUBPATTERNS |
| |
| Identifying capturing parentheses by number is simple, but it can be |
| very hard to keep track of the numbers in complicated regular expres- |
| sions. Furthermore, if an expression is modified, the numbers may |
| change. To help with this difficulty, PCRE2 supports the naming of sub- |
| patterns. This feature was not added to Perl until release 5.10. Python |
| had the feature earlier, and PCRE1 introduced it at release 4.0, using |
| the Python syntax. PCRE2 supports both the Perl and the Python syntax. |
| Perl allows identically numbered subpatterns to have different names, |
| but PCRE2 does not. |
| |
| In PCRE2, a subpattern can be named in one of three ways: (?<name>...) |
| or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
| to capturing parentheses from other parts of the pattern, such as back |
| references, recursion, and conditions, can be made by name as well as |
| by number. |
| |
| Names consist of up to 32 alphanumeric characters and underscores, but |
| must start with a non-digit. Named capturing parentheses are still |
| allocated numbers as well as names, exactly as if the names were not |
| present. The PCRE2 API provides function calls for extracting the name- |
| to-number translation table from a compiled pattern. There are also |
| convenience functions for extracting a captured substring by name. |
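| |
| A sketch of named groups using the Python syntax, run with Python's re |
| module as a stand-in (an assumption; the PCRE2 API functions mentioned |
| above are not shown). The group names year and month are illustrative: |

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2024-05")

# Named groups are still numbered as if the names were absent.
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))

# The name-to-number table of the compiled pattern.
name_table = dict(date.groupindex)
```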
| |
| By default, a name must be unique within a pattern, but it is possible |
| to relax this constraint by setting the PCRE2_DUPNAMES option at com- |
| pile time. (Duplicate names are also always permitted for subpatterns |
| with the same number, set up as described in the previous section.) |
| Duplicate names can be useful for patterns where only one instance of |
| the named parentheses can match. Suppose you want to match the name of |
| a weekday, either as a 3-letter abbreviation or as the full name, and |
| in both cases you want to extract the abbreviation. This pattern |
| (ignoring the line breaks) does the job: |
| |
| (?<DN>Mon|Fri|Sun)(?:day)?| |
| (?<DN>Tue)(?:sday)?| |
| (?<DN>Wed)(?:nesday)?| |
| (?<DN>Thu)(?:rsday)?| |
| (?<DN>Sat)(?:urday)? |
| |
| There are five capturing substrings, but only one is ever set after a |
| match. (An alternative way of solving this problem is to use a "branch |
| reset" subpattern, as described in the previous section.) |
| |
| The convenience functions for extracting the data by name return the |
| substring for the first (and in this example, the only) subpattern of |
| that name that matched. This saves searching to find which numbered |
| subpattern it was. |
| |
| If you make a back reference to a non-unique named subpattern from |
| elsewhere in the pattern, the subpatterns to which the name refers are |
| checked in the order in which they appear in the overall pattern. The |
| first one that is set is used for the reference. For example, this pat- |
| tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo": |
| |
| (?:(?<n>foo)|(?<n>bar))\k<n> |
| |
| |
| If you make a subroutine call to a non-unique named subpattern, the one |
| that corresponds to the first occurrence of the name is used. In the |
| absence of duplicate numbers (see the previous section) this is the one |
| with the lowest number. |
| |
| If you use a named reference in a condition test (see the section about |
| conditions below), either to check whether a subpattern has matched, or |
| to check for recursion, all subpatterns with the same name are tested. |
| If the condition is true for any one of them, the overall condition is |
| true. This is the same behaviour as testing by number. For further |
| details of the interfaces for handling named subpatterns, see the |
| pcre2api documentation. |
| |
| Warning: You cannot use different names to distinguish between two sub- |
| patterns with the same number because PCRE2 uses only the numbers when |
| matching. For this reason, an error is given at compile time if differ- |
| ent names are given to subpatterns with the same number. However, you |
| can always give the same name to subpatterns with the same number, even |
| when PCRE2_DUPNAMES is not set. |
| |
| |
| REPETITION |
| |
| Repetition is specified by quantifiers, which can follow any of the |
| following items: |
| |
| a literal data character |
| the dot metacharacter |
| the \C escape sequence |
| the \X escape sequence |
| the \R escape sequence |
| an escape such as \d or \pL that matches a single character |
| a character class |
| a back reference |
| a parenthesized subpattern (including most assertions) |
| a subroutine call to a subpattern (recursive or otherwise) |
| |
| The general repetition quantifier specifies a minimum and maximum num- |
| ber of permitted matches, by giving the two numbers in curly brackets |
| (braces), separated by a comma. The numbers must be less than 65536, |
| and the first must be less than or equal to the second. For example: |
| |
| z{2,4} |
| |
| matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
| special character. If the second number is omitted, but the comma is |
| present, there is no upper limit; if the second number and the comma |
| are both omitted, the quantifier specifies an exact number of required |
| matches. Thus |
| |
| [aeiou]{3,} |
| |
| matches at least 3 successive vowels, but may match many more, whereas |
| |
| \d{8} |
| |
| matches exactly 8 digits. An opening curly bracket that appears in a |
| position where a quantifier is not allowed, or one that does not match |
| the syntax of a quantifier, is taken as a literal character. For exam- |
| ple, {,6} is not a quantifier, but a literal string of four characters. |
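| |
| A sketch of the three quantifier forms, assuming Python's re module. |
| The {,6} case is deliberately left out, because Python treats it as a |
| quantifier and so differs from the behaviour described here: |

```python
import re

# {2,4}: between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_digits = re.fullmatch(r"\d{8}", "20240501")
seven_digits = re.fullmatch(r"\d{8}", "2024050")
```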
| |
| In UTF modes, quantifiers apply to characters rather than to individual |
| code units. Thus, for example, \x{100}{2} matches two characters, each |
| of which is represented by a two-byte sequence in a UTF-8 string. Simi- |
| larly, \X{3} matches three Unicode extended grapheme clusters, each of |
| which may be several code units long (and they may be of different |
| lengths). |
| |
| The quantifier {0} is permitted, causing the expression to behave as if |
| the previous item and the quantifier were not present. This may be use- |
| ful for subpatterns that are referenced as subroutines from elsewhere |
| in the pattern (but see also the section entitled "Defining subpatterns |
| for use by reference only" below). Items other than subpatterns that |
| have a {0} quantifier are omitted from the compiled pattern. |
| |
| For convenience, the three most common quantifiers have single-charac- |
| ter abbreviations: |
| |
| * is equivalent to {0,} |
| + is equivalent to {1,} |
| ? is equivalent to {0,1} |
| |
| It is possible to construct infinite loops by following a subpattern |
| that can match no characters with a quantifier that has no upper limit, |
| for example: |
| |
| (a?)* |
| |
| Earlier versions of Perl and PCRE1 used to give an error at compile |
| time for such patterns. However, because there are cases where this can |
| be useful, such patterns are now accepted, but if any repetition of the |
| subpattern does in fact match no characters, the loop is forcibly bro- |
| ken. |
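| |
| As a quick check that such a pattern is accepted and terminates, here |
| is a sketch that assumes Python's re module, which also breaks out of |
| empty iterations. The value captured by the group is not examined, |
| since engines may differ on it: |

```python
import re

looped = re.fullmatch(r"(a?)*", "aaa")
followed = re.fullmatch(r"(a?)*b", "aab")
empty_subject = re.fullmatch(r"(a?)*", "")
```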
| |
| By default, the quantifiers are "greedy", that is, they match as much |
| as possible (up to the maximum number of permitted times), without |
| causing the rest of the pattern to fail. The classic example of where |
| this gives problems is in trying to match comments in C programs. These |
| appear between /* and */ and within the comment, individual * and / |
| characters may appear. An attempt to match C comments by applying the |
| pattern |
| |
| /\*.*\*/ |
| |
| to the string |
| |
| /* first comment */ not comment /* second comment */ |
| |
| fails, because it matches the entire string owing to the greediness of |
| the .* item. |
| |
| If a quantifier is followed by a question mark, it ceases to be greedy, |
| and instead matches the minimum number of times possible, so the pat- |
| tern |
| |
| /\*.*?\*/ |
| |
| does the right thing with the C comments. The meaning of the various |
| quantifiers is not otherwise changed, just the preferred number of |
| matches. Do not confuse this use of question mark with its use as a |
| quantifier in its own right. Because it has two uses, it can sometimes |
| appear doubled, as in |
| |
| \d??\d |
| |
| which matches one digit by preference, but can match two if that is the |
| only way the rest of the pattern matches. |
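| |
| The greedy and non-greedy cases can be compared side by side. This |
| sketch assumes Python's re module, whose greedy and lazy quantifiers |
| follow the same rules for these patterns: |

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the string.
greedy = re.search(r"/\*.*\*/", source).group()

# Lazy .*? stops at the first */ it can.
lazy = re.search(r"/\*.*?\*/", source).group()

# \d??\d prefers one digit, but takes two when the rest of the pattern
# (here, $) cannot match otherwise.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.match(r"\d??\d$", "12").group()
```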
| |
| If the PCRE2_UNGREEDY option is set (an option that is not available in |
| Perl), the quantifiers are not greedy by default, but individual ones |
| can be made greedy by following them with a question mark. In other |
| words, it inverts the default behaviour. |
| |
| When a parenthesized subpattern is quantified with a minimum repeat |
| count that is greater than 1 or with a limited maximum, more memory is |
| required for the compiled pattern, in proportion to the size of the |
| minimum or maximum. |
| |
| If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option |
| (equivalent to Perl's /s) is set, thus allowing the dot to match new- |
| lines, the pattern is implicitly anchored, because whatever follows |
| will be tried against every character position in the subject string, |
| so there is no point in retrying the overall match at any position |
| after the first. PCRE2 normally treats such a pattern as though it were |
| preceded by \A. |
| |
| In cases where it is known that the subject string contains no new- |
| lines, it is worth setting PCRE2_DOTALL in order to obtain this opti- |
| mization, or alternatively, using ^ to indicate anchoring explicitly. |
| |
| However, there are some cases where the optimization cannot be used. |
| When .* is inside capturing parentheses that are the subject of a back |
| reference elsewhere in the pattern, a match at the start may fail where |
| a later one succeeds. Consider, for example: |
| |
| (.*)abc\1 |
| |
| If the subject is "xyz123abc123" the match point is the fourth charac- |
| ter. For this reason, such a pattern is not implicitly anchored. |
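| |
| The match point in this example can be checked with a short sketch. It |
| uses Python's re module as a stand-in for PCRE2 (an assumption); Python |
| offsets are zero-based, so the fourth character is offset 3. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.
m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text after "abc"
# does not repeat what (.*) captured; the attempt at offset 3 succeeds.
start = m.start()
captured = m.group(1)
whole = m.group(0)

print(start, captured, whole)
```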
| |
| Another case where implicit anchoring is not applied is when the lead- |
| ing .* is inside an atomic group. Once again, a match at the start may |
| fail where a later one succeeds. Consider this pattern: |
| |
| (?>.*?a)b |
| |
| It matches "ab" in the subject "aab". The use of the backtracking con- |
| trol verbs (*PRUNE) and (*SKIP) also disables this optimization, and |
| there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. |
| |
| When a capturing subpattern is repeated, the value captured is the sub- |
| string that matched the final iteration. For example, after |
| |
| (tweedle[dume]{3}\s*)+ |
| |
| has matched "tweedledum tweedledee" the value of the captured substring |
| is "tweedledee". However, if there are nested capturing subpatterns, |
| the corresponding captured values may have been set in previous itera- |
| tions. For example, after |
| |
| (a|(b))+ |
| |
| matches "aba" the value of the second captured substring is "b". |
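| |
| A minimal sketch of both cases, using Python's re module as a stand-in |
| for PCRE2 (an assumption; the behaviour shown is the same for these two |
| patterns). The names m1 and m2 are invented for the example. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
# Group 1 holds what the final iteration of the repeat matched.
last_iteration = m1.group(1)

m2 = re.fullmatch(r"(a|(b))+", "aba")
# The final iteration matched "a" through the first alternative, so the
# nested group 2 still holds the value set in the previous iteration.
outer = m2.group(1)
inner = m2.group(2)

print(last_iteration, outer, inner)
```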
| |
| |
| ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
| |
| With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| repetition, failure of what follows normally causes the repeated item |
| to be re-evaluated to see if a different number of repeats allows the |
| rest of the pattern to match. Sometimes it is useful to prevent this, |
| either to change the nature of the match, or to cause it to fail earlier |
| than it otherwise might, when the author of the pattern knows there is |
| no point in carrying on. |
| |
| Consider, for example, the pattern \d+foo when applied to the subject |
| line |
| |
| 123456bar |
| |
| After matching all 6 digits and then failing to match "foo", the normal |
| action of the matcher is to try again with only 5 digits matching the |
| \d+ item, and then with 4, and so on, before ultimately failing. |
| "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides |
| the means for specifying that once a subpattern has matched, it is not |
| to be re-evaluated in this way. |
| |
| If we use atomic grouping for the previous example, the matcher gives |
| up immediately on failing to match "foo" the first time. The notation |
| is a kind of special parenthesis, starting with (?> as in this example: |
| |
| (?>\d+)foo |
| |
| This kind of parenthesis "locks up" the part of the pattern it con- |
| tains once it has matched, and a failure further into the pattern is |
| prevented from backtracking into it. Backtracking past it to previous |
| items, however, works as normal. |
| |
| An alternative description is that a subpattern of this type matches |
| exactly the string of characters that an identical standalone pattern |
| would match, if anchored at the current point in the subject string. |
| |
| Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
| such as the above example can be thought of as a maximizing repeat that |
| must swallow everything it can. So, while both \d+ and \d+? are pre- |
| pared to adjust the number of digits they match in order to make the |
| rest of the pattern match, (?>\d+) can only match an entire sequence of |
| digits. |
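| |
| The effect can be seen in a small sketch. It uses Python's re module, |
| which accepts (?>...) only from version 3.11, so the atomic group is |
| emulated here with a lookahead that captures the digits followed by a |
| back reference that consumes them. That emulation is a substitute tech- |
| nique chosen for portability, not PCRE2 syntax; it relies on lookahead |
| assertions not being re-entered once they have matched. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
plain = re.search(r"\d+\d", "123")

# Emulated atomic group, playing the role of (?>\d+)\d.
# The lookahead captures the whole run of digits; \1 then consumes exactly
# that run, and nothing is given back, so the final \d can never match.
emulated = re.search(r"(?=(\d+))\1\d", "123")

print(plain.group())
print(emulated)
```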
| |
| Atomic groups in general can of course contain arbitrarily complicated |
| subpatterns, and can be nested. However, when the subpattern for an |
| atomic group is just a single repeated item, as in the example above, a |
| simpler notation, called a "possessive quantifier" can be used. This |
| consists of an additional + character following a quantifier. Using |
| this notation, the previous example can be rewritten as |
| |
| \d++foo |
| |
| Note that a possessive quantifier can be used with an entire group, for |
| example: |
| |
| (abc|xyz){2,3}+ |
| |
| Possessive quantifiers are always greedy; the setting of the |
| PCRE2_UNGREEDY option is ignored. They are a convenient notation for |
| the simpler forms of atomic group. However, there is no difference in |
| the meaning of a possessive quantifier and the equivalent atomic group, |
| though there may be a performance difference; possessive quantifiers |
| should be slightly faster. |
| |
| The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| tax. Jeffrey Friedl originated the idea (and the name) in the first |
| edition of his book. Mike McCloskey liked it, so implemented it when he |
| built Sun's Java package, and PCRE1 copied it from there. It ultimately |
| found its way into Perl at release 5.10. |
| |
| PCRE2 has an optimization that automatically "possessifies" certain |
| simple pattern constructs. For example, the sequence A+B is treated as |
| A++B because there is no point in backtracking into a sequence of A's |
| when B must follow. This feature can be disabled by the PCRE2_NO_AUTO- |
| POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS). |
| |
| When a pattern contains an unlimited repeat inside a subpattern that |
| can itself be repeated an unlimited number of times, the use of an |
| atomic group is the only way to avoid some failing matches taking a |
| very long time indeed. The pattern |
| |
| (\D+|<\d+>)*[!?] |
| |
| matches an unlimited number of substrings that either consist of non- |
| digits, or digits enclosed in <>, followed by either ! or ?. When it |
| matches, it runs quickly. However, if it is applied to |
| |
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| |
| it takes a long time before reporting failure. This is because the |
| string can be divided between the internal \D+ repeat and the external |
| * repeat in a large number of ways, and all have to be tried. (The |
| example uses [!?] rather than a single character at the end, because |
| both PCRE2 and Perl have an optimization that allows for fast failure |
| when a single character is used. They remember the last single charac- |
| ter that is required for a match, and fail early if it is not present |
| in the string.) If the pattern is changed so that it uses an atomic |
| group, like this: |
| |
| ((?>\D+)|<\d+>)*[!?] |
| |
| sequences of non-digits cannot be broken, and failure happens quickly. |
| |
| |
| BACK REFERENCES |
| |
| Outside a character class, a backslash followed by a digit greater than |
| 0 (and possibly further digits) is a back reference to a capturing sub- |
| pattern earlier (that is, to its left) in the pattern, provided there |
| have been that many previous capturing left parentheses. |
| |
| However, if the decimal number following the backslash is less than 8, |
| it is always taken as a back reference, and causes an error only if |
| there are not that many capturing left parentheses in the entire pat- |
| tern. In other words, the parentheses that are referenced need not be |
| to the left of the reference for numbers less than 8. A "forward back |
| reference" of this type can make sense when a repetition is involved |
| and the subpattern to the right has participated in an earlier itera- |
| tion. |
| |
| It is not possible to have a numerical "forward back reference" to a |
| subpattern whose number is 8 or more using this syntax because a |
| sequence such as \50 is interpreted as a character defined in octal. |
| See the subsection entitled "Non-printing characters" above for further |
| details of the handling of digits following a backslash. There is no |
| such problem when named parentheses are used. A back reference to any |
| subpattern is possible using named parentheses (see below). |
| |
| Another way of avoiding the ambiguity inherent in the use of digits |
| following a backslash is to use the \g escape sequence. This escape |
| must be followed by an unsigned number or a negative number, optionally |
| enclosed in braces. These examples are all identical: |
| |
| (ring), \1 |
| (ring), \g1 |
| (ring), \g{1} |
| |
| An unsigned number specifies an absolute reference without the ambigu- |
| ity that is present in the older syntax. It is also useful when literal |
| digits follow the reference. A negative number is a relative reference. |
| Consider this example: |
| |
| (abc(def)ghi)\g{-1} |
| |
| The sequence \g{-1} is a reference to the most recently started captur- |
| ing subpattern before \g, that is, it is equivalent to \2 in this exam- |
| ple. Similarly, \g{-2} would be equivalent to \1. The use of relative |
| references can be helpful in long patterns, and also in patterns that |
| are created by joining together fragments that contain references |
| within themselves. |
| |
| A back reference matches whatever actually matched the capturing sub- |
| pattern in the current subject string, rather than anything matching |
| the subpattern itself (see "Subpatterns as subroutines" below for a way |
| of doing that). So the pattern |
| |
| (sens|respons)e and \1ibility |
| |
| matches "sense and sensibility" and "response and responsibility", but |
| not "sense and responsibility". If caseful matching is in force at the |
| time of the back reference, the case of letters is relevant. For exam- |
| ple, |
| |
| ((?i)rah)\s+\1 |
| |
| matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| original capturing subpattern is matched caselessly. |
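| |
| Both examples can be tried with the following sketch, which uses |
| Python's re module as a stand-in for PCRE2 (an assumption). Recent |
| Python versions reject (?i) in the middle of a pattern, so the scoped |
| form (?i:...) is used inside the capturing group instead; the helper |
| names are invented for the example. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.
pairing = re.compile(r"(sens|respons)e and \1ibility")

def pairs(text):
    """True if the back reference repeats what group 1 actually matched."""
    return pairing.fullmatch(text) is not None

# Group 1 is matched caselessly, but \1 lies outside the (?i:...) scope,
# so it compares case-sensitively against the captured text.
rah = re.compile(r"((?i:rah))\s+\1")

def cheers(text):
    return rah.fullmatch(text) is not None

print(pairs("sense and sensibility"), pairs("sense and responsibility"))
print(cheers("rah rah"), cheers("RAH RAH"), cheers("RAH rah"))
```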
| |
| There are several different ways of writing back references to named |
| subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
| \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
| unified back reference syntax, in which \g can be used for both numeric |
| and named references, is also supported. We could rewrite the above |
| example in any of the following ways: |
| |
| (?<p1>(?i)rah)\s+\k<p1> |
| (?'p1'(?i)rah)\s+\k{p1} |
| (?P<p1>(?i)rah)\s+(?P=p1) |
| (?<p1>(?i)rah)\s+\g{p1} |
| |
| A subpattern that is referenced by name may appear in the pattern |
| before or after the reference. |
| |
| There may be more than one back reference to the same subpattern. If a |
| subpattern has not actually been used in a particular match, any back |
| references to it always fail by default. For example, the pattern |
| |
| (a|(bc))\2 |
| |
| always fails if it starts to match "a" rather than "bc". However, if |
| the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a back |
| reference to an unset value matches an empty string. |
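| |
| The default behaviour can be sketched as follows, using Python's re |
| module as a stand-in for PCRE2 (an assumption). Python has no counter- |
| part to PCRE2_MATCH_UNSET_BACKREF, so only the default is shown. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.
pattern = re.compile(r"(a|(bc))\2")

# The first alternative leaves group 2 unset, so \2 cannot match.
via_a = pattern.search("aa")

# The second alternative sets group 2, so \2 repeats "bc".
via_bc = pattern.search("bcbc")

print(via_a)
print(via_bc.group())
```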
| |
| Because there may be many capturing parentheses in a pattern, all dig- |
| its following a backslash are taken as part of a potential back refer- |
| ence number. If the pattern continues with a digit character, some |
| delimiter must be used to terminate the back reference. If the |
| PCRE2_EXTENDED option is set, this can be white space. Otherwise, the |
| \g{ syntax or an empty comment (see "Comments" below) can be used. |
| |
| Recursive back references |
| |
| A back reference that occurs inside the parentheses to which it refers |
| fails when the subpattern is first used, so, for example, (a\1) never |
| matches. However, such references can be useful inside repeated sub- |
| patterns. For example, the pattern |
| |
| (a|b\1)+ |
| |
| matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
| ation of the subpattern, the back reference matches the character |
| string corresponding to the previous iteration. In order for this to |
| work, the pattern must be such that the first iteration does not need |
| to match the back reference. This can be done using alternation, as in |
| the example above, or by a quantifier with a minimum of zero. |
| |
| Back references of this type cause the group that they reference to be |
| treated as an atomic group. Once the whole group has been matched, a |
| subsequent matching failure cannot cause backtracking into the middle |
| of the group. |
| |
| |
| ASSERTIONS |
| |
| An assertion is a test on the characters following or preceding the |
| current matching point that does not consume any characters. The simple |
| assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described |
| above. |
| |
| More complicated assertions are coded as subpatterns. There are two |
| kinds: those that look ahead of the current position in the subject |
| string, and those that look behind it. An assertion subpattern is |
| matched in the normal way, except that it does not cause the current |
| matching position to be changed. |
| |
| Assertion subpatterns are not capturing subpatterns. If such an asser- |
| tion contains capturing subpatterns within it, these are counted for |
| the purposes of numbering the capturing subpatterns in the whole pat- |
| tern. However, substring capturing is carried out only for positive |
| assertions. (Perl sometimes, but not always, does do capturing in nega- |
| tive assertions.) |
| |
| For compatibility with Perl, most assertion subpatterns may be |
| repeated; though it makes no sense to assert the same thing several |
| times, the side effect of capturing parentheses may occasionally be |
| useful. However, an assertion that forms the condition for a condi- |
| tional subpattern may not be quantified. In practice, for other asser- |
| tions, there are only three cases: |
| |
| (1) If the quantifier is {0}, the assertion is never obeyed during |
| matching. However, it may contain internal capturing parenthesized |
| groups that are called from elsewhere via the subroutine mechanism. |
| |
| (2) If the quantifier is {0,n} where n is greater than zero, it is treated |
| as if it were {0,1}. At run time, the rest of the pattern match is |
| tried with and without the assertion, the order depending on the greed- |
| iness of the quantifier. |
| |
| (3) If the minimum repetition is greater than zero, the quantifier is |
| ignored. The assertion is obeyed just once when encountered during |
| matching. |
| |
| Lookahead assertions |
| |
| Lookahead assertions start with (?= for positive assertions and (?! for |
| negative assertions. For example, |
| |
| \w+(?=;) |
| |
| matches a word followed by a semicolon, but does not include the semi- |
| colon in the match, and |
| |
| foo(?!bar) |
| |
| matches any occurrence of "foo" that is not followed by "bar". Note |
| that the apparently similar pattern |
| |
| (?!foo)bar |
| |
| does not find an occurrence of "bar" that is preceded by something |
| other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| the assertion (?!foo) is always true when the next three characters are |
| "bar". A lookbehind assertion is needed to achieve the other effect. |
| |
| If you want to force a matching failure at some point in a pattern, the |
| most convenient way to do it is with (?!) because an empty string |
| always matches, so an assertion that requires there not to be an empty |
| string must always fail. The backtracking control verb (*FAIL) or (*F) |
| is a synonym for (?!). |
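| |
| The lookahead examples above can be tried with this sketch, which uses |
| Python's re module as a stand-in for PCRE2 (an assumption; the sub- |
| ject strings and variable names are invented for the example). |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Words followed by a semicolon; the semicolon is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# "foo" not followed by "bar": only the second "foo" qualifies.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead, not behind, so it still finds "bar" after "foo".
misleading = re.search(r"(?!foo)bar", "foobar")

# An empty negative lookahead can never succeed, forcing a failure.
forced_failure = re.search(r"foo(?!)", "foo")

print(words, foo_starts, misleading.start(), forced_failure)
```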
| |
| Lookbehind assertions |
| |
| Lookbehind assertions start with (?<= for positive assertions and (?<! |
| for negative assertions. For example, |
| |
| (?<!foo)bar |
| |
| does find an occurrence of "bar" that is not preceded by "foo". The |
| contents of a lookbehind assertion are restricted such that all the |
| strings it matches must have a fixed length. However, if there are sev- |
| eral top-level alternatives, they do not all have to have the same |
| fixed length. Thus |
| |
| (?<=bullock|donkey) |
| |
| is permitted, but |
| |
| (?<!dogs?|cats?) |
| |
| causes an error at compile time. Branches that match different length |
| strings are permitted only at the top level of a lookbehind assertion. |
| This is an extension compared with Perl, which requires all branches to |
| match the same length of string. An assertion such as |
| |
| (?<=ab(c|de)) |
| |
| is not permitted, because its single top-level branch can match two |
| different lengths, but it is acceptable to PCRE2 if rewritten to use |
| two top-level branches: |
| |
| (?<=abc|abde) |
| |
| In some cases, the escape sequence \K (see above) can be used instead |
| of a lookbehind assertion to get round the fixed-length restriction. |
| |
| The implementation of lookbehind assertions is, for each alternative, |
| to temporarily move the current position back by the fixed length and |
| then try to match. If there are insufficient characters before the cur- |
| rent position, the assertion fails. |
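| |
| A short sketch of fixed-length lookbehind, using Python's re module as |
| a stand-in for PCRE2 (an assumption). Python is stricter than PCRE2 |
| here: it requires one fixed width for the whole lookbehind, so the |
| (?<=bullock|donkey) example is not reproduced. The last two lines show |
| what "insufficient characters" means for each kind of assertion. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Only the "bar" that is not preceded by "foo" is found.
hit = re.search(r"(?<!foo)bar", "foobar bazbar")

# At the very start of the subject there are not three characters to step
# back over, so the inner test for "foo" cannot succeed. A positive
# lookbehind therefore fails, and a negative lookbehind is satisfied.
positive_at_start = re.search(r"(?<=foo)bar", "bar")
negative_at_start = re.search(r"(?<!foo)bar", "bar")

print(hit.start(), positive_at_start, negative_at_start.start())
```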
| |
| In a UTF mode, PCRE2 does not allow the \C escape (which matches a sin- |
| gle code unit even in a UTF mode) to appear in lookbehind assertions, |
| because it makes it impossible to calculate the length of the lookbe- |
| hind. The \X and \R escapes, which can match different numbers of code |
| units, are also not permitted. |
| |
| "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| lookbehinds, as long as the subpattern matches a fixed-length string. |
| Recursion, however, is not supported. |
| |
| Possessive quantifiers can be used in conjunction with lookbehind |
| assertions to specify efficient matching of fixed-length strings at the |
| end of subject strings. Consider a simple pattern such as |
| |
| abcd$ |
| |
| when applied to a long string that does not match. Because matching |
| proceeds from left to right, PCRE2 will look for each "a" in the sub- |
| ject and then see if what follows matches the rest of the pattern. If |
| the pattern is specified as |
| |
| ^.*abcd$ |
| |
| the initial .* matches the entire string at first, but when this fails |
| (because there is no following "a"), it backtracks to match all but the |
| last character, then all but the last two characters, and so on. Once |
| again the search for "a" covers the entire string, from right to left, |
| so we are no better off. However, if the pattern is written as |
| |
| ^.*+(?<=abcd) |
| |
| there can be no backtracking for the .*+ item because of the possessive |
| quantifier; it can match only the entire string. The subsequent lookbe- |
| hind assertion does a single test on the last four characters. If it |
| fails, the match fails immediately. For long strings, this approach |
| makes a significant difference to the processing time. |
| |
| Using multiple assertions |
| |
| Several assertions (of any sort) may occur in succession. For example, |
| |
| (?<=\d{3})(?<!999)foo |
| |
| matches "foo" preceded by three digits that are not "999". Notice that |
| each of the assertions is applied independently at the same point in |
| the subject string. First there is a check that the previous three |
| characters are all digits, and then there is a check that the same |
| three characters are not "999". This pattern does not match "foo" pre- |
| ceded by six characters, the first three of which are digits and the last |
| three of which are not "999". For example, it doesn't match "123abc- |
| foo". A pattern to do that is |
| |
| (?<=\d{3}...)(?<!999)foo |
| |
| This time the first assertion looks at the preceding six characters, |
| checking that the first three are digits, and then the second assertion |
| checks that the preceding three characters are not "999". |
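| |
| The difference between the two patterns can be checked with this |
| sketch, which uses Python's re module as a stand-in for PCRE2 (an |
| assumption; the test subjects other than "123abcfoo" are invented). |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now inspects six characters; the second still
# inspects only the three immediately before "foo".
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    return pattern.search(subject) is not None

print(found(same_three, "123foo"), found(same_three, "123abcfoo"))
print(found(six_back, "123abcfoo"), found(six_back, "123999foo"))
```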
| |
| Assertions can be nested in any combination. For example, |
| |
| (?<=(?<!foo)bar)baz |
| |
| matches an occurrence of "baz" that is preceded by "bar" which in turn |
| is not preceded by "foo", while |
| |
| (?<=\d{3}(?!999)...)foo |
| |
| is another pattern that matches "foo" preceded by three digits and any |
| three characters that are not "999". |
| |
| |
| CONDITIONAL SUBPATTERNS |
| |
| It is possible to cause the matching process to obey a subpattern con- |
| ditionally or to choose between two alternative subpatterns, depending |
| on the result of an assertion, or whether a specific capturing subpat- |
| tern has already been matched. The two possible forms of conditional |
| subpattern are: |
| |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| |
| If the condition is satisfied, the yes-pattern is used; otherwise the |
| no-pattern (if present) is used. If there are more than two alterna- |
| tives in the subpattern, a compile-time error occurs. Each of the two |
| alternatives may itself contain nested subpatterns of any form, includ- |
| ing conditional subpatterns; the restriction to two alternatives |
| applies only at the level of the condition. This pattern fragment is an |
| example where the alternatives are complex: |
| |
| (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
| |
| |
| There are five kinds of condition: references to subpatterns, refer- |
| ences to recursion, two pseudo-conditions called DEFINE and VERSION, |
| and assertions. |
| |
| Checking for a used subpattern by number |
| |
| If the text between the parentheses consists of a sequence of digits, |
| the condition is true if a capturing subpattern of that number has pre- |
| viously matched. If there is more than one capturing subpattern with |
| the same number (see the earlier section about duplicate subpattern |
| numbers), the condition is true if any of them have matched. An alter- |
| native notation is to precede the digits with a plus or minus sign. In |
| this case, the subpattern number is relative rather than absolute. The |
| most recently opened parentheses can be referenced by (?(-1), the next |
| most recent by (?(-2), and so on. Inside loops it can also make sense |
| to refer to subsequent groups. The next parentheses to be opened can be |
| referenced as (?(+1), and so on. (The value zero in any of these forms |
| is not used; it provokes a compile-time error.) |
| |
| Consider the following pattern, which contains non-significant white |
| space to make it more readable (assume the PCRE2_EXTENDED option) and |
| to divide it into three parts for ease of discussion: |
| |
| ( \( )? [^()]+ (?(1) \) ) |
| |
| The first part matches an optional opening parenthesis, and if that |
| character is present, sets it as the first captured substring. The sec- |
| ond part matches one or more characters that are not parentheses. The |
| third part is a conditional subpattern that tests whether or not the |
| first set of parentheses matched. If they did, that is, if the subject |
| started with an opening parenthesis, the condition is true, and so the |
| yes-pattern is executed and a closing parenthesis is required. Other- |
| wise, since no-pattern is not present, the subpattern matches nothing. |
| In other words, this pattern matches a sequence of non-parentheses, |
| optionally enclosed in parentheses. |
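| |
| This pattern can be tried with the sketch below, which uses Python's re |
| module as a stand-in for PCRE2 (an assumption). re.VERBOSE plays the |
| role of PCRE2_EXTENDED, and the named variant uses Python's own |
| (?P<name>...) and (?(name)...) spellings; the helper names are invented |
| for the example. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2. re.VERBOSE ignores the spaces.
by_number = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# Same idea with a named group, in Python's spelling.
by_name = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

def balanced(pattern, text):
    """True if text is a run of non-parentheses, optionally in parentheses."""
    return pattern.fullmatch(text) is not None

for text in ("abc", "(abc)", "(abc", "abc)"):
    print(text, balanced(by_number, text), balanced(by_name, text))
```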
| |
| If you were embedding this pattern in a larger one, you could use a |
| relative reference: |
| |
| ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| |
| This makes the fragment independent of the parentheses in the larger |
| pattern. |
| |
| Checking for a used subpattern by name |
| |
| Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| used subpattern by name. For compatibility with earlier versions of |
| PCRE1, which had this facility before Perl, the syntax (?(name)...) is |
| also recognized. |
| |
| Rewriting the above example to use a named subpattern gives this: |
| |
| (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| |
| If the name used in a condition of this kind is a duplicate, the test |
| is applied to all subpatterns of the same name, and is true if any one |
| of them has matched. |
| |
| Checking for pattern recursion |
| |
| If the condition is the string (R), and there is no subpattern with the |
| name R, the condition is true if a recursive call to the whole pattern |
| or any subpattern has been made. If digits or a name preceded by amper- |
| sand follow the letter R, for example: |
| |
| (?(R3)...) or (?(R&name)...) |
| |
| the condition is true if the most recent recursion is into a subpattern |
| whose number or name is given. This condition does not check the entire |
| recursion stack. If the name used in a condition of this kind is a |
| duplicate, the test is applied to all subpatterns of the same name, and |
| is true if any one of them is the most recent recursion. |
| |
| At "top level", all these recursion test conditions are false. The |
| syntax for recursive patterns is described below. |
| |
| Defining subpatterns for use by reference only |
| |
| If the condition is the string (DEFINE), and there is no subpattern |
| with the name DEFINE, the condition is always false. In this case, |
| there may be only one alternative in the subpattern. It is always |
| skipped if control reaches this point in the pattern; the idea of |
| DEFINE is that it can be used to define subroutines that can be refer- |
| enced from elsewhere. (The use of subroutines is described below.) For |
| example, a pattern to match an IPv4 address such as "192.168.23.245" |
| could be written like this (ignore white space and line breaks): |
| |
| (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| \b (?&byte) (\.(?&byte)){3} \b |
| |
| The first part of the pattern is a DEFINE group inside which another |
| group named "byte" is defined. This matches an individual component of |
| an IPv4 address (a number less than 256). When matching takes place, |
| this part of the pattern is skipped because DEFINE acts like a false |
| condition. The rest of the pattern uses references to the named group |
| to match the four dot-separated components of an IPv4 address, insist- |
| ing on a word boundary at each end. |
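| |
| For comparison, here is a sketch of the same IPv4 pattern in Python, |
| whose re module has neither DEFINE nor subroutine calls. The reuse is |
| approximated by interpolating a string fragment into the pattern, which |
| is a different technique from DEFINE and is shown only to make the |
| structure of the pattern concrete. BYTE and IPV4 are invented names. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2. String interpolation replaces
# the (?(DEFINE)...) group and the (?&byte) subroutine calls.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Doubled braces produce the literal {3} quantifier inside the f-string.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

def is_ipv4(text):
    return IPV4.fullmatch(text) is not None

print(is_ipv4("192.168.23.245"), is_ipv4("256.1.1.1"), is_ipv4("1.2.3"))
```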
| |
| Checking the PCRE2 version |
| |
| Programs that link with a PCRE2 library can check the version by call- |
| ing pcre2_config() with appropriate arguments. Users of applications |
| that do not have access to the underlying code cannot do this. A spe- |
| cial "condition" called VERSION exists to allow such users to discover |
| which version of PCRE2 they are dealing with by using this condition to |
| match a string such as "yesno". VERSION must be followed either by "=" |
| or ">=" and a version number. For example: |
| |
| (?(VERSION>=10.4)yes|no) |
| |
| This pattern matches "yes" if the PCRE2 version is greater than or equal to |
| 10.4, or "no" otherwise. The fractional part of the version number may |
| not contain more than two digits. |
| |
| Assertion conditions |
| |
| If the condition is not in any of the above formats, it must be an |
| assertion. This may be a positive or negative lookahead or lookbehind |
| assertion. Consider this pattern, again containing non-significant |
| white space, and with the two alternatives on the second line: |
| |
| (?(?=[^a-z]*[a-z]) |
| \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| |
| The condition is a positive lookahead assertion that matches an |
| optional sequence of non-letters followed by a letter. In other words, |
| it tests for the presence of at least one letter in the subject. If a |
| letter is found, the subject is matched against the first alternative; |
| otherwise it is matched against the second. This pattern matches |
| strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| letters and dd are digits. |
| |
| |
| COMMENTS |
| |
| There are two ways of including comments in patterns that are processed |
| by PCRE2. In both cases, the start of the comment must not be in a |
| character class, nor in the middle of any other sequence of related |
| characters such as (?: or a subpattern name or number. The characters |
| that make up a comment play no part in the pattern matching. |
| |
| The sequence (?# marks the start of a comment that continues up to the |
| next closing parenthesis. Nested parentheses are not permitted. If the |
| PCRE2_EXTENDED option is set, an unescaped # character also introduces |
| a comment, which in this case continues to immediately after the next |
| newline character or character sequence in the pattern. Which charac- |
| ters are interpreted as newlines is controlled by an option passed to |
| the compiling function or by a special sequence at the start of the |
| pattern, as described in the section entitled "Newline conventions" |
| above. Note that the end of this type of comment is a literal newline |
| sequence in the pattern; escape sequences that happen to represent a |
| newline do not count. For example, consider this pattern when |
| PCRE2_EXTENDED is set, and the default newline convention (a single |
| linefeed character) is in force: |
| |
| abc #comment \n still comment |
| |
| On encountering the # character, pcre2_compile() skips along, looking |
| for a newline in the pattern. The sequence \n is still literal at this |
| stage, so it does not terminate the comment. Only an actual character |
| with the code value 0x0a (the default newline) does so. |
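| |
| The same distinction can be observed in Python's re module, used here |
| as a stand-in for PCRE2 (an assumption), with re.VERBOSE in place of |
| PCRE2_EXTENDED. In the first pattern the raw string keeps \n as a two- |
| character escape; in the second, the ordinary string literal puts a |
| real linefeed into the pattern. |
| |
```python
import re

# Stand-in engine: Python's re, not PCRE2.

# Raw string: the pattern contains backslash + "n", which does not end
# the comment, so the effective pattern is just "abc".
escape_in_comment = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: the pattern contains an actual linefeed, which ends
# the comment, so the effective pattern is "abcstill".
newline_in_pattern = re.compile("abc #comment \n still", re.VERBOSE)

# The (?#...) form works without the extended option.
inline = re.compile(r"ab(?#inline comment)c")

print(bool(escape_in_comment.fullmatch("abc")))
print(bool(newline_in_pattern.fullmatch("abcstill")))
print(bool(inline.fullmatch("abc")))
```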
| |
| |
| RECURSIVE PATTERNS |
| |
| Consider the problem of matching a string in parentheses, allowing for |
| unlimited nested parentheses. Without the use of recursion, the best |
| that can be done is to use a pattern that matches up to some fixed |
| depth of nesting. It is not possible to handle an arbitrary nesting |
| depth. |
| |
| For some time, Perl has provided a facility that allows regular expres- |
| sions to recurse (amongst other things). It does this by interpolating |
| Perl code in the expression at run time, and the code can refer to the |
| expression itself. A Perl pattern using code interpolation to solve the |
| parentheses problem can be created like this: |
| |
| $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
| |
| The (?p{...}) item interpolates Perl code at run time, and in this case |
| refers recursively to the pattern in which it appears. |
| |
| Obviously, PCRE2 cannot support the interpolation of Perl code. |
| Instead, it supports special syntax for recursion of the entire pat- |
| tern, and also for individual subpattern recursion. After its introduc- |
| tion in PCRE1 and Python, this kind of recursion was subsequently |
| introduced into Perl at release 5.10. |
| |
| A special item that consists of (? followed by a number greater than |
| zero and a closing parenthesis is a recursive subroutine call of the |
| subpattern of the given number, provided that it occurs inside that |
| subpattern. (If not, it is a non-recursive subroutine call, which is |
| described in the next section.) The special item (?R) or (?0) is a |
| recursive call of the entire regular expression. |
| |
| This PCRE2 pattern solves the nested parentheses problem (assume the |
| PCRE2_EXTENDED option is set so that white space is ignored): |
| |
| \( ( [^()]++ | (?R) )* \) |
| |
| First it matches an opening parenthesis. Then it matches any number of |
| substrings which can either be a sequence of non-parentheses, or a |
| recursive match of the pattern itself (that is, a correctly parenthe- |
| sized substring). Finally there is a closing parenthesis. Note the use |
| of a possessive quantifier to avoid backtracking into sequences of non- |
| parentheses. |
| |
| If this were part of a larger pattern, you would not want to recurse |
| the entire pattern, so instead you could use this: |
| |
| ( \( ( [^()]++ | (?1) )* \) ) |
| |
| We have put the pattern into parentheses, and caused the recursion to |
| refer to them instead of the whole pattern. |
| |
| In a larger pattern, keeping track of parenthesis numbers can be |
| tricky. This is made easier by the use of relative references. Instead |
| of (?1) in the pattern above you can write (?-2) to refer to the second |
| most recently opened parentheses preceding the recursion. In other |
| words, a negative number counts capturing parentheses leftwards from |
| the point at which it is encountered. |
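| |
| The arithmetic behind a relative reference can be sketched as follows. |
| This is a Python illustration, not part of PCRE2; the function name and |
| its arguments are invented here. The positive form counts rightwards |
| and is described later in this section. |
| |
```python
def absolute_group(relative, groups_opened):
    """Convert a relative reference such as -2 or +1 to a group number.

    groups_opened is the highest capturing group number that has been
    opened to the left of the reference.
    """
    if relative < 0:
        number = groups_opened + relative + 1   # count leftwards
    elif relative > 0:
        number = groups_opened + relative       # count rightwards
    else:
        raise ValueError("a relative reference cannot be zero")
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number
```
| |
| In the pattern above, two capturing groups have been opened before the |
| reference, so absolute_group(-2, 2) gives 1, the same group as (?1). |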
| |
| Be aware, however, that if duplicate subpattern numbers are in use, |
| relative references refer to the earliest subpattern with the |
| appropriate number. Consider, for example: |
| |
| (?|(a)|(b)) (c) (?-2) |
| |
| The first two capturing groups (a) and (b) are both numbered 1, and |
| group (c) is number 2. When the reference (?-2) is encountered, the |
| second most recently opened parentheses has the number 1, but it is the |
| first such group (the (a) group) to which the recursion refers. This |
| would be the same if an absolute reference (?1) was used. In other |
| words, relative references are just a shorthand for computing a group |
| number. |
| |
| It is also possible to refer to subsequently opened parentheses, by |
| writing references such as (?+2). However, these cannot be recursive |
| because the reference is not inside the parentheses that are refer- |
| enced. They are always non-recursive subroutine calls, as described in |
| the next section. |
| |
| An alternative approach is to use named parentheses. The Perl syntax |
| for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup- |
| ported. We could rewrite the above example as follows: |
| |
| (?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
| |
| If there is more than one subpattern with the same name, the earliest |
| one is used. |
| |
| The example pattern that we have been looking at contains nested unlim- |
| ited repeats, and so the use of a possessive quantifier for matching |
| strings of non-parentheses is important when applying the pattern to |
| strings that do not match. For example, when this pattern is applied to |
| |
| (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| |
| it yields "no match" quickly. However, if a possessive quantifier is |
| not used, the match runs for a very long time indeed because there are |
| so many different ways the + and * repeats can carve up the subject, |
| and all have to be tested before failure can be reported. |
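| |
| To get a feel for the numbers, consider only the run of non-parenthe- |
| ses. With a + repeat inside a * repeat, every way of cutting the run |
| into consecutive non-empty pieces is a different way for the repeats to |
| match it. The Python sketch below (count_splits is a name invented for |
| this illustration) counts those ways. It shows the growth rate only and |
| is not a model of the steps PCRE2 actually takes. |
| |
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1                       # nothing left: one complete split
    # choose the length of the first piece, then split the remainder
    return sum(count_splits(n - first) for first in range(1, n + 1))
```
| |
| The count doubles with each additional character (it is 2 to the power |
| n-1 for a run of n characters), so a run of a few dozen characters |
| already allows an impractically large number of splits. |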
| |
| At the end of a match, the values of capturing parentheses are those |
| from the outermost level. If you want to obtain intermediate values, a |
| callout function can be used (see below and the pcre2callout documenta- |
| tion). If the pattern above is matched against |
| |
| (ab(cd)ef) |
| |
| the value for the inner capturing parentheses (numbered 2) is "ef", |
| which is the last value taken on at the top level. If a capturing sub- |
| pattern is not matched at the top level, its final captured value is |
| unset, even if it was (temporarily) set at a deeper level during the |
| matching process. |
| |
| If there are more than 15 capturing parentheses in a pattern, PCRE2 has |
| to obtain extra memory from the heap to store data during a recursion. |
| If no memory can be obtained, the match fails with the |
| PCRE2_ERROR_NOMEMORY error. |
| |
| Do not confuse the (?R) item with the condition (R), which tests for |
| recursion. Consider this pattern, which matches text in angle brack- |
| ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| brackets (that is, when recursing), whereas any characters are permit- |
| ted at the outer level. |
| |
| < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| |
| In this pattern, (?(R) is the start of a conditional subpattern, with |
| two different alternatives for the recursive and non-recursive cases. |
| The (?R) item is the actual recursive call. |
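| |
| The roles of the two items can be seen in the following Python model of |
| the pattern. It is a sketch under simplifying assumptions, not PCRE2 |
| code: match_angle is an invented name, digits are taken to be the ASCII |
| digits, and only one match attempt at a given offset is modelled. The |
| recursing argument plays the part of the (R) condition, and the nested |
| call plays the part of (?R). |
| |
```python
def match_angle(subject, pos=0, recursing=False):
    r"""Model of < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > for one attempt.

    Returns the offset just past the closing ">", or None.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while True:
        start = pos
        # (?(R) \d++ | [^<>]*+) : the condition selects what a run may hold
        while pos < len(subject) and subject[pos] not in "<>" and (
                not recursing or subject[pos] in "0123456789"):
            pos += 1
        # (?R) : a nested bracket is always matched in recursing mode
        if pos < len(subject) and subject[pos] == "<":
            inner_end = match_angle(subject, pos, recursing=True)
            if inner_end is None:
                return None
            pos = inner_end
        if pos == start:
            break                        # neither alternative consumed anything
    if pos < len(subject) and subject[pos] == ">":
        return pos + 1
    return None
```
| |
| In this model "<abc<123>def>" is accepted, but "<abc<12x>def>" is not, |
| because the "x" appears inside a nested bracket. |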
| |
| Differences in recursion processing between PCRE2 and Perl |
| |
| Recursion processing in PCRE2 differs from Perl in two important ways. |
| In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is |
| always treated as an atomic group. That is, once it has matched some of |
| the subject string, it is never re-entered, even if it contains untried |
| alternatives and there is a subsequent matching failure. This can be |
| illustrated by the following pattern, which purports to match a palin- |
| dromic string that contains an odd number of characters (for example, |
| "a", "aba", "abcba", "abcdcba"): |
| |
| ^(.|(.)(?1)\2)$ |
| |
| The idea is that it either matches a single character, or two identical |
| characters surrounding a sub-palindrome. In Perl, this pattern works; |
| in PCRE2 it does not if the subject is longer than three characters. |
| Consider the subject string "abcba": |
| |
| At the top level, the first character is matched, but as it is not at |
| the end of the string, the first alternative fails; the second alterna- |
| tive is taken and the recursion kicks in. The recursive call to subpat- |
| tern 1 successfully matches the next character ("b"). (Note that the |
| beginning and end of line tests are not part of the recursion). |
| |
| Back at the top level, the next character ("c") is compared with what |
| subpattern 2 matched, which was "a". This fails. Because the recursion |
| is treated as an atomic group, there are now no backtracking points, |
| and so the entire match fails. (Perl is able, at this point, to re- |
| enter the recursion and try the second alternative.) However, if the |
| pattern is written with the alternatives in the other order, things are |
| different: |
| |
| ^((.)(?1)\2|.)$ |
| |
| This time, the recursing alternative is tried first, and continues to |
| recurse until it runs out of characters, at which point the recursion |
| fails. But this time we do have another alternative to try at the |
| higher level. That is the big difference: in the previous case the |
| remaining alternative is at a deeper recursion level, which PCRE2 can- |
| not use. |
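| |
| The effect of atomic recursion on these two patterns can be reproduced |
| with a small Python model. It is a sketch for illustration, not PCRE2 |
| or Perl code, and the names group1 and matches are invented. The gener- |
| ator yields, in the order a backtracking matcher would try them, the |
| offsets at which group 1 can end. Treating a recursive call as atomic |
| means keeping only the first offset that the call yields. |
| |
```python
import itertools

def group1(s, pos, atomic, recurse_first):
    """Yield each offset at which group 1 can end, in the order tried."""
    def single():                        # the "." alternative
        if pos < len(s):
            yield pos + 1

    def wrapped():                       # the "(.)(?1)\\2" alternative
        if pos < len(s):
            c = s[pos]                   # what group 2 captures at this level
            inner = group1(s, pos + 1, atomic, recurse_first)
            if atomic:
                inner = itertools.islice(inner, 1)   # never re-entered
            for end in inner:
                if end < len(s) and s[end] == c:     # the back reference
                    yield end + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()

def matches(s, atomic, recurse_first):
    """Model ^( ... )$ : group 1 must end exactly at the end of s."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))
```
| |
| For the subject "abcba", matches() returns False with atomic=True and |
| recurse_first=False (the first pattern under atomic recursion), but |
| True when either atomic is False or recurse_first is True. |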
| |
| To change the pattern so that it matches all palindromic strings, not |
| just those with an odd number of characters, it is tempting to change |
| the pattern to this: |
| |
| ^((.)(?1)\2|.?)$ |
| |
| Again, this works in Perl, but not in PCRE2, and for the same reason. |
| When a deeper recursion has matched a single character, it cannot be |
| entered again in order to match an empty string. The solution is to |
| separate the two cases, and write out the odd and even cases as alter- |
| natives at the higher level: |
| |
| ^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
| |
| If you want to match typical palindromic phrases, the pattern has to |
| ignore all non-word characters, which can be done like this: |
| |
| ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
| |
| If run with the PCRE2_CASELESS option, this pattern matches phrases |
| such as "A man, a plan, a canal: Panama!" and it works in both PCRE2 |
| and Perl. Note the use of the possessive quantifier *+ to avoid back- |
| tracking into sequences of non-word characters. Without this, PCRE2 |
| takes a great deal longer (ten times or more) to match typical phrases, |
| and Perl takes so long that you think it has gone into a loop. |
| |
| WARNING: The palindrome-matching patterns above work only if the sub- |
| ject string does not start with a palindrome that is shorter than the |
| entire string. For example, although "abcba" is correctly matched, if |
| the subject is "ababa", PCRE2 finds the palindrome "aba" at the start, |
| then fails at top level because the end of the string does not follow. |
| Once again, it cannot jump back into the recursion to try other alter- |
| natives, so the entire match fails. |
| |
| The second way in which PCRE2 and Perl differ in their recursion pro- |
| cessing is in the handling of captured values. In Perl, when a subpat- |
| tern is called recursively or as a subroutine (see the next section), |
| it has no access to any values that were captured outside the recur- |
| sion, whereas in PCRE2 these values can be referenced. Consider this |
| pattern: |
| |
| ^(.)(\1|a(?2)) |
| |
| In PCRE2, this pattern matches "bab". The first capturing parentheses |
| match "b", then in the second group, when the back reference \1 fails |
| to match "b", the second alternative matches "a" and then recurses. In |
| the recursion, \1 does now match "b" and so the whole match succeeds. |
| In Perl, the pattern fails to match because inside the recursive call |
| \1 cannot access the externally set value. |
| |
| |
| SUBPATTERNS AS SUBROUTINES |
| |
| If the syntax for a recursive subpattern call (either by number or by |
| name) is used outside the parentheses to which it refers, it operates |
| like a subroutine in a programming language. The called subpattern may |
| be defined before or after the reference. A numbered reference can be |
| absolute or relative, as in these examples: |
| |
| (...(absolute)...)...(?2)... |
| (...(relative)...)...(?-1)... |
| (...(?+1)...(relative)... |
| |
| An earlier example pointed out that the pattern |
| |
| (sens|respons)e and \1ibility |
| |
| matches "sense and sensibility" and "response and responsibility", but |
| not "sense and responsibility". If instead the pattern |
| |
| (sens|respons)e and (?1)ibility |
| |
| is used, it does match "sense and responsibility" as well as the other |
| two strings. Another example is given in the discussion of DEFINE |
| above. |
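| |
| The difference between the back reference and the subroutine call can |
| be tried out in Python. Note the assumption: Python's re module has no |
| (?1) syntax, so the subroutine call is emulated below by writing the |
| subpattern of group 1 out a second time. This reproduces the matching |
| behaviour for this example only; it does not reproduce the atomic |
| treatment of subroutine calls described next. |
| |
```python
import re

# A back reference must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulation of (?1): the subpattern of group 1 is written out again as a
# non-capturing group, so either alternative may match the second time.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

def results(subject):
    """Return (back reference matched?, subroutine-style matched?)."""
    return (backref.fullmatch(subject) is not None,
            subroutine_like.fullmatch(subject) is not None)
```
| |
| Here results("sense and responsibility") is (False, True), whereas the |
| other two strings give (True, True). |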
| |
| All subroutine calls, whether recursive or not, are always treated as |
| atomic groups. That is, once a subroutine has matched some of the sub- |
| ject string, it is never re-entered, even if it contains untried alter- |
| natives and there is a subsequent matching failure. Any capturing |
| parentheses that are set during the subroutine call revert to their |
| previous values afterwards. |
| |
| Processing options such as case-independence are fixed when a subpat- |
| tern is defined, so if it is used as a subroutine, such options cannot |
| be changed for different calls. For example, consider this pattern: |
| |
| (abc)(?i:(?-1)) |
| |
| It matches "abcabc". It does not match "abcABC" because the change of |
| processing option does not affect the called subpattern. |
| |
| |
| ONIGURUMA SUBROUTINE SYNTAX |
| |
| For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| name or a number enclosed either in angle brackets or single quotes, is |
| an alternative syntax for referencing a subpattern as a subroutine, |
| possibly recursively. Here are two of the examples used above, rewrit- |
| ten using this syntax: |
| |
| (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| (sens|respons)e and \g'1'ibility |
| |
| PCRE2 supports an extension to Oniguruma: if a number is preceded by a |
| plus or a minus sign it is taken as a relative reference. For example: |
| |
| (abc)(?i:\g<-1>) |
| |
| Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
| synonymous. The former is a back reference; the latter is a subroutine |
| call. |
| |
| |
| CALLOUTS |
| |
| Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| Perl code to be obeyed in the middle of matching a regular expression. |
| This makes it possible, amongst other things, to extract different sub- |
| strings that match the same pair of parentheses when there is a repeti- |
| tion. |
| |
| PCRE2 provides a similar feature, but of course it cannot obey arbi- |
| trary Perl code. The feature is called "callout". The caller of PCRE2 |
| provides an external function by putting its entry point in a match |
| context using the function pcre2_set_callout(), and then passing that |
| context to pcre2_match() or pcre2_dfa_match(). If no match context is |
| passed, or if the callout entry point is set to NULL, callouts are dis- |
| abled. |
| |
| Within a regular expression, (?C<arg>) indicates a point at which the |
| external function is to be called. There are two kinds of callout: |
| those with a numerical argument and those with a string argument. (?C) |
| on its own with no argument is treated as (?C0). A numerical argument |
| allows the application to distinguish between different callouts. |
| String arguments were added for release 10.20 to make it possible for |
| script languages that use PCRE2 to embed short scripts within patterns |
| in a similar way to Perl. |
| |
| During matching, when PCRE2 reaches a callout point, the external func- |
| tion is called. It is provided with the number or string argument of |
| the callout, the position in the pattern, and one item of data that is |
| also set in the match block. The callout function may cause matching to |
| proceed, to backtrack, or to fail. |
| |
| By default, PCRE2 implements a number of optimizations at matching |
| time, and one side-effect is that sometimes callouts are skipped. If |
| you need all possible callouts to happen, you need to set options that |
| disable the relevant optimizations. More details, including a complete |
| description of the programming interface to the callout function, are |
| given in the pcre2callout documentation. |
| |
| Callouts with numerical arguments |
| |
| If you just want to have a means of identifying different callout |
| points, put a number less than 256 after the letter C. For example, |
| this pattern has two callout points: |
| |
| (?C1)abc(?C2)def |
| |
| If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical |
| callouts are automatically installed before each item in the pattern. |
| They are all numbered 255. If there is a conditional group in the pat- |
| tern whose condition is an assertion, an additional callout is inserted |
| just before the condition. An explicit callout may also be set at this |
| position, as in this example: |
| |
| (?(?C9)(?=a)abc|def) |
| |
| Note that this applies only to assertion conditions, not to other types |
| of condition. |
| |
| Callouts with string arguments |
| |
| A delimited string may be used instead of a number as a callout argu- |
| ment. The starting delimiter must be one of ` ' " ^ % # $ { and the |
| ending delimiter is the same as the start, except for {, where the end- |
| ing delimiter is }. If the ending delimiter is needed within the |
| string, it must be doubled. For example: |
| |
| (?C'ab ''c'' d')xyz(?C{any text})pqr |
| |
| The doubling is removed before the string is passed to the callout |
| function. |
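| |
| The delimiter and doubling rules can be sketched in Python as follows. |
| This is an illustration of the rules as stated above, not PCRE2's own |
| parser; parse_callout_string is an invented name, and its input is the |
| pattern text that immediately follows "(?C". |
| |
```python
START_DELIMITERS = "`'\"^%#${"

def parse_callout_string(text):
    """Return (argument, rest) for a delimited callout string.

    The argument has doubled ending delimiters reduced to single ones;
    rest is the text that follows the ending delimiter.
    """
    if not text or text[0] not in START_DELIMITERS:
        raise ValueError("not a valid starting delimiter")
    end = "}" if text[0] == "{" else text[0]
    out = []
    i = 1
    while i < len(text):
        if text[i] == end:
            if i + 1 < len(text) and text[i + 1] == end:
                out.append(end)          # doubled delimiter: keep one copy
                i += 2
                continue
            return "".join(out), text[i + 1:]
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```
| |
| Applied to the first callout in the example, the argument that results |
| is the string ab 'c' d with single quotes around the c. |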
| |
| |
| BACKTRACKING CONTROL |
| |
| Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
| which are still described in the Perl documentation as "experimental |
| and subject to change or removal in a future version of Perl". It goes |
| on to say: "Their usage in production code should be noted to avoid |
| problems during upgrades." The same remarks apply to the PCRE2 features |
| described in this section. |
| |
| The new verbs make use of what was previously invalid syntax: an open- |
| ing parenthesis followed by an asterisk. They are generally of the form |
| (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving |
| differently depending on whether or not a name is present. |
| |
| By default, for compatibility with Perl, a name is any sequence of |
| characters that does not include a closing parenthesis. The name is not |
| processed in any way, and it is not possible to include a closing |
| parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES option is |
| set, normal backslash processing is applied to verb names and only an |
| unescaped closing parenthesis terminates the name. A closing parenthe- |
| sis can be included in a name either as \) or between \Q and \E. If the |
| PCRE2_EXTENDED option is set, unescaped whitespace in verb names is |
| skipped and #-comments are recognized, exactly as in the rest of the |
| pattern. |
| |
| The maximum length of a name is 255 in the 8-bit library and 65535 in |
| the 16-bit and 32-bit libraries. If the name is empty, that is, if the |
| closing parenthesis immediately follows the colon, the effect is as if |
| the colon were not there. Any number of these verbs may occur in a pat- |
| tern. |
| |
| Since these verbs are specifically related to backtracking, most of |
| them can be used only when the pattern is to be matched using the tra- |
| ditional matching function, because it uses a backtracking algorithm. |
| With the exception of (*FAIL), which behaves like a failing negative |
| assertion, the backtracking control verbs cause an error if encountered |
| by the DFA matching function. |
| |
| The behaviour of these verbs in repeated groups, assertions, and in |
| subpatterns called as subroutines (whether or not recursively) is docu- |
| mented below. |
| |
| Optimizations that affect backtracking verbs |
| |
| PCRE2 contains some optimizations that are used to speed up matching by |
| running some checks at the start of each match attempt. For example, it |
| may know the minimum length of matching subject, or that a particular |
| character must be present. When one of these optimizations bypasses the |
| running of a match, any included backtracking verbs will not, of |
| course, be processed. You can suppress the start-of-match optimizations |
| by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- |
| pile(), or by starting the pattern with (*NO_START_OPT). There is more |
| discussion of this option in the section entitled "Compiling a pattern" |
| in the pcre2api documentation. |
| |
| Experiments with Perl suggest that it too has similar optimizations, |
| sometimes leading to anomalous results. |
| |
| Verbs that act immediately |
| |
| The following verbs act as soon as they are encountered. They may not |
| be followed by a name. |
| |
| (*ACCEPT) |
| |
| This verb causes the match to end successfully, skipping the remainder |
| of the pattern. However, when it is inside a subpattern that is called |
| as a subroutine, only that subpattern is ended successfully. Matching |
| then continues at the outer level. If (*ACCEPT) is triggered in a posi- |
| tive assertion, the assertion succeeds; in a negative assertion, the |
| assertion fails. |
| |
| If (*ACCEPT) is inside capturing parentheses, the data so far is cap- |
| tured. For example: |
| |
| A((?:A|B(*ACCEPT)|C)D) |
| |
| This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| tured by the outer parentheses. |
| |
| (*FAIL) or (*F) |
| |
| This verb causes a matching failure, forcing backtracking to occur. It |
| is equivalent to (?!) but easier to read. The Perl documentation notes |
| that it is probably useful only when combined with (?{}) or (??{}). |
| Those are, of course, Perl features that are not present in PCRE2. The |
| nearest equivalent is the callout feature, as for example in this pat- |
| tern: |
| |
| a+(?C)(*FAIL) |
| |
| A match with the string "aaaa" always fails, but the callout is taken |
| before each backtrack happens (in this example, 10 times). |
| |
| Recording which path was taken |
| |
| There is one verb whose main purpose is to track how a match was |
| arrived at, though it also has a secondary use in conjunction with |
| advancing the match starting point (see (*SKIP) below). |
| |
| (*MARK:NAME) or (*:NAME) |
| |
| A name is always required with this verb. There may be as many |
| instances of (*MARK) as you like in a pattern, and their names do not |
| have to be unique. |
| |
| When a match succeeds, the name of the last-encountered (*MARK:NAME), |
| (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to |
| the caller as described in the section entitled "Other information |
| about the match" in the pcre2api documentation. Here is an example of |
| pcre2test output, where the "mark" modifier requests the retrieval and |
| outputting of (*MARK) data: |
| |
| re> /X(*MARK:A)Y|X(*MARK:B)Z/mark |
| data> XY |
| 0: XY |
| MK: A |
| XZ |
| 0: XZ |
| MK: B |
| |
| The (*MARK) name is tagged with "MK:" in this output, and in this exam- |
| ple it indicates which of the two alternatives matched. This is a more |
| efficient way of obtaining this information than putting each alterna- |
| tive in its own capturing parentheses. |
| |
| If a verb with a name is encountered in a positive assertion that is |
| true, the name is recorded and passed back if it is the last-encoun- |
| tered. This does not happen for negative assertions or failing positive |
| assertions. |
| |
| After a partial match or a failed match, the last encountered name in |
| the entire match process is returned. For example: |
| |
| re> /X(*MARK:A)Y|X(*MARK:B)Z/mark |
| data> XP |
| No match, mark = B |
| |
| Note that in this unanchored example the mark is retained from the |
| match attempt that started at the letter "X" in the subject. Subsequent |
| match attempts starting at "P" and then with an empty string do not get |
| as far as the (*MARK) item, but nevertheless do not reset it. |
| |
| If you are interested in (*MARK) values after failed matches, you |
| should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to |
| ensure that the match is always attempted. |
| |
| Verbs that act after backtracking |
| |
| The following verbs do nothing when they are encountered. Matching con- |
| tinues with what follows, but if there is no subsequent match, causing |
| a backtrack to the verb, a failure is forced. That is, backtracking |
| cannot pass to the left of the verb. However, when one of these verbs |
| appears inside an atomic group (which includes any group that is called |
| as a subroutine) or in an assertion that is true, its effect is con- |
| fined to that group, because once the group has been matched, there is |
| never any backtracking into it. In this situation, backtracking has to |
| jump to the left of the entire atomic group or assertion. |
| |
| These verbs differ in exactly what kind of failure occurs when back- |
| tracking reaches them. The behaviour described below is what happens |
| when the verb is not in a subroutine or an assertion. Subsequent sec- |
| tions cover these special cases. |
| |
| (*COMMIT) |
| |
| This verb, which may not be followed by a name, causes the whole match |
| to fail outright if there is a later matching failure that causes back- |
| tracking to reach it. Even if the pattern is unanchored, no further |
| attempts to find a match by advancing the starting point take place. If |
| (*COMMIT) is the only backtracking verb that is encountered, once it |
| has been passed pcre2_match() is committed to finding a match at the |
| current starting point, or not at all. For example: |
| |
| a+(*COMMIT)b |
| |
| This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| of dynamic anchor, or "I've started, so I must finish." The name of the |
| most recently passed (*MARK) in the path is passed back when (*COMMIT) |
| forces a match failure. |
| |
| If there is more than one backtracking verb in a pattern, a different |
| one that follows (*COMMIT) may be triggered first, so merely passing |
| (*COMMIT) during a match does not always guarantee that a match must be |
| at this starting point. |
| |
| Note that (*COMMIT) at the start of a pattern is not the same as an |
| anchor, unless PCRE2's start-of-match optimizations are turned off, as |
| shown in this output from pcre2test: |
| |
| re> /(*COMMIT)abc/ |
| data> xyzabc |
| 0: abc |
| data> |
| re> /(*COMMIT)abc/no_start_optimize |
| data> xyzabc |
| No match |
| |
| For the first pattern, PCRE2 knows that any match must start with "a", |
| so the optimization skips along the subject to "a" before applying the |
| pattern to the first set of data. The match attempt then succeeds. The |
| second pattern disables the optimization that skips along to the first |
| character. The pattern is now applied starting at "x", and so the |
| (*COMMIT) causes the match to fail without trying any other starting |
| points. |
| |
| (*PRUNE) or (*PRUNE:NAME) |
| |
| This verb causes the match to fail at the current starting position in |
| the subject if there is a later matching failure that causes backtrack- |
| ing to reach it. If the pattern is unanchored, the normal "bumpalong" |
| advance to the next starting character then happens. Backtracking can |
| occur as usual to the left of (*PRUNE), before it is reached, or when |
| matching to the right of (*PRUNE), but if there is no match to the |
| right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
| (*PRUNE) is just an alternative to an atomic group or possessive quan- |
| tifier, but there are some uses of (*PRUNE) that cannot be expressed in |
| any other way. In an anchored pattern (*PRUNE) has the same effect as |
| (*COMMIT). |
| |
| The behaviour of (*PRUNE:NAME) is not the same as |
| (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is |
| remembered for passing back to the caller. However, (*SKIP:NAME) |
| searches only for names set with (*MARK), ignoring those set by |
| (*PRUNE) or (*THEN). |
| |
| (*SKIP) |
| |
| This verb, when given without a name, is like (*PRUNE), except that if |
| the pattern is unanchored, the "bumpalong" advance is not to the next |
| character, but to the position in the subject where (*SKIP) was encoun- |
| tered. (*SKIP) signifies that whatever text was matched leading up to |
| it cannot be part of a successful match. Consider: |
| |
| a+(*SKIP)b |
| |
| If the subject is "aaaac...", after the first match attempt fails |
| (starting at the first character in the string), the starting point |
| skips on to start the next attempt at "c". Note that a possessive quan- |
| tifier does not have the same effect as this example; although it would |
| suppress backtracking during the first match attempt, the second |
| attempt would start at the second character instead of skipping on to |
| "c". |
| |
| (*SKIP:NAME) |
| |
| When (*SKIP) has an associated name, its behaviour is modified. When it |
| is triggered, the previous path through the pattern is searched for the |
| most recent (*MARK) that has the same name. If one is found, the |
| "bumpalong" advance is to the subject position that corresponds to that |
| (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with |
| a matching name is found, the (*SKIP) is ignored. |
| |
| Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It |
| ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME). |
| |
| (*THEN) or (*THEN:NAME) |
| |
| This verb causes a skip to the next innermost alternative when back- |
| tracking reaches it. That is, it cancels any further backtracking |
| within the current alternative. Its name comes from the observation |
| that it can be used for a pattern-based if-then-else block: |
| |
| ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| |
| If the COND1 pattern matches, FOO is tried (and possibly further items |
| after the end of the group if FOO succeeds); on failure, the matcher |
| skips to the second alternative and tries COND2, without backtracking |
| into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- |
| quently BAZ fails, there are no more alternatives, so there is a back- |
| track to whatever came before the entire group. If (*THEN) is not |
| inside an alternation, it acts like (*PRUNE). |
| |
| The behaviour of (*THEN:NAME) is not the same as |
| (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is |
| remembered for passing back to the caller. However, (*SKIP:NAME) |
| searches only for names set with (*MARK), ignoring those set by |
| (*PRUNE) and (*THEN). |
| |
| A subpattern that does not contain a | character is just a part of the |
| enclosing alternative; it is not a nested alternation with only one |
| alternative. The effect of (*THEN) extends beyond such a subpattern to |
| the enclosing alternative. Consider this pattern, where A, B, etc. are |
| complex pattern fragments that do not contain any | characters at this |
| level: |
| |
| A (B(*THEN)C) | D |
| |
| If A and B are matched, but there is a failure in C, matching does not |
| backtrack into A; instead it moves to the next alternative, that is, D. |
| However, if the subpattern containing (*THEN) is given an alternative, |
| it behaves differently: |
| |
| A (B(*THEN)C | (*FAIL)) | D |
| |
| The effect of (*THEN) is now confined to the inner subpattern. After a |
| failure in C, matching moves to (*FAIL), which causes the whole subpat- |
| tern to fail because there are no more alternatives to try. In this |
| case, matching does now backtrack into A. |
| |
| Note that a conditional subpattern is not considered as having two |
| alternatives, because only one is ever used. In other words, the | |
| character in a conditional subpattern has a different meaning. Ignoring |
| white space, consider: |
| |
| ^.*? (?(?=a) a | b(*THEN)c ) |
| |
| If the subject is "ba", this pattern does not match. Because .*? is |
| ungreedy, it initially matches zero characters. The condition (?=a) |
| then fails, the character "b" is matched, but "c" is not. At this |
| point, matching does not backtrack to .*? as might perhaps be expected |
| from the presence of the | character. The conditional subpattern is |
| part of the single alternative that comprises the whole pattern, and so |
| the match fails. (If there was a backtrack into .*?, allowing it to |
| match "b", the match would succeed.) |
| |
| The verbs just described provide four different "strengths" of control |
| when subsequent matching fails. (*THEN) is the weakest, carrying on the |
| match at the next alternative. (*PRUNE) comes next, failing the match |
| at the current starting position, but allowing an advance to the next |
| character (for an unanchored pattern). (*SKIP) is similar, except that |
| the advance may be more than one character. (*COMMIT) is the strongest, |
| causing the entire match to fail. |
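| |
| The (*PRUNE), (*SKIP), and (*COMMIT) cases can be compared using a |
| Python model of an unanchored search for the pattern a+(*VERB)b. This |
| is a sketch, not PCRE2 code: the function name search is invented, the |
| start-of-match optimizations are assumed to be disabled, and (*THEN) |
| is not modelled because this pattern has no alternation. |
| |
```python
def search(subject, verb=None):
    """Model an unanchored search for a+(*VERB)b.

    verb is None, "PRUNE", "SKIP" or "COMMIT". Returns a tuple
    (match_span_or_None, start_offsets_tried).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                     # a+ is greedy: take the whole run
        if end > start:                  # a+ matched, so the verb was passed
            if end < len(subject) and subject[end] == "b":
                return (start, end + 1), tried
            # Giving back some "a"s can never help, because "b" would then
            # be tested against an "a"; the attempt fails by backtracking
            # onto the verb.
            if verb == "COMMIT":
                return None, tried       # no further starting points
            if verb == "SKIP":
                start = end              # resume where (*SKIP) was passed
                continue
        start += 1                       # normal bumpalong
    return None, tried
```
| |
| For the subject "aaaacab", the model tries starting offsets 0, 4, 5 |
| with "SKIP", but 0, 1, 2, 3, 4, 5 with "PRUNE"; both find the match at |
| offset 5. With "COMMIT" the subject "aacaab" is abandoned after the |
| attempt at offset 0. |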
| |
| More than one backtracking verb |
| |
| If more than one backtracking verb is present in a pattern, the one |
| that is backtracked onto first acts. For example, consider this pat- |
| tern, where A, B, etc. are complex pattern fragments: |
| |
| (A(*COMMIT)B(*THEN)C|ABD) |
| |
| If A matches but B fails, the backtrack to (*COMMIT) causes the entire |
| match to fail. However, if A and B match, but C fails, the backtrack to |
| (*THEN) causes the next alternative (ABD) to be tried. This behaviour |
| is consistent, but is not always the same as Perl's. It means that if |
| two or more backtracking verbs appear in succession, all but the last |
| of them have no effect. Consider this example: |
| |
| ...(*COMMIT)(*PRUNE)... |
| |
| If there is a matching failure to the right, backtracking onto (*PRUNE) |
| causes it to be triggered, and its action is taken. There can never be |
| a backtrack onto (*COMMIT). |
| |
| Backtracking verbs in repeated groups |
| |
| PCRE2 differs from Perl in its handling of backtracking verbs in |
| repeated groups. For example, consider: |
| |
| /(a(*COMMIT)b)+ac/ |
| |
| If the subject is "abac", Perl matches, but PCRE2 fails because the |
| (*COMMIT) in the second repeat of the group acts. |
| |
| Backtracking verbs in assertions |
| |
| (*FAIL) in an assertion has its normal effect: it forces an immediate |
| backtrack. |
| |
| (*ACCEPT) in a positive assertion causes the assertion to succeed with- |
| out any further processing. In a negative assertion, (*ACCEPT) causes |
| the assertion to fail without any further processing. |
| |
| The other backtracking verbs are not treated specially if they appear |
| in a positive assertion. In particular, (*THEN) skips to the next |
| alternative in the innermost enclosing group that has alternations, |
| whether or not this is within the assertion. |
| |
| Negative assertions are, however, different, in order to ensure that |
| changing a positive assertion into a negative assertion changes its |
| result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg- |
| ative assertion to be true, without considering any further alternative |
| branches in the assertion. Backtracking into (*THEN) causes it to skip |
| to the next enclosing alternative within the assertion (the normal be- |
| haviour), but if the assertion does not have such an alternative, |
| (*THEN) behaves like (*PRUNE). |
| |
| Backtracking verbs in subroutines |
| |
| These behaviours occur whether or not the subpattern is called recur- |
| sively. Perl's treatment of subroutines is different in some cases. |
| |
| (*FAIL) in a subpattern called as a subroutine has its normal effect: |
| it forces an immediate backtrack. |
| |
| (*ACCEPT) in a subpattern called as a subroutine causes the subroutine |
| match to succeed without any further processing. Matching then contin- |
| ues after the subroutine call. |
| |
| (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine |
| cause the subroutine match to fail. |
| |
| (*THEN) skips to the next alternative in the innermost enclosing group |
| within the subpattern that has alternatives. If there is no such group |
| within the subpattern, (*THEN) causes the subroutine match to fail. |
| |
| |
| SEE ALSO |
| |
| pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), |
| pcre2(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 13 November 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 PERFORMANCE |
| |
| Two aspects of performance are discussed below: memory usage and pro- |
| cessing time. The way you express your pattern as a regular expression |
| can affect both of them. |
| |
| |
| COMPILED PATTERN MEMORY USAGE |
| |
| Patterns are compiled by PCRE2 into a reasonably efficient interpretive |
| code, so that most simple patterns do not use much memory. However, |
| there is one case where the memory usage of a compiled pattern can be |
| unexpectedly large. If a parenthesized subpattern has a quantifier with |
| a minimum greater than 1 and/or a limited maximum, the whole subpattern |
| is repeated in the compiled code. For example, the pattern |
| |
| (abc|def){2,4} |
| |
| is compiled as if it were |
| |
| (abc|def)(abc|def)((abc|def)(abc|def)?)? |
| |
| (Technical aside: It is done this way so that backtrack points within |
| each of the repetitions can be independently maintained.) |
| |
| For regular expressions whose quantifiers use only small numbers, this |
| is not usually a problem. However, if the numbers are large, and par- |
| ticularly if such repetitions are nested, the memory usage can become |
| an embarrassment. For example, the very simple pattern |
| |
| ((ab){1,1000}c){1,3} |
| |
| uses 51K bytes when compiled using the 8-bit library. When PCRE2 is |
| compiled with its default internal pointer size of two bytes, the size |
| limit on a compiled pattern is 64K code units in the 8-bit and 16-bit |
| libraries, and this is reached with the above pattern if the outer rep- |
| etition is increased from 3 to 4. PCRE2 can be compiled to use larger |
| internal pointers and thus handle larger compiled patterns, but it is |
| better to try to rewrite your pattern to use less memory if you can. |
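| |
| The growth described above can be estimated with simple arithmetic. |
| The following is a minimal sketch of a simplified model, not PCRE2's |
| actual size accounting: it only counts how many copies of a group end |
| up in the compiled code. The function names are hypothetical, and a |
| max of zero is used here to mean "no upper limit". |

```c
/* Hypothetical helper, not part of PCRE2: number of copies of a group
   that the compiled code contains for the quantifier {min,max}.
   In this sketch only, max == 0 means "no upper limit", as in {min,}. */
long group_copies(long min, long max)
{
    if (max > 0)
        return max;            /* limited maximum: one copy per possible repeat */
    return min > 1 ? min : 1;  /* unlimited: min copies, the last one repeated */
}

/* Copies of an inner group that sits inside a repeated outer group:
   every copy of the outer group carries all the inner copies. */
long nested_copies(long inner_min, long inner_max,
                   long outer_min, long outer_max)
{
    return group_copies(inner_min, inner_max) *
           group_copies(outer_min, outer_max);
}
```

| Under this model, group_copies(2, 4) gives 4 for (abc|def){2,4}, and |
| nested_copies(1, 1000, 1, 3) gives 3000 copies of (ab) for the nested |
| pattern above. |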
| |
| One way of reducing the memory usage for such patterns is to make use |
| of PCRE2's "subroutine" facility. Re-writing the above pattern as |
| |
| ((ab)(?2){0,999}c)(?1){0,2} |
| |
| reduces the memory requirements to 18K, and indeed it remains under 20K |
| even with the outer repetition increased to 100. However, this pattern |
| is not exactly equivalent, because the "subroutine" calls are treated |
| as atomic groups into which there can be no backtracking if there is a |
| subsequent matching failure. Therefore, PCRE2 cannot do this kind of |
| rewriting automatically. Furthermore, there is a noticeable loss of |
| speed when executing the modified pattern. Nevertheless, if the atomic |
| grouping is not a problem and the loss of speed is acceptable, this |
| kind of rewriting will allow you to process patterns that PCRE2 cannot |
| otherwise handle. |
| |
| |
| STACK USAGE AT RUN TIME |
| |
| When pcre2_match() is used for matching, certain kinds of pattern can |
| cause it to use large amounts of the process stack. In some environ- |
| ments the default process stack is quite small, and if it runs out the |
| result is often SIGSEGV. Rewriting your pattern can often help. The |
| pcre2stack documentation discusses this issue in detail. |
| |
| |
| PROCESSING TIME |
| |
| Certain items in regular expression patterns are processed more effi- |
| ciently than others. It is more efficient to use a character class like |
| [aeiou] than a set of single-character alternatives such as |
| (a|e|i|o|u). In general, the simplest construction that provides the |
| required behaviour is usually the most efficient. Jeffrey Friedl's book |
| "Mastering Regular Expressions" contains a lot of useful general |
| discussion about optimizing regular expressions for efficient |
| performance. This document contains a few observations about PCRE2. |
| |
| Using Unicode character properties (the \p, \P, and \X escapes) is |
| slow, because PCRE2 has to use a multi-stage table lookup whenever it |
| needs a character's property. If you can find an alternative pattern |
| that does not use character properties, it will probably be faster. |
| |
| By default, the escape sequences \b, \d, \s, and \w, and the POSIX |
| character classes such as [:alpha:] do not use Unicode properties, |
| partly for backwards compatibility, and partly for performance reasons. |
| However, you can set the PCRE2_UCP option or start the pattern with |
| (*UCP) if you want Unicode character properties to be used. This can |
| double the matching time for items such as \d, when matched with |
| pcre2_match(); the performance loss is less with a DFA matching func- |
| tion, and in both cases there is not much difference for \b. |
| |
| When a pattern begins with .* not in atomic parentheses, nor in paren- |
| theses that are the subject of a backreference, and the PCRE2_DOTALL |
| option is set, the pattern is implicitly anchored by PCRE2, since it |
| can match only at the start of a subject string. If the pattern has |
| multiple top-level branches, they must all be anchorable. The optimiza- |
| tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is |
| automatically disabled if the pattern contains (*PRUNE) or (*SKIP). |
| |
| If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, |
| because the dot metacharacter does not then match a newline, and if the |
| subject string contains newlines, the pattern may match from the char- |
| acter immediately following one of them instead of from the very start. |
| For example, the pattern |
| |
| .*second |
| |
| matches the subject "first\nand second" (where \n stands for a newline |
| character), with the match starting at the seventh character. In order |
| to do this, PCRE2 has to retry the match starting after every newline |
| in the subject. |
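| |
| The retry points mentioned above can be listed with a few lines of C. |
| This is only an illustration of the description in the text, not |
| PCRE2's own code, and the function name is hypothetical. |

```c
#include <stddef.h>

/* Hypothetical helper: records the offsets at which a pattern that
   starts with .* (without PCRE2_DOTALL) may have to be tried, namely
   offset 0 and the offset following every newline. Returns the number
   of offsets found; at most max of them are stored in out. */
size_t dotstar_start_points(const char *subject, size_t *out, size_t max)
{
    size_t count = 0;
    size_t i;

    if (count < max)
        out[count] = 0;                 /* the very start of the subject */
    count++;

    for (i = 0; subject[i] != '\0'; i++) {
        if (subject[i] == '\n') {
            if (count < max)
                out[count] = i + 1;     /* character following the newline */
            count++;
        }
    }
    return count;
}
```

| For the subject "first\nand second" this yields the offsets 0 and 6; |
| offset 6 is the seventh character, where the match in the example |
| starts. |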
| |
| If you are using such a pattern with subject strings that do not con- |
| tain newlines, the best performance is obtained by setting |
| PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate |
| explicit anchoring. That saves PCRE2 from having to scan along the sub- |
| ject looking for a newline to restart at. |
| |
| Beware of patterns that contain nested indefinite repeats. These can |
| take a long time to run when applied to a string that does not match. |
| Consider the pattern fragment |
| |
| ^(a+)* |
| |
| This can match "aaaa" in 16 different ways, and this number increases |
| very rapidly as the string gets longer. (The * repeat can match 0, 1, |
| 2, 3, or 4 times, and for each of those cases other than 0 or 4, the + |
| repeats can match different numbers of times.) When the remainder of |
| the pattern is such that the entire match is going to fail, PCRE2 has |
| in principle to try every possible variation, and this can take an |
| extremely long time, even for relatively short strings. |
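| |
| The figure of 16 can be reproduced by counting the possibilities |
| directly. The sketch below is a deliberately naive recursion (the |
| function name is hypothetical) that counts every way in which (a+)* |
| can consume some leading part of a run of n "a" characters. |

```c
/* Number of ways ^(a+)* can match at the start of n consecutive "a"
   characters, counting every prefix that the group can consume. */
unsigned long count_ways(unsigned n)
{
    unsigned long ways = 1;             /* the * repeat stops here */
    unsigned k;

    for (k = 1; k <= n; k++)            /* the next a+ takes k characters... */
        ways += count_ways(n - k);      /* ...then (a+)* continues on the rest */
    return ways;
}
```

| count_ways(4) returns 16, and each additional character doubles the |
| result, which is why a failing match can take so long. |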
| |
| An optimization catches some of the more simple cases such as |
| |
| (a+)*b |
| |
| where a literal character follows. Before embarking on the standard |
| matching procedure, PCRE2 checks that there is a "b" later in the sub- |
| ject string, and if there is not, it fails the match immediately. How- |
| ever, when there is no following literal this optimization cannot be |
| used. You can see the difference by comparing the behaviour of |
| |
| (a+)*\d |
| |
| with the pattern above. The pattern (a+)*b fails almost instantly when |
| applied to a whole line of "a" characters, whereas (a+)*\d takes an |
| appreciable time with strings longer than about 20 characters. |
| |
| In many cases, the solution to this kind of performance issue is to use |
| an atomic group or a possessive quantifier. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 02 January 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SYNOPSIS |
| |
| #include <pcre2posix.h> |
| |
| int regcomp(regex_t *preg, const char *pattern, |
| int cflags); |
| |
| int regexec(const regex_t *preg, const char *string, |
| size_t nmatch, regmatch_t pmatch[], int eflags); |
| |
| size_t regerror(int errcode, const regex_t *preg, |
| char *errbuf, size_t errbuf_size); |
| |
| void regfree(regex_t *preg); |
| |
| |
| DESCRIPTION |
| |
| This set of functions provides a POSIX-style API for the PCRE2 regular |
| expression 8-bit library. See the pcre2api documentation for a descrip- |
| tion of PCRE2's native API, which contains much additional functional- |
| ity. There is no POSIX-style wrapper for PCRE2's 16-bit and 32-bit |
| libraries. |
| |
| The functions described here are just wrapper functions that ultimately |
| call the PCRE2 native API. Their prototypes are defined in the |
| pcre2posix.h header file, and on Unix systems the library itself is |
| called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to |
| the command for linking an application that uses them. Because the |
| POSIX functions call the native ones, it is also necessary to add |
| -lpcre2-8. |
| |
| Those POSIX option bits that can reasonably be mapped to PCRE2 native |
| options have been implemented. In addition, the option REG_EXTENDED is |
| defined with the value zero. This has no effect, but since programs |
| that are written to the POSIX interface often use it, this makes it |
| easier to slot in PCRE2 as a replacement library. Other POSIX options |
| are not even defined. |
| |
| There are also some other options that are not defined by POSIX. These |
| have been added at the request of users who want to make use of certain |
| PCRE2-specific features via the POSIX calling interface. |
| |
| When PCRE2 is called via these functions, it is only the API that is |
| POSIX-like in style. The syntax and semantics of the regular expres- |
| sions themselves are still those of Perl, subject to the setting of |
| various PCRE2 options, as described below. "POSIX-like in style" means |
| that the API approximates to the POSIX definition; it is not fully |
| POSIX-compatible, and in multi-unit encoding domains it is probably |
| even less compatible. |
| |
| The header for these functions is supplied as pcre2posix.h to avoid any |
| potential clash with other POSIX libraries. It can, of course, be |
| renamed or aliased as regex.h, which is the "correct" name. It provides |
| two structure types, regex_t for compiled internal forms, and reg- |
| match_t for returning captured substrings. It also defines some con- |
| stants whose names start with "REG_"; these are used for setting |
| options and identifying error codes. |
| |
| |
| COMPILING A PATTERN |
| |
| The function regcomp() is called to compile a pattern into an internal |
| form. The pattern is a C string terminated by a binary zero, and is |
| passed in the argument pattern. The preg argument is a pointer to a |
| regex_t structure that is used as a base for storing information about |
| the compiled regular expression. |
| |
| The argument cflags is either zero, or contains one or more of the bits |
| defined by the following macros: |
| |
| REG_DOTALL |
| |
| The PCRE2_DOTALL option is set when the regular expression is passed |
| for compilation to the native function. Note that REG_DOTALL is not |
| part of the POSIX standard. |
| |
| REG_ICASE |
| |
| The PCRE2_CASELESS option is set when the regular expression is passed |
| for compilation to the native function. |
| |
| REG_NEWLINE |
| |
| The PCRE2_MULTILINE option is set when the regular expression is passed |
| for compilation to the native function. Note that this does not mimic |
| the defined POSIX behaviour for REG_NEWLINE (see the following sec- |
| tion). |
| |
| REG_NOSUB |
| |
| The PCRE2_NO_AUTO_CAPTURE option is set when the regular expression is |
| passed for compilation to the native function. In addition, when a pat- |
| tern that is compiled with this flag is passed to regexec() for match- |
| ing, the nmatch and pmatch arguments are ignored, and no captured |
| strings are returned. |
| |
| REG_UCP |
| |
| The PCRE2_UCP option is set when the regular expression is passed for |
| compilation to the native function. This causes PCRE2 to use Unicode |
| properties when matching \d, \w, etc., instead of just recognizing |
| ASCII values. Note that REG_UCP is not part of the POSIX standard. |
| |
| REG_UNGREEDY |
| |
| The PCRE2_UNGREEDY option is set when the regular expression is passed |
| for compilation to the native function. Note that REG_UNGREEDY is not |
| part of the POSIX standard. |
| |
| REG_UTF |
| |
| The PCRE2_UTF option is set when the regular expression is passed for |
| compilation to the native function. This causes the pattern itself and |
| all data strings used for matching it to be treated as UTF-8 strings. |
| Note that REG_UTF is not part of the POSIX standard. |
| |
| In the absence of these flags, no options are passed to the native |
| function. This means that the regex is compiled with PCRE2 default |
| semantics. In particular, the way it handles newline characters in the |
| subject string is the Perl way, not the POSIX way. Note that setting |
| PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE. |
| It does not affect the way newlines are matched by the dot metacharac- |
| ter (they are not) or by a negative class such as [^a] (they are). |
| |
| The yield of regcomp() is zero on success, and non-zero otherwise. The |
| preg structure is filled in on success, and one member of the structure |
| is public: re_nsub contains the number of capturing subpatterns in the |
| regular expression. Various error codes are defined in the header file. |
| |
| NOTE: If the yield of regcomp() is non-zero, you must not attempt to |
| use the contents of the preg structure. If, for example, you pass it to |
| regexec(), the result is undefined and your program is likely to crash. |
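| |
| A minimal sketch of compiling with option bits and checking the yield |
| before using preg is shown here. It assumes that pcre2posix.h has been |
| made available under the name regex.h; the helper name icase_match is |
| hypothetical. |

```c
#include <regex.h>   /* assumption: pcre2posix.h installed or aliased as regex.h */
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches subject ignoring
   case, 0 if it does not, and -1 if the pattern fails to compile. */
int icase_match(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED has no effect with PCRE2; it is included so that the
       sketch also compiles against a system regex.h. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;                      /* re must not be used after a failure */

    /* With REG_NOSUB, nmatch and pmatch are ignored, so pass 0 and NULL. */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```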
| |
| |
| MATCHING NEWLINE CHARACTERS |
| |
| This area is not simple, because POSIX and Perl take different views of |
| things. It is not possible to get PCRE2 to obey POSIX semantics, but |
| then PCRE2 was never intended to be a POSIX engine. The following table |
| lists the different possibilities for matching newline characters in |
| Perl and PCRE2: |
| |
|                              Default   Change with |
| |
|   . matches newline          no        PCRE2_DOTALL |
|   newline matches [^a]       yes       not changeable |
|   $ matches \n at end        yes       PCRE2_DOLLAR_ENDONLY |
|   $ matches \n in middle     no        PCRE2_MULTILINE |
|   ^ matches \n in middle     no        PCRE2_MULTILINE |
| |
| This is the equivalent table for a POSIX-compatible pattern matcher: |
| |
|                              Default   Change with |
| |
|   . matches newline          yes       REG_NEWLINE |
|   newline matches [^a]       yes       REG_NEWLINE |
|   $ matches \n at end        no        REG_NEWLINE |
|   $ matches \n in middle     no        REG_NEWLINE |
|   ^ matches \n in middle     no        REG_NEWLINE |
| |
| This behaviour is not what happens when PCRE2 is called via its POSIX |
| API. By default, PCRE2's behaviour is the same as Perl's, except that |
| there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 |
| and Perl, there is no way to stop newline from matching [^a]. |
| |
| Default POSIX newline handling can be obtained by setting PCRE2_DOTALL |
| and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but |
| there is no way to make PCRE2 behave exactly as for the REG_NEWLINE |
| action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg- |
| comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(), |
| and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL- |
| LAR_ENDONLY. |
| |
| |
| MATCHING A PATTERN |
| |
| The function regexec() is called to match a compiled pattern preg |
| against a given string, which is by default terminated by a zero byte |
| (but see REG_STARTEND below), subject to the options in eflags. These |
| can be: |
| |
| REG_NOTBOL |
| |
| The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match- |
| ing function. |
| |
| REG_NOTEMPTY |
| |
| The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 |
| matching function. Note that REG_NOTEMPTY is not part of the POSIX |
| standard. However, setting this option can give more POSIX-like behav- |
| iour in some situations. |
| |
| REG_NOTEOL |
| |
| The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match- |
| ing function. |
| |
| REG_STARTEND |
| |
| The string is considered to start at string + pmatch[0].rm_so and to |
| have a terminating NUL located at string + pmatch[0].rm_eo (there need |
| not actually be a NUL at that location), regardless of the value of |
| nmatch. This is a BSD extension, compatible with but not specified by |
| IEEE Standard 1003.2 (POSIX.2), and should be used with caution in |
| software intended to be portable to other systems. Note that a non-zero |
| rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location |
| of the string, not how it is matched. Setting REG_STARTEND and passing |
| pmatch as NULL are mutually exclusive; the error REG_INVARG is |
| returned. |
| |
| If the pattern was compiled with the REG_NOSUB flag, no data about any |
| matched strings is returned. The nmatch and pmatch arguments of |
| regexec() are ignored. |
| |
| If the value of nmatch is zero, or if the value of pmatch is NULL, no data |
| about any matched strings is returned. |
| |
| Otherwise, the portion of the string that was matched, and also any |
| captured substrings, are returned via the pmatch argument, which points to |
| an array of nmatch structures of type regmatch_t, containing the mem- |
| bers rm_so and rm_eo. These contain the byte offset to the first char- |
| acter of each substring and the offset to the first character after the |
| end of each substring, respectively. The 0th element of the vector |
| relates to the entire portion of string that was matched; subsequent |
| elements relate to the capturing subpatterns of the regular expression. |
| Unused entries in the array have both structure members set to -1. |
| |
| A successful match yields a zero return; various error codes are |
| defined in the header file, of which REG_NOMATCH is the "expected" |
| failure code. |
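| |
| The way offsets come back through pmatch can be sketched as follows. |
| As before, this assumes that pcre2posix.h is available as regex.h, and |
| the helper name match_offsets is hypothetical. |

```c
#include <regex.h>   /* assumption: pcre2posix.h installed or aliased as regex.h */
#include <stddef.h>

/* Hypothetical helper: compiles pattern, matches it against subject and
   fills in up to nmatch entries of pmatch. Returns 0 on a match,
   REG_NOMATCH if there is no match, or another non-zero error code. */
int match_offsets(const char *pattern, const char *subject,
                  regmatch_t *pmatch, size_t nmatch)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);  /* REG_EXTENDED: no effect in PCRE2 */

    if (rc != 0)
        return rc;                      /* re must not be used after a failure */

    /* pmatch[0] describes the whole match, pmatch[1] the first capturing
       subpattern, and so on; unused entries are set to -1. */
    rc = regexec(&re, subject, nmatch, pmatch, 0);
    regfree(&re);
    return rc;
}
```

| Matching "c(a)t" against "the cat sat" with a three-element array sets |
| pmatch[0] to offsets 4 and 7, pmatch[1] to 5 and 6, and both members of |
| pmatch[2] to -1. |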
| |
| |
| ERROR MESSAGES |
| |
| The regerror() function maps a non-zero errorcode from either regcomp() |
| or regexec() to a printable message. If preg is not NULL, the error |
| should have arisen from the use of that structure. A message terminated |
| by a binary zero is placed in errbuf. If the buffer is too short, only |
| the first errbuf_size - 1 characters of the error message are used. The |
| yield of the function is the size of buffer needed to hold the whole |
| message, including the terminating zero. This value is greater than |
| errbuf_size if the message was truncated. |
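| |
| Because the yield is the size needed for the whole message, a caller |
| can ask for the size first and then allocate exactly that much. The |
| following sketch assumes that pcre2posix.h is available as regex.h; |
| the helper name regex_error_message is hypothetical. |

```c
#include <regex.h>   /* assumption: pcre2posix.h installed or aliased as regex.h */
#include <stdlib.h>

/* Hypothetical helper: returns the complete error message in a
   malloc()ed buffer, or NULL if memory cannot be obtained.
   The caller frees the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    char probe[16];
    /* The first call may truncate the text, but its yield is the size
       required for the whole message, including the terminating zero. */
    size_t needed = regerror(errcode, preg, probe, sizeof probe);
    char *msg = malloc(needed);

    if (msg != NULL)
        regerror(errcode, preg, msg, needed);
    return msg;
}
```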
| |
| |
| MEMORY USAGE |
| |
| Compiling a regular expression causes memory to be allocated and asso- |
| ciated with the preg structure. The function regfree() frees all such |
| memory, after which preg may no longer be used as a compiled expres- |
| sion. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 29 November 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 SAMPLE PROGRAM |
| |
| A simple, complete demonstration program to get you started with using |
| PCRE2 is supplied in the file pcre2demo.c in the src directory in the |
| PCRE2 distribution. A listing of this program is given in the pcre2demo |
| documentation. If you do not have a copy of the PCRE2 distribution, you |
| can save this listing to re-create the contents of pcre2demo.c. |
| |
| The demonstration program, which uses the PCRE2 8-bit library, compiles |
| the regular expression that is its first argument, and matches it |
| against the subject string in its second argument. No PCRE2 options are |
| set, and default character tables are used. If matching succeeds, the |
| program outputs the portion of the subject that matched, together with |
| the contents of any captured substrings. |
| |
| If the -g option is given on the command line, the program then goes on |
| to check for further matches of the same regular expression in the same |
| subject string. The logic is a little bit tricky because of the possi- |
| bility of matching an empty string. Comments in the code explain what |
| is going on. |
| |
| If PCRE2 is installed in the standard include and library directories |
| for your operating system, you should be able to compile the demonstra- |
| tion program using this command: |
| |
| gcc -o pcre2demo pcre2demo.c -lpcre2-8 |
| |
| If PCRE2 is installed elsewhere, you may need to add additional options |
| to the command line. For example, on a Unix-like system that has PCRE2 |
| installed in /usr/local, you can compile the demonstration program |
| using a command like this: |
| |
| gcc -o pcre2demo -I/usr/local/include pcre2demo.c \ |
| -L/usr/local/lib -lpcre2-8 |
| |
| |
| Once you have compiled and linked the demonstration program, you can |
| run simple tests like this: |
| |
| ./pcre2demo 'cat|dog' 'the cat sat on the mat' |
| ./pcre2demo -g 'cat|dog' 'the dog sat on the cat' |
| |
| Note that there is a much more comprehensive test program, called |
| pcre2test, which supports many more facilities for testing regular |
| expressions using the PCRE2 libraries. The pcre2demo program is pro- |
| vided as a simple coding example. |
| |
| If you try to run pcre2demo when PCRE2 is not installed in the standard |
| library directory, you may get an error like this on some operating |
| systems (e.g. Solaris): |
| |
| ld.so.1: a.out: fatal: libpcre2.so.0: open failed: No such file or |
| directory |
| |
| This is caused by the way shared library support works on those sys- |
| tems. You need to add |
| |
| -R/usr/local/lib |
| |
| (for example) to the compile command to get round this problem. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 20 October 2014 |
| Copyright (c) 1997-2014 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2SERIALIZE(3) Library Functions Manual PCRE2SERIALIZE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS |
| |
| int32_t pcre2_serialize_decode(pcre2_code **codes, |
| int32_t number_of_codes, const uint8_t *bytes, |
| pcre2_general_context *gcontext); |
| |
| int32_t pcre2_serialize_encode(pcre2_code **codes, |
| int32_t number_of_codes, uint8_t **serialized_bytes, |
| PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); |
| |
| void pcre2_serialize_free(uint8_t *bytes); |
| |
| int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); |
| |
| If you are running an application that uses a large number of regular |
| expression patterns, it may be useful to store them in a precompiled |
| form instead of having to compile them every time the application is |
| run. However, if you are using the just-in-time optimization feature, |
| it is not possible to save and reload the JIT data, because it is posi- |
| tion-dependent. The host on which the patterns are reloaded must be |
| running the same version of PCRE2, with the same code unit width, and |
| must also have the same endianness, pointer width and PCRE2_SIZE type. |
| For example, patterns compiled on a 32-bit system using PCRE2's 16-bit |
| library cannot be reloaded on a 64-bit system, nor can they be reloaded |
| using the 8-bit library. |
| |
| |
| SAVING COMPILED PATTERNS |
| |
| Before compiled patterns can be saved they must be serialized, that is, |
| converted to a stream of bytes. A single byte stream may contain any |
| number of compiled patterns, but they must all use the same character |
| tables. A single copy of the tables is included in the byte stream (its |
| size is 1088 bytes). For more details of character tables, see the sec- |
| tion on locale support in the pcre2api documentation. |
| |
| The function pcre2_serialize_encode() creates a serialized byte stream |
| from a list of compiled patterns. Its first two arguments specify the |
| list, being a pointer to a vector of pointers to compiled patterns, and |
| the length of the vector. The third and fourth arguments point to vari- |
| ables which are set to point to the created byte stream and its length, |
| respectively. The final argument is a pointer to a general context, |
| which can be used to specify custom memory management functions. If |
| this argument is NULL, malloc() is used to obtain memory for the byte |
| stream. The yield of the function is the number of serialized patterns, |
| or one of the following negative error codes: |
| |
| PCRE2_ERROR_BADDATA the number of patterns is zero or less |
| PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns |
| PCRE2_ERROR_MEMORY memory allocation failed |
| PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables |
| PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL |
| |
| PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor- |
| rupted, or that a slot in the vector does not point to a compiled pat- |
| tern. |
| |
| Once a set of patterns has been serialized you can save the data in any |
| appropriate manner. Here is sample code that compiles two patterns and |
| writes them to a file. It assumes that the variable fd refers to a file |
| that is open for output. The error checking that should be present in a |
| real application has been omitted for simplicity. |
| |
| int errorcode; |
| uint8_t *bytes; |
| PCRE2_SIZE erroroffset; |
| PCRE2_SIZE bytescount; |
| pcre2_code *list_of_codes[2]; |
| list_of_codes[0] = pcre2_compile("first pattern", |
| PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); |
| list_of_codes[1] = pcre2_compile("second pattern", |
| PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); |
| errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes, |
| &bytescount, NULL); |
| errorcode = fwrite(bytes, 1, bytescount, fd); |
| |
| Note that the serialized data is binary data that may contain any of |
| the 256 possible byte values. On systems that make a distinction |
| between binary and non-binary data, be sure that the file is opened for |
| binary output. |
| |
| Serializing a set of patterns leaves the original data untouched, so |
| they can still be used for matching. Their memory must eventually be |
| freed in the usual way by calling pcre2_code_free(). When you have fin- |
| ished with the byte stream, it too must be freed by calling pcre2_seri- |
| alize_free(). |
| |
| |
| RE-USING PRECOMPILED PATTERNS |
| |
| In order to re-use a set of saved patterns you must first make the |
| serialized byte stream available in main memory (for example, by read- |
| ing from a file). The management of this memory block is up to the |
| application. You can use the pcre2_serialize_get_number_of_codes() |
| function to find out how many compiled patterns are in the serialized |
| data without actually decoding the patterns: |
| |
| uint8_t *bytes = <serialized data>; |
| int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes); |
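| |
| One way of making a saved byte stream available in main memory is to |
| read the whole file into an allocated block. The sketch below uses only |
| standard C I/O; the helper name read_serialized_file is hypothetical, |
| and the block it returns is what would then be passed to |
| pcre2_serialize_get_number_of_codes() or pcre2_serialize_decode(). |

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file, opened in binary mode, into a
   malloc()ed block and stores its length in *size. Returns NULL on
   failure or if the file is empty; the caller frees the block. */
uint8_t *read_serialized_file(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");        /* binary mode: any byte value may occur */
    uint8_t *data = NULL;
    long length;

    if (f == NULL)
        return NULL;

    if (fseek(f, 0, SEEK_END) == 0 && (length = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        data = malloc((size_t)length);
        if (data != NULL &&
            fread(data, 1, (size_t)length, f) != (size_t)length) {
            free(data);                 /* short read: give up */
            data = NULL;
        }
        if (data != NULL)
            *size = (size_t)length;
    }
    fclose(f);
    return data;
}
```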
| |
| The pcre2_serialize_decode() function reads a byte stream and recreates |
| the compiled patterns in new memory blocks, setting pointers to them in |
| a vector. The first two arguments are a pointer to a suitable vector |
| and its length, and the third argument points to a byte stream. The |
| final argument is a pointer to a general context, which can be used to |
| specify custom memory management functions for the decoded patterns. |
| If this argument is NULL, malloc() and free() are used. After deserial- |
| ization, the byte stream is no longer needed and can be discarded. |
| |
| pcre2_code *list_of_codes[2]; |
| uint8_t *bytes = <serialized data>; |
| int32_t number_of_codes = |
| pcre2_serialize_decode(list_of_codes, 2, bytes, NULL); |
| |
| If the vector is not large enough for all the patterns in the byte |
| stream, it is filled with those that fit, and the remainder are |
| ignored. The yield of the function is the number of decoded patterns, |
| or one of the following negative error codes: |
| |
| PCRE2_ERROR_BADDATA second argument is zero or less |
| PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data |
| PCRE2_ERROR_BADMODE mismatch of variable unit size or PCRE2 version |
| PCRE2_ERROR_MEMORY memory allocation failed |
| PCRE2_ERROR_NULL first or third argument is NULL |
| |
| PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was |
| compiled on a system with different endianness. |
| |
| Decoded patterns can be used for matching in the usual way, and must be |
| freed by calling pcre2_code_free(). However, be aware that there is a |
| potential race issue if you are using multiple patterns that were |
| decoded from a single byte stream in a multithreaded application. A |
| single copy of the character tables is used by all the decoded patterns |
| and a reference count is used to arrange for its memory to be automati- |
| cally freed when the last pattern is freed, but there is no locking on |
| this reference count. Therefore, if you want to call pcre2_code_free() |
| for these patterns in different threads, you must arrange your own |
| locking, and ensure that pcre2_code_free() cannot be called by two |
| threads at the same time. |
| |
| If a pattern was processed by pcre2_jit_compile() before being serial- |
| ized, the JIT data is discarded and so is no longer available after a |
| save/restore cycle. You can, however, process a restored pattern with |
| pcre2_jit_compile() if you wish. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 03 November 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2STACK(3) Library Functions Manual PCRE2STACK(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 DISCUSSION OF STACK USAGE |
| |
| When you call pcre2_match(), it makes use of an internal function |
| called match(). This calls itself recursively at branch points in the |
| pattern, in order to remember the state of the match so that it can |
| back up and try a different alternative after a failure. As matching |
| proceeds deeper and deeper into the tree of possibilities, the recur- |
| sion depth increases. The match() function is also called in other cir- |
| cumstances, for example, whenever a parenthesized sub-pattern is |
| entered, and in certain cases of repetition. |
| |
| Not all calls of match() increase the recursion depth; for an item such |
| as a* it may be called several times at the same level, after matching |
| different numbers of a's. Furthermore, in a number of cases where the |
| result of the recursive call would immediately be passed back as the |
| result of the current call (a "tail recursion"), the function is just |
| restarted instead. |
| |
| Each time the internal match() function is called recursively, it uses |
| memory from the process stack. For certain kinds of pattern and data, |
| very large amounts of stack may be needed, despite the recognition of |
| "tail recursion". Note that if PCRE2 is compiled with the -fsani- |
| tize=address option of the GCC compiler, the stack requirements are |
| greatly increased. |
| |
| The above comments apply when pcre2_match() is run in its normal inter- |
| pretive manner. If the compiled pattern was processed by pcre2_jit_com- |
| pile(), and just-in-time compiling was successful, and the options |
| passed to pcre2_match() were not incompatible, the matching process |
| uses the JIT-compiled code instead of the match() function. In this |
| case, the memory requirements are handled entirely differently. See the |
| pcre2jit documentation for details. |
| |
| The pcre2_dfa_match() function operates in a different way to |
| pcre2_match(), and uses recursion only when there is a regular expres- |
| sion recursion or subroutine call in the pattern. This includes the |
| processing of assertion and "once-only" subpatterns, which are handled |
| like subroutine calls. Normally, these are never very deep, and the |
| limit on the complexity of pcre2_dfa_match() is controlled by the |
| amount of workspace it is given. However, it is possible to write pat- |
| terns with runaway infinite recursions; such patterns will cause |
| pcre2_dfa_match() to run out of stack. At present, there is no protec- |
| tion against this. |
| |
| The comments that follow do NOT apply to pcre2_dfa_match(); they are |
| relevant only for pcre2_match() without the JIT optimization. |
| |
| Reducing pcre2_match()'s stack usage |
| |
| You can often reduce the amount of recursion, and therefore the amount |
| of stack used, by modifying the pattern that is being matched. Con- |
| sider, for example, this pattern: |
| |
| ([^<]|<(?!inet))+ |
| |
| It matches from wherever it starts until it encounters "<inet" or the |
| end of the data, and is the kind of pattern that might be used when |
| processing an XML file. Each iteration of the outer parentheses matches |
| either one character that is not "<" or a "<" that is not followed by |
| "inet". However, each time a parenthesis is processed, a recursion |
| occurs, so this formulation uses a stack frame for each matched charac- |
| ter. For a long string, a lot of stack is required. Consider now this |
| rewritten pattern, which matches exactly the same strings: |
| |
| ([^<]++|<(?!inet))+ |
| |
| This uses very much less stack, because runs of characters that do not |
| contain "<" are "swallowed" in one item inside the parentheses. Recur- |
| sion happens only when a "<" character that is not followed by "inet" |
| is encountered (and we assume this is relatively rare). A possessive |
| quantifier is used to stop any backtracking into the runs of non-"<" |
| characters, but that is not related to stack usage. |
| |
| This example shows that one way of avoiding stack problems when match- |
| ing long subject strings is to write repeated parenthesized subpatterns |
| to match more than one character whenever possible. |
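| |
| The difference between the two formulations can be pictured with a |
| small C sketch that counts how many times the outer group would be |
| entered for each pattern. This is a simplified model, not PCRE2's own |
| accounting, and the function names are hypothetical: |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the prefix that both patterns match: everything up to the
   first "<inet" or the end of the data. */
static size_t matched_length(const char *s)
{
    const char *stop = strstr(s, "<inet");
    return stop ? (size_t)(stop - s) : strlen(s);
}

/* ([^<]|<(?!inet))+ : one group iteration per matched character. */
size_t iterations_per_char(const char *s)
{
    assert(s != NULL);
    return matched_length(s);
}

/* ([^<]++|<(?!inet))+ : one iteration per run of non-"<" characters,
   plus one per "<" that is not followed by "inet". */
size_t iterations_per_run(const char *s)
{
    size_t n, i = 0, count = 0;
    assert(s != NULL);
    n = matched_length(s);
    while (i < n) {
        if (s[i] == '<') {
            i++;                      /* a lone "<" */
        } else {
            while (i < n && s[i] != '<')
                i++;                  /* swallow the whole run */
        }
        count++;
    }
    return count;
}
```
| |
| For the subject "abc<b>def<inet", this model gives 9 iterations for |
| the first pattern and 3 for the second. |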
| |
| Compiling PCRE2 to use heap instead of stack for pcre2_match() |
| |
| In environments where stack memory is constrained, you might want to |
| compile PCRE2 to use heap memory instead of stack for remembering back- |
| up points when pcre2_match() is running. This makes it run more slowly, |
| however. Details of how to do this are given in the pcre2build documen- |
| tation. When built in this way, instead of using the stack, PCRE2 gets |
| memory for remembering backup points from the heap. By default, the |
| memory is obtained by calling the system malloc() function, but you can |
| arrange to supply your own memory management function. For details, see |
| the section entitled "The match context" in the pcre2api documentation. |
| Since the block sizes are always the same, it may be possible to imple- |
| ment a customized memory handler that is more efficient than the stan- |
| dard function. The memory blocks obtained for this purpose are retained |
| and re-used if possible while pcre2_match() is running. They are all |
| freed just before it exits. |
| |
| Limiting pcre2_match()'s stack usage |
| |
| You can set limits on the number of times the internal match() function |
| is called, both in total and recursively. If a limit is exceeded, |
| pcre2_match() returns an error code. Setting suitable limits should |
| prevent it from running out of stack. The default values of the limits |
| are very large, and unlikely ever to operate. They can be changed when |
| PCRE2 is built, and they can also be set when pcre2_match() is called. |
| For details of these interfaces, see the pcre2build documentation and |
| the section entitled "The match context" in the pcre2api documentation. |
| |
| As a very rough rule of thumb, you should reckon on about 500 bytes per |
| recursion. Thus, if you want to limit your stack usage to 8Mb, you |
| should set the limit at 16000 recursions. A 64Mb stack, on the other |
| hand, can support around 128000 recursions. |
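| |
| The arithmetic behind this rule of thumb can be written as a small |
| helper. This is only a sketch: the function name is hypothetical, and |
| the per-recursion cost is passed in by the caller because the 500-byte |
| figure is a rough estimate. The figures in the text treat 1Mb as |
| 1000000 bytes. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Return a recursion limit that should keep stack usage within
   stack_bytes, given an estimated stack cost per recursion. The result
   is clamped to the range of a uint32_t. */
uint32_t recursion_limit_for_stack(uint64_t stack_bytes,
                                   uint32_t bytes_per_recursion)
{
    uint64_t limit;
    assert(bytes_per_recursion > 0);
    limit = stack_bytes / bytes_per_recursion;
    return limit > UINT32_MAX ? UINT32_MAX : (uint32_t)limit;
}
```
| |
| With a budget of 8000000 bytes and 500 bytes per recursion, the helper |
| returns 16000. The result could then be given to |
| pcre2_set_recursion_limit(), which is described in the pcre2api |
| documentation. |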
| |
| The pcre2test test program has a modifier called "find_limits" which, |
| if applied to a subject line, causes it to find the smallest limits |
| that allow a pattern to match. This is done by calling pcre2_match() |
| repeatedly with different limits. |
| |
| Changing stack size in Unix-like systems |
| |
| In Unix-like environments, there is not often a problem with the stack |
| unless very long strings are involved, though the default limit on |
| stack size varies from system to system. Values from 8Mb to 64Mb are |
| common. You can find your default limit by running the command: |
| |
| ulimit -s |
| |
| Unfortunately, the effect of running out of stack is often SIGSEGV, |
| though sometimes a more explicit error message is given. You can nor- |
| mally increase the limit on stack size by code such as this: |
| |
| struct rlimit rlim; |
| getrlimit(RLIMIT_STACK, &rlim); |
| rlim.rlim_cur = 100*1024*1024; |
| setrlimit(RLIMIT_STACK, &rlim); |
| |
| This reads the current limits (soft and hard) using getrlimit(), then |
| attempts to increase the soft limit to 100Mb using setrlimit(). You |
| must do this before calling pcre2_match(). |
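| |
| A slightly more defensive variant of the same idea is sketched below. |
| The function name raise_stack_limit() is hypothetical; it checks the |
| return values and clamps the request to the hard limit, on the |
| assumption that a POSIX <sys/resource.h> is available: |
| |
```c
#include <assert.h>
#include <sys/resource.h>

/* Try to raise the soft stack limit to want_bytes, never exceeding the
   hard limit. Returns 0 on success (or if the soft limit is already
   large enough), -1 if getrlimit() or setrlimit() fails. */
int raise_stack_limit(rlim_t want_bytes)
{
    struct rlimit rlim;
    assert(want_bytes > 0);
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= want_bytes)
        return 0;                         /* nothing to do */
    if (rlim.rlim_max != RLIM_INFINITY && want_bytes > rlim.rlim_max)
        want_bytes = rlim.rlim_max;       /* clamp to the hard limit */
    rlim.rlim_cur = want_bytes;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```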
| |
| Changing stack size in Mac OS X |
| |
| Using setrlimit(), as described above, should also work on Mac OS X. It |
| is also possible to set a stack size when linking a program. There is a |
| discussion about stack sizes in Mac OS X at this web site: |
| http://developer.apple.com/qa/qa2005/qa1419.html. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 21 November 2014 |
| Copyright (c) 1997-2014 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY |
| |
| The full syntax and semantics of the regular expressions that are sup- |
| ported by PCRE2 are described in the pcre2pattern documentation. This |
| document contains a quick-reference summary of the syntax. |
| |
| |
| QUOTING |
| |
| \x where x is non-alphanumeric is a literal x |
| \Q...\E treat enclosed characters as literal |
| |
| |
| ESCAPED CHARACTERS |
| |
| This table applies to ASCII and Unicode environments. |
| |
| \a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is any ASCII printing character |
| \e escape (hex 1B) |
| \f form feed (hex 0C) |
| \n newline (hex 0A) |
| \r carriage return (hex 0D) |
| \t tab (hex 09) |
| \0dd character with octal code 0dd |
| \ddd character with octal code ddd, or backreference |
| \o{ddd..} character with octal code ddd.. |
| \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error) |
| \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set) |
| \xhh character with hex code hh |
| \x{hhh..} character with hex code hhh.. |
| |
| Note that \0dd is always an octal code. The treatment of backslash fol- |
| lowed by a non-zero digit is complicated; for details see the section |
| "Non-printing characters" in the pcre2pattern documentation, where |
| details of escape processing in EBCDIC environments are also given. |
| |
| When \x is not followed by {, from zero to two hexadecimal digits are |
| read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec- |
| imal digits to be recognized as a hexadecimal escape; otherwise it |
| matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol- |
| lowed by four hexadecimal digits, it matches a literal "u". |
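| |
| The rule for unbraced \x can be sketched in C as follows. The function |
| and its alt_bsux flag are hypothetical stand-ins, not part of the |
| PCRE2 API, and the sketch assumes that \x with no digits yields the |
| value zero: |
| |
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Numeric value of a character already known to be a hex digit. */
static unsigned hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return (unsigned)(c - '0');
    return (unsigned)(tolower((unsigned char)c) - 'a' + 10);
}

/* p points just past "\x" and is known not to point at '{'. Stores the
   resulting character value in *value and returns the number of hex
   digits consumed. alt_bsux stands in for PCRE2_ALT_BSUX. */
size_t parse_unbraced_hex(const char *p, int alt_bsux, unsigned *value)
{
    size_t n = 0;
    unsigned v = 0;
    assert(p != NULL && value != NULL);
    while (n < 2 && isxdigit((unsigned char)p[n])) {
        v = v * 16 + hex_value(p[n]);
        n++;
    }
    if (alt_bsux && n < 2) {
        *value = 'x';             /* not a hex escape: literal "x" */
        return 0;
    }
    *value = v;                   /* assumption: no digits gives 0 */
    return n;
}
```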
| |
| |
| CHARACTER TYPES |
| |
| . any character except newline; |
| in dotall mode, any character whatsoever |
| \C one code unit, even in UTF mode (best avoided) |
| \d a decimal digit |
| \D a character that is not a decimal digit |
| \h a horizontal white space character |
| \H a character that is not a horizontal white space character |
| \N a character that is not a newline |
| \p{xx} a character with the xx property |
| \P{xx} a character without the xx property |
| \R a newline sequence |
| \s a white space character |
| \S a character that is not a white space character |
| \v a vertical white space character |
| \V a character that is not a vertical white space character |
| \w a "word" character |
| \W a "non-word" character |
| \X a Unicode extended grapheme cluster |
| |
| \C is dangerous because it may leave the current matching point in the |
| middle of a UTF-8 or UTF-16 character. The application can lock out the |
| use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also |
| possible to build PCRE2 with the use of \C permanently disabled. |
| |
| By default, \d, \s, and \w match only ASCII characters, even in UTF-8 |
| mode or in the 16-bit and 32-bit libraries. However, if locale-specific |
| matching is happening, \s and \w may also match characters with code |
| points in the range 128-255. If the PCRE2_UCP option is set, the behav- |
| iour of these escape sequences is changed to use Unicode properties and |
| they match many more characters. |
| |
| |
| GENERAL CATEGORY PROPERTIES FOR \p and \P |
| |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| |
| L Letter |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| L& Ll, Lu, or Lt |
| |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| |
| |
| PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P |
| |
| Xan Alphanumeric: union of properties L and N |
| Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| Xsp Perl space: property Z or tab, NL, VT, FF, CR |
| Xuc Universally-named character: one that can be |
| represented by a Universal Character Name |
| Xwd Perl word: property Xan or underscore |
| |
| Perl and POSIX space are now the same. Perl added VT to its space char- |
| acter set at release 5.18. |
| |
| |
| SCRIPT NAMES FOR \p AND \P |
| |
| Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese, |
| Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, |
| Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham, |
| Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, |
| Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Geor- |
| gian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, |
| Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, |
| Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan- |
| nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, |
| Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha- |
| jani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui, |
| Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, |
| Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, |
| Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, |
| Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, |
| Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, |
| Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala, |
| Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
| Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, |
| Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi. |
| |
| |
| CHARACTER CLASSES |
| |
| [...] positive character class |
| [^...] negative character class |
| [x-y] range (can be used for hex characters) |
| [[:xxx:]] positive POSIX named set |
| [[:^xxx:]] negative POSIX named set |
| |
| alnum alphanumeric |
| alpha alphabetic |
| ascii 0-127 |
| blank space or tab |
| cntrl control character |
| digit decimal digit |
| graph printing, excluding space |
| lower lower case letter |
| print printing, including space |
| punct printing, excluding alphanumeric |
| space white space |
| upper upper case letter |
| word same as \w |
| xdigit hexadecimal digit |
| |
| In PCRE2, POSIX character set names recognize only ASCII characters by |
| default, but some of them use Unicode properties if PCRE2_UCP is set. |
| You can use \Q...\E inside a character class. |
| |
| |
| QUANTIFIERS |
| |
| ? 0 or 1, greedy |
| ?+ 0 or 1, possessive |
| ?? 0 or 1, lazy |
| * 0 or more, greedy |
| *+ 0 or more, possessive |
| *? 0 or more, lazy |
| + 1 or more, greedy |
| ++ 1 or more, possessive |
| +? 1 or more, lazy |
| {n} exactly n |
| {n,m} at least n, no more than m, greedy |
| {n,m}+ at least n, no more than m, possessive |
| {n,m}? at least n, no more than m, lazy |
| {n,} n or more, greedy |
| {n,}+ n or more, possessive |
| {n,}? n or more, lazy |
| |
| |
| ANCHORS AND SIMPLE ASSERTIONS |
| |
| \b word boundary |
| \B not a word boundary |
| ^ start of subject |
| also after an internal newline in multiline mode |
| (after any newline if PCRE2_ALT_CIRCUMFLEX is set) |
| \A start of subject |
| $ end of subject |
| also before newline at end of subject |
| also before internal newline in multiline mode |
| \Z end of subject |
| also before newline at end of subject |
| \z end of subject |
| \G first matching position in subject |
| |
| |
| MATCH POINT RESET |
| |
| \K reset start of match |
| |
| \K is honoured in positive assertions, but ignored in negative ones. |
| |
| |
| ALTERNATION |
| |
| expr|expr|expr... |
| |
| |
| CAPTURING |
| |
| (...) capturing group |
| (?<name>...) named capturing group (Perl) |
| (?'name'...) named capturing group (Perl) |
| (?P<name>...) named capturing group (Python) |
| (?:...) non-capturing group |
| (?|...) non-capturing group; reset group numbers for |
| capturing groups in each alternative |
| |
| |
| ATOMIC GROUPS |
| |
| (?>...) atomic, non-capturing group |
| |
| |
| COMMENT |
| |
| (?#....) comment (not nestable) |
| |
| |
| OPTION SETTING |
| |
| (?i) caseless |
| (?J) allow duplicate names |
| (?m) multiline |
| (?s) single line (dotall) |
| (?U) default ungreedy (lazy) |
| (?x) extended (ignore white space) |
| (?-...) unset option(s) |
| |
| The following are recognized only at the very start of a pattern or |
| after one of the newline or \R options with similar syntax. More than |
| one of them may appear. |
| |
| (*LIMIT_MATCH=d) set the match limit to d (decimal number) |
| (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) |
| (*NOTEMPTY) set PCRE2_NOTEMPTY when matching |
| (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching |
| (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) |
| (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) |
| (*NO_JIT) disable JIT optimization |
| (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) |
| (*UTF) set appropriate UTF mode for the library in use |
| (*UCP) set PCRE2_UCP (use Unicode properties for \d etc) |
| |
| Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of |
| the limits set by the caller of pcre2_match(), not increase them. The |
| application can lock out the use of (*UTF) and (*UCP) by setting the |
| PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile |
| time. |
| |
| |
| NEWLINE CONVENTION |
| |
| These are recognized only at the very start of the pattern or after |
| option settings with a similar syntax. |
| |
| (*CR) carriage return only |
| (*LF) linefeed only |
| (*CRLF) carriage return followed by linefeed |
| (*ANYCRLF) all three of the above |
| (*ANY) any Unicode newline sequence |
| |
| |
| WHAT \R MATCHES |
| |
| These are recognized only at the very start of the pattern or after |
| option setting with a similar syntax. |
| |
| (*BSR_ANYCRLF) CR, LF, or CRLF |
| (*BSR_UNICODE) any Unicode newline sequence |
| |
| |
| LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
| |
| (?=...) positive look ahead |
| (?!...) negative look ahead |
| (?<=...) positive look behind |
| (?<!...) negative look behind |
| |
| Each top-level branch of a look behind must be of a fixed length. |
| |
| |
| BACKREFERENCES |
| |
| \n reference by number (can be ambiguous) |
| \gn reference by number |
| \g{n} reference by number |
| \g{-n} relative reference by number |
| \k<name> reference by name (Perl) |
| \k'name' reference by name (Perl) |
| \g{name} reference by name (Perl) |
| \k{name} reference by name (.NET) |
| (?P=name) reference by name (Python) |
| |
| |
| SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) |
| |
| (?R) recurse whole pattern |
| (?n) call subpattern by absolute number |
| (?+n) call subpattern by relative number |
| (?-n) call subpattern by relative number |
| (?&name) call subpattern by name (Perl) |
| (?P>name) call subpattern by name (Python) |
| \g<name> call subpattern by name (Oniguruma) |
| \g'name' call subpattern by name (Oniguruma) |
| \g<n> call subpattern by absolute number (Oniguruma) |
| \g'n' call subpattern by absolute number (Oniguruma) |
| \g<+n> call subpattern by relative number (PCRE2 extension) |
| \g'+n' call subpattern by relative number (PCRE2 extension) |
| \g<-n> call subpattern by relative number (PCRE2 extension) |
| \g'-n' call subpattern by relative number (PCRE2 extension) |
| |
| |
| CONDITIONAL PATTERNS |
| |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| |
| (?(n) absolute reference condition |
| (?(+n) relative reference condition |
| (?(-n) relative reference condition |
| (?(<name>) named reference condition (Perl) |
| (?('name') named reference condition (Perl) |
| (?(name) named reference condition (PCRE2) |
| (?(R) overall recursion condition |
| (?(Rn) specific group recursion condition |
| (?(R&name) specific recursion condition |
| (?(DEFINE) define subpattern for reference |
| (?(VERSION[>]=n.m) test PCRE2 version |
| (?(assert) assertion condition |
| |
| |
| BACKTRACKING CONTROL |
| |
| The following act immediately they are reached: |
| |
| (*ACCEPT) force successful match |
| (*FAIL) force backtrack; synonym (*F) |
| (*MARK:NAME) set name to be passed back; synonym (*:NAME) |
| |
| The following act only when a subsequent match failure causes a back- |
| track to reach them. They all force a match failure, but they differ in |
| what happens afterwards. Those that advance the start-of-match point do |
| so only if the pattern is not anchored. |
| |
| (*COMMIT) overall failure, no advance of starting point |
| (*PRUNE) advance to next starting character |
| (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) |
| (*SKIP) advance to current matching position |
| (*SKIP:NAME) advance to position corresponding to an earlier |
| (*MARK:NAME); if not found, the (*SKIP) is ignored |
| (*THEN) local failure, backtrack to next alternation |
| (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) |
| |
| |
| CALLOUTS |
| |
| (?C) callout (assumed number 0) |
| (?Cn) callout with numerical data n |
| (?C"text") callout with string data |
| |
| The allowed string delimiters are ` ' " ^ % # $ (which are the same for |
| the start and the end), and the starting delimiter { matched with the |
| ending delimiter }. To encode the ending delimiter within the string, |
| double it. |
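| |
| As an illustration of the doubling rule, the following C sketch builds |
| a string callout item from arbitrary text. The helper name |
| format_callout() is hypothetical and is not part of PCRE2: |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write "(?C<open>text<close>)" into out, doubling every occurrence of
   the ending delimiter inside text. open must be one of the allowed
   starting delimiters. Returns 0 on success, -1 if out is too small. */
int format_callout(char *out, size_t outsize, const char *text, char open)
{
    char close = (open == '{') ? '}' : open;
    size_t need = 7;               /* "(?C", two delimiters, ")", NUL */
    const char *t;
    char *p = out;

    assert(out != NULL && text != NULL);
    assert(open != '\0' && strchr("`'\"^%#${", open) != NULL);

    for (t = text; *t != '\0'; t++)
        need += (*t == close) ? 2 : 1;
    if (need > outsize)
        return -1;

    memcpy(p, "(?C", 3);
    p += 3;
    *p++ = open;
    for (t = text; *t != '\0'; t++) {
        if (*t == close)
            *p++ = close;          /* double the ending delimiter */
        *p++ = *t;
    }
    *p++ = close;
    *p++ = ')';
    *p = '\0';
    return 0;
}
```
| |
| For example, the text say "hi" with the delimiter " produces the item |
| (?C"say ""hi"""), and the text a}b with the delimiter { produces |
| (?C{a}}b}). |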
| |
| |
| SEE ALSO |
| |
| pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), |
| pcre2(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| UNICODE AND UTF SUPPORT |
| |
| When PCRE2 is built with Unicode support (which is the default), it has |
| knowledge of Unicode character properties and can process text strings |
| in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). |
| However, by default, PCRE2 assumes that one code unit is one character. |
| To process a pattern as a UTF string, where a character may require |
| more than one code unit, you must call pcre2_compile() with the |
| PCRE2_UTF option flag, or the pattern must start with the sequence |
| (*UTF). When either of these is the case, both the pattern and any sub- |
| ject strings that are matched against it are treated as UTF strings |
| instead of strings of individual one-code-unit characters. |
| |
| If you do not need Unicode support you can build PCRE2 without it, in |
| which case the library will be smaller. |
| |
| |
| UNICODE PROPERTY SUPPORT |
| |
| When PCRE2 is built with Unicode support, the escape sequences \p{..}, |
| \P{..}, and \X can be used. The Unicode properties that can be tested |
| are limited to the general category properties such as Lu for an upper |
| case letter or Nd for a decimal number, the Unicode script names such |
| as Arabic or Han, and the derived properties Any and L&. Full lists are |
| given in the pcre2pattern and pcre2syntax documentation. Only the short |
| names for properties are supported. For example, \p{L} matches a let- |
| ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in |
| Perl, many properties may optionally be prefixed by "Is", for compati- |
| bility with Perl 5.6. PCRE2 does not support this. |
| |
| |
| WIDE CHARACTERS AND UTF MODES |
| |
| Codepoints less than 256 can be specified in patterns by either braced |
| or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). |
| Larger values have to use braced sequences. Unbraced octal code points |
| up to \777 are also recognized; larger ones can be coded using \o{...}. |
| |
| In UTF modes, repeat quantifiers apply to complete UTF characters, not |
| to individual code units. |
| |
| In UTF modes, the dot metacharacter matches one UTF character instead |
| of a single code unit. |
| |
| The escape sequence \C can be used to match a single code unit, in a |
| UTF mode, but its use can lead to some strange effects because it |
| breaks up multi-unit characters (see the description of \C in the |
| pcre2pattern documentation). The use of \C is not supported by the |
| alternative matching function pcre2_dfa_match() when in UTF mode. Its |
| use provokes a match-time error. The JIT optimization also does not |
| support \C in UTF mode. If JIT optimization is requested for a UTF |
| pattern that contains \C, it will not succeed, and so the matching will |
| be carried out by the normal interpretive function. |
| |
| The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test |
| characters of any code value, but, by default, the characters that |
| PCRE2 recognizes as digits, spaces, or word characters remain the same |
| set as in non-UTF mode, all with code points less than 256. This |
| remains true even when PCRE2 is built to include Unicode support, |
| because to do otherwise would slow down matching in many common cases. |
| Note that this also applies to \b and \B, because they are defined in |
| terms of \w and \W. If you want to test for a wider sense of, say, |
| "digit", you can use explicit Unicode property tests such as \p{Nd}. |
| Alternatively, if you set the PCRE2_UCP option, the way that the char- |
| acter escapes work is changed so that Unicode properties are used to |
| determine which characters match. There are more details in the section |
| on generic character types in the pcre2pattern documentation. |
| |
| Similarly, characters that match the POSIX named character classes are |
| all low-valued characters, unless the PCRE2_UCP option is set. |
| |
| However, the special horizontal and vertical white space matching |
| escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
| acters, whether or not PCRE2_UCP is set. |
| |
| Case-insensitive matching in UTF mode makes use of Unicode properties. |
| A few Unicode characters such as Greek sigma have more than two code- |
| points that are case-equivalent, and these are treated as such. |
| |
| |
| VALIDITY OF UTF STRINGS |
| |
| When the PCRE2_UTF option is set, the strings passed as patterns and |
| subjects are (by default) checked for validity on entry to the relevant |
| functions. If an invalid UTF string is passed, a negative error code |
| is returned. The code unit offset to the offending character can be |
| extracted from the match data block by calling pcre2_get_startchar(), |
| which is used for this purpose after a UTF error. |
| |
| UTF-16 and UTF-32 strings can indicate their endianness by a special |
| code known as a byte-order mark (BOM). The PCRE2 functions do not handle |
| this, expecting strings to be in host byte order. |
| |
| A UTF string is checked before any other processing takes place. In the |
| case of pcre2_match() and pcre2_dfa_match() calls with a non-zero |
| starting offset, the check is applied only to that part of the subject |
| that could be inspected during matching, and there is a check that the |
| starting offset points to the first code unit of a character or to the |
| end of the subject. If there are no lookbehind assertions in the pat- |
| tern, the check starts at the starting offset. Otherwise, it starts at |
| the length of the longest lookbehind before the starting offset, or at |
| the start of the subject if there are not that many characters before |
| the starting offset. Note that the sequences \b and \B are one-charac- |
| ter lookbehinds. |
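| |
| For an 8-bit UTF-8 subject, the starting point of the check described |
| above could be computed along these lines. This is an illustrative |
| sketch with a hypothetical function name, and max_lookbehind is |
| assumed to be the longest lookbehind in the pattern, in characters: |
| |
```c
#include <assert.h>
#include <stddef.h>

/* Return the code unit offset at which validity checking would begin:
   max_lookbehind characters before start_offset, or 0 if there are
   fewer characters than that before start_offset. Assumes that
   start_offset is at a character boundary. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;
    assert(subject != NULL);
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Step back over continuation bytes (binary 10xxxxxx). */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```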
| |
| In addition to checking the format of the string, there is a check to |
| ensure that all code points lie in the range U+0 to U+10FFFF, excluding |
| the surrogate area. The so-called "non-character" code points are not |
| excluded because Unicode corrigendum #9 makes it clear that they should |
| not be. |
| |
| Characters in the "Surrogate Area" of Unicode are reserved for use by |
| UTF-16, where they are used in pairs to encode code points with values |
| greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
| are available independently in the UTF-8 and UTF-32 encodings. (In |
| other words, the whole surrogate thing is a fudge for UTF-16 which |
| unfortunately messes up UTF-8 and UTF-32.) |
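| |
| The pairing scheme is the standard UTF-16 one and can be sketched in C |
| as below; the function name is hypothetical and the code is not taken |
| from PCRE2: |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a code point in the range U+10000 to U+10FFFF into a UTF-16
   surrogate pair. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10ffff);
    assert(high != NULL && low != NULL);
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xd800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xdc00 + (cp & 0x3ff));  /* bottom 10 bits */
}
```
| |
| For example, U+1F600 is encoded as the pair 0xd83d, 0xde00. |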
| |
| In some situations, you may already know that your strings are valid, |
| and therefore want to skip these checks in order to improve perfor- |
| mance, for example in the case of a long subject string that is being |
| scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com- |
| pile time or at match time, PCRE2 assumes that the pattern or subject |
| it is given (respectively) contains only valid UTF code unit sequences. |
| |
| Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check |
| for the pattern; it does not also apply to subject strings. If you want |
| to disable the check for a subject string you must pass this option to |
| pcre2_match() or pcre2_dfa_match(). |
| |
| If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the |
| result is undefined and your program may crash or loop indefinitely. |
| |
| Errors in UTF-8 strings |
| |
| The following negative error codes are given for invalid UTF-8 strings: |
| |
| PCRE2_ERROR_UTF8_ERR1 |
| PCRE2_ERROR_UTF8_ERR2 |
| PCRE2_ERROR_UTF8_ERR3 |
| PCRE2_ERROR_UTF8_ERR4 |
| PCRE2_ERROR_UTF8_ERR5 |
| |
| The string ends with a truncated UTF-8 character; the code specifies |
| how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 |
| characters to be no longer than 4 bytes, the encoding scheme (origi- |
| nally defined by RFC 2279) allows for up to 6 bytes, and this is |
| checked first; hence the possibility of 4 or 5 missing bytes. |
| |
| PCRE2_ERROR_UTF8_ERR6 |
| PCRE2_ERROR_UTF8_ERR7 |
| PCRE2_ERROR_UTF8_ERR8 |
| PCRE2_ERROR_UTF8_ERR9 |
| PCRE2_ERROR_UTF8_ERR10 |
| |
| The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of |
| the character do not have the binary value 0b10 (that is, either the |
| most significant bit is 0, or the next bit is 1). |
| |
| PCRE2_ERROR_UTF8_ERR11 |
| PCRE2_ERROR_UTF8_ERR12 |
| |
| A character that is valid by the RFC 2279 rules is either 5 or 6 bytes |
| long; these code points are excluded by RFC 3629. |
| |
| PCRE2_ERROR_UTF8_ERR13 |
| |
| A 4-byte character has a value greater than 0x10ffff; these code points |
| are excluded by RFC 3629. |
| |
| PCRE2_ERROR_UTF8_ERR14 |
| |
| A 3-byte character has a value in the range 0xd800 to 0xdfff; this |
| range of code points is reserved by RFC 3629 for use with UTF-16, and |
| so are excluded from UTF-8. |
| |
| PCRE2_ERROR_UTF8_ERR15 |
| PCRE2_ERROR_UTF8_ERR16 |
| PCRE2_ERROR_UTF8_ERR17 |
| PCRE2_ERROR_UTF8_ERR18 |
| PCRE2_ERROR_UTF8_ERR19 |
| |
| A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes |
| for a value that can be represented by fewer bytes, which is invalid. |
| For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor- |
| rect coding uses just one byte. |
| |
| PCRE2_ERROR_UTF8_ERR20 |
| |
| The two most significant bits of the first byte of a character have the |
| binary value 0b10 (that is, the most significant bit is 1 and the sec- |
| ond is 0). Such a byte can only validly occur as the second or subse- |
| quent byte of a multi-byte character. |
| |
| PCRE2_ERROR_UTF8_ERR21 |
| |
| The first byte of a character has the value 0xfe or 0xff. These values |
| can never occur in a valid UTF-8 string. |
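| |
| Several of these errors depend on what the first byte of a character |
| announces. The following C sketch, with hypothetical function names, |
| classifies a first byte under the RFC 2279 scheme and reports how many |
| bytes are missing from a truncated character. It is not PCRE2's own |
| validation code: |
| |
```c
#include <assert.h>

/* Returns the total sequence length (1 to 6) implied by a first byte,
   0 for an isolated continuation byte (the ERR20 case), or -1 for
   0xfe and 0xff (the ERR21 case). */
int utf8_lead_byte_length(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx */
    if (b < 0xc0) return 0;            /* 10xxxxxx */
    if (b < 0xe0) return 2;            /* 110xxxxx */
    if (b < 0xf0) return 3;            /* 1110xxxx */
    if (b < 0xf8) return 4;            /* 11110xxx */
    if (b < 0xfc) return 5;            /* 111110xx */
    if (b < 0xfe) return 6;            /* 1111110x */
    return -1;                         /* 0xfe or 0xff */
}

/* Number of bytes missing when only `have` bytes of a character are
   present at the end of the string (the ERR1 to ERR5 cases), or 0 if
   the character is complete. */
int utf8_missing_bytes(unsigned char lead, int have)
{
    int len = utf8_lead_byte_length(lead);
    assert(len >= 1 && have >= 1);
    return have < len ? len - have : 0;
}
```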
| |
| Errors in UTF-16 strings |
| |
| The following negative error codes are given for invalid UTF-16 |
| strings: |
| |
| PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string |
| PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate |
| PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate |
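| |
| The three conditions can be detected with a scan such as the one |
| sketched below. The function name and its return values 1 to 3 are |
| hypothetical; they mirror the order of the list above but are not the |
| numeric values of the PCRE2 error codes: |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a UTF-16 string in host byte order. Returns 0 if it is valid,
   otherwise 1, 2, or 3 as described in the comments below. On error,
   *erroroffset is the code unit offset of the offending unit; it is
   meaningful only when the result is non-zero. */
int utf16_check(const uint16_t *s, size_t len, size_t *erroroffset)
{
    size_t i;
    assert(s != NULL && erroroffset != NULL);
    for (i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xf800) != 0xd800)
            continue;                    /* not a surrogate */
        *erroroffset = i;
        if ((c & 0x0400) != 0)
            return 3;                    /* isolated low surrogate */
        if (i + 1 == len)
            return 1;                    /* missing low surrogate */
        if ((s[i + 1] & 0xfc00) != 0xdc00)
            return 2;                    /* invalid low surrogate */
        i++;                             /* skip the low surrogate */
    }
    return 0;
}
```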
| |
| |
| Errors in UTF-32 strings |
| |
| The following negative error codes are given for invalid UTF-32 |
| strings: |
| |
| PCRE2_ERROR_UTF32_ERR1 Surrogate character (range from 0xd800 to 0xdfff) |
| PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff |
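| |
| A corresponding scan for 32-bit code units is sketched below. As |
| before, the function name and the return values 1 and 2 are |
| hypothetical and are not the numeric values of the PCRE2 error codes: |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a UTF-32 string in host byte order. Returns 0 if it is valid,
   1 for a surrogate code point, or 2 for a code point greater than
   0x10ffff. On error, *erroroffset is the offset of the bad unit. */
int utf32_check(const uint32_t *s, size_t len, size_t *erroroffset)
{
    size_t i;
    assert(s != NULL && erroroffset != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] >= 0xd800u && s[i] <= 0xdfffu) {
            *erroroffset = i;
            return 1;                 /* surrogate code point */
        }
        if (s[i] > 0x10ffffu) {
            *erroroffset = i;
            return 2;                 /* beyond the Unicode range */
        }
    }
    return 0;
}
```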
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |