| ----------------------------------------------------------------------------- |
| This file contains a concatenation of the PCRE2 man pages, converted to plain |
| text format for ease of searching with a text editor, or for use on systems |
| that do not have a man page processor. The small individual files that give |
| synopses of each function in the library have not been included. Neither has |
| the pcre2demo program. There are separate text files for the pcre2grep and |
| pcre2test commands. |
| ----------------------------------------------------------------------------- |
| |
| |
| PCRE2(3) Library Functions Manual PCRE2(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| INTRODUCTION |
| |
| PCRE2 is the name used for a revised API for the PCRE library, which is |
| a set of functions, written in C, that implement regular expression |
| pattern matching using the same syntax and semantics as Perl, with just |
| a few differences. Some features that appeared in Python and the origi- |
| nal PCRE before they appeared in Perl are also available using the |
| Python syntax. There is also some support for one or two .NET and Onig- |
| uruma syntax items, and there are options for requesting some minor |
| changes that give better ECMAScript (aka JavaScript) compatibility. |
| |
| The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or |
| 32-bit code units, which means that up to three separate libraries may |
| be installed. The original work to extend PCRE to 16-bit and 32-bit |
| code units was done by Zoltan Herczeg and Christian Persch, respec- |
| tively. In all three cases, strings can be interpreted either as one |
| character per code unit, or as UTF-encoded Unicode, with support for |
| Unicode general category properties. Unicode support is optional at |
| build time (but is the default). However, processing strings as UTF |
| code units must be enabled explicitly at run time. The version of Uni- |
| code in use can be discovered by running |
| |
| pcre2test -C |
| |
| The three libraries contain identical sets of functions, with names |
| ending in _8, _16, or _32, respectively (for example, pcre2_com- |
| pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or |
| 32, a program that uses just one code unit width can be written using |
| generic names such as pcre2_compile(), and the documentation is written |
| assuming that this is the case. |
| |
| In addition to the Perl-compatible matching function, PCRE2 contains an |
| alternative function that matches the same compiled patterns in a dif- |
| ferent way. In certain circumstances, the alternative function has some |
| advantages. For a discussion of the two matching algorithms, see the |
| pcre2matching page. |
| |
| Details of exactly which Perl regular expression features are and are |
| not supported by PCRE2 are given in separate documents. See the |
| pcre2pattern and pcre2compat pages. There is a syntax summary in the |
| pcre2syntax page. |
| |
| Some features of PCRE2 can be included, excluded, or changed when the |
| library is built. The pcre2_config() function makes it possible for a |
| client to discover which features are available. The features them- |
| selves are described in the pcre2build page. Documentation about build- |
| ing PCRE2 for various operating systems can be found in the README and |
| NON-AUTOTOOLS_BUILD files in the source distribution. |
| |
| The libraries contain a number of undocumented internal functions and |
| data tables that are used by more than one of the exported external |
| functions, but which are not intended for use by external callers. |
| Their names all begin with "_pcre2", which hopefully will not provoke |
| any name clashes. In some environments, it is possible to control which |
| external symbols are exported when a shared library is built, and in |
| these cases the undocumented symbols are not exported. |
| |
| |
| SECURITY CONSIDERATIONS |
| |
| If you are using PCRE2 in a non-UTF application that permits users to |
| supply arbitrary patterns for compilation, you should be aware of a |
| feature that allows users to turn on UTF support from within a pattern. |
| For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8 |
| mode, which interprets patterns and subjects as strings of UTF-8 code |
| units instead of individual 8-bit characters. This causes both the pat- |
| tern and any data against which it is matched to be checked for UTF-8 |
| validity. If the data string is very long, such a check might use |
| enough resources to degrade your application's performance. |
| |
| One way of guarding against this possibility is to use the pcre2_pat- |
| tern_info() function to check the compiled pattern's options for |
| PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when |
| calling pcre2_compile(). This causes a compile-time error if a pattern |
| contains a UTF-setting sequence. |
| |
| The use of Unicode properties for character types such as \d can also |
| be enabled from within the pattern, by specifying "(*UCP)". This fea- |
| ture can be disallowed by setting the PCRE2_NEVER_UCP option. |
| |
| If your application is one that supports UTF, be aware that validity |
| checking can take time. If the same data string is to be matched many |
| times, you can use the PCRE2_NO_UTF_CHECK option for the second and |
| subsequent matches to avoid running redundant checks. |
| |
| The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead |
| to problems, because it may leave the current matching point in the |
| middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C |
| option can be used by an application to lock out the use of \C, causing |
| a compile-time error if it is encountered. It is also possible to build |
| PCRE2 with the use of \C permanently disabled. |
| |
| Another way that performance can be hit is by running a pattern that |
| has a very large search tree against a string that will never match. |
| Nested unlimited repeats in a pattern are a common example. PCRE2 pro- |
| vides some protection against this: see the pcre2_set_match_limit() |
| function in the pcre2api page. |
| |
| |
| USER DOCUMENTATION |
| |
| The user documentation for PCRE2 comprises a number of different sec- |
| tions. In the "man" format, each of these is a separate "man page". In |
| the HTML format, each is a separate page, linked from the index page. |
| In the plain text format, the descriptions of the pcre2grep and |
| pcre2test programs are in files called pcre2grep.txt and pcre2test.txt, |
| respectively. The remaining sections, except for the pcre2demo section |
| (which is a program listing), and the short pages for individual func- |
| tions, are concatenated in pcre2.txt, for ease of searching. The sec- |
| tions are as follows: |
| |
| pcre2           this document |
| pcre2-config    show PCRE2 installation configuration information |
| pcre2api        details of PCRE2's native C API |
| pcre2build      building PCRE2 |
| pcre2callout    details of the callout feature |
| pcre2compat     discussion of Perl compatibility |
| pcre2demo       a demonstration C program that uses PCRE2 |
| pcre2grep       description of the pcre2grep command (8-bit only) |
| pcre2jit        discussion of just-in-time optimization support |
| pcre2limits     details of size and other limits |
| pcre2matching   discussion of the two matching algorithms |
| pcre2partial    details of the partial matching facility |
| pcre2pattern    syntax and semantics of supported regular |
|                 expression patterns |
| pcre2perform    discussion of performance issues |
| pcre2posix      the POSIX-compatible C API for the 8-bit library |
| pcre2sample     discussion of the pcre2demo program |
| pcre2stack      discussion of stack usage |
| pcre2syntax     quick syntax reference |
| pcre2test       description of the pcre2test command |
| pcre2unicode    discussion of Unicode and UTF support |
| |
| In the "man" and HTML formats, there is also a short page for each C |
| library function, listing its arguments and results. |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| Putting an actual email address here is a spam magnet. If you want to |
| email me, use my two initials, followed by the two digits 10, at the |
| domain cam.ac.uk. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2API(3) Library Functions Manual PCRE2API(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| #include <pcre2.h> |
| |
| PCRE2 is a new API for PCRE. This document contains a description of |
| all its functions. See the pcre2 document for an overview of all the |
| PCRE2 documentation. |
| |
| |
| PCRE2 NATIVE API BASIC FUNCTIONS |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| |
| PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| |
| PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const unsigned char *tables); |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| |
| PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| int pcre2_set_recursion_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| int pcre2_set_recursion_memory_management( |
| pcre2_match_context *mcontext, |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| |
| PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| |
| PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| |
| PCRE2 NATIVE API JIT FUNCTIONS |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| |
| PCRE2 NATIVE API SERIALIZATION FUNCTIONS |
| |
| int32_t pcre2_serialize_decode(pcre2_code **codes, |
| int32_t number_of_codes, const uint8_t *bytes, |
| pcre2_general_context *gcontext); |
| |
| int32_t pcre2_serialize_encode(const pcre2_code **codes, |
| int32_t number_of_codes, uint8_t **serialized_bytes, |
| PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); |
| |
| void pcre2_serialize_free(uint8_t *bytes); |
| |
| int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); |
| |
| |
| PCRE2 NATIVE API AUXILIARY FUNCTIONS |
| |
| int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE bufflen); |
| |
| const unsigned char *pcre2_maketables(pcre2_general_context *gcontext); |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where); |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| |
| PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES |
| |
| There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit |
| code units, respectively. However, there is just one header file, |
| pcre2.h. This contains the function prototypes and other definitions |
| for all three libraries. One, two, or all three can be installed simul- |
| taneously. On Unix-like systems the libraries are called libpcre2-8, |
| libpcre2-16, and libpcre2-32, and they can also co-exist with the orig- |
| inal PCRE libraries. |
| |
| Character strings are passed to and from a PCRE2 library as a sequence |
| of unsigned integers in code units of the appropriate width. Every |
| PCRE2 function comes in three different forms, one for each library, |
| for example: |
| |
| pcre2_compile_8() |
| pcre2_compile_16() |
| pcre2_compile_32() |
| |
| There are also three different sets of data types: |
| |
| PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32 |
| PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32 |
| |
| The UCHAR types define unsigned code units of the appropriate widths. |
| For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR |
| types are constant pointers to the equivalent UCHAR types, that is, |
| they are pointers to vectors of unsigned code units. |
| |
| Many applications use only one code unit width. For their convenience, |
| macros are defined whose names are the generic forms such as pcre2_com- |
| pile() and PCRE2_SPTR. These macros use the value of the macro |
| PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func- |
| tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default. |
| An application must define it to be 8, 16, or 32 before including |
| pcre2.h in order to make use of the generic names. |
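| |
| As an illustration only, the following sketch shows one way that |
| token pasting can turn a width setting into width-specific names. All |
| the DEMO_ and demo_ names are hypothetical, and the macros in pcre2.h |
| may well be arranged differently. |
| |

```c
#include <stdio.h>

/* Hypothetical stand-ins for three width-specific library functions. */
static int demo_unit_bits_8(void)  { return 8; }
static int demo_unit_bits_16(void) { return 16; }
static int demo_unit_bits_32(void) { return 32; }

/* The application chooses the width before the generic names are defined,
   in the same way that PCRE2_CODE_UNIT_WIDTH is defined before pcre2.h
   is included. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two levels of macro are needed so that DEMO_CODE_UNIT_WIDTH is expanded
   to its value before it is pasted on to the end of the name. */
#define DEMO_JOIN(a, b) a ## b
#define DEMO_GLUE(a, b) DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* The generic name: expands to demo_unit_bits_16 for the width above. */
#define demo_unit_bits DEMO_SUFFIX(demo_unit_bits_)

/* Print the result of each width-specific function and of the generic one. */
void demo_show_widths(void)
{
    printf("8-bit: %d, 16-bit: %d, 32-bit: %d, generic: %d\n",
           demo_unit_bits_8(), demo_unit_bits_16(), demo_unit_bits_32(),
           demo_unit_bits());
}
```

| |
| With DEMO_CODE_UNIT_WIDTH defined as 16, a call to demo_unit_bits() |
| compiles as a call to demo_unit_bits_16(). |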
| |
| Applications that use more than one code unit width can be linked with |
| more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to |
| be 0 before including pcre2.h, and then use the real function names. |
| Any code that is to be included in an environment where the value of |
| PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function |
| names. (Unfortunately, it is not possible in C code to save and restore |
| the value of a macro.) |
| |
| If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a |
| compiler error occurs. |
| |
| When using multiple libraries in an application, you must take care |
| when processing any particular pattern to use only functions from a |
| single library. For example, if you want to run a match using a pat- |
| tern that was compiled with pcre2_compile_16(), you must do so with |
| pcre2_match_16(), not pcre2_match_8(). |
| |
| In the function summaries above, and in the rest of this document and |
| other PCRE2 documents, functions and data types are described using |
| their generic names, without the 8, 16, or 32 suffix. |
| |
| |
| PCRE2 API OVERVIEW |
| |
| PCRE2 has its own native API, which is described in this document. |
| There are also some wrapper functions for the 8-bit library that corre- |
| spond to the POSIX regular expression API, but they do not give access |
| to all the functionality. They are described in the pcre2posix documen- |
| tation. Both these APIs define a set of C function calls. |
| |
| The native API C data types, function prototypes, option values, and |
| error codes are defined in the header file pcre2.h, which contains def- |
| initions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release |
| numbers for the library. Applications can use these to include support |
| for different releases of PCRE2. |
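| |
| One way of using these numbers is in a preprocessor test, as in the |
| sketch below. DEMO_PCRE2_AT_LEAST and demo_feature_note() are hypo- |
| thetical names, and the fallback values 10 and 20 are stand-ins that |
| let the sketch compile without pcre2.h; in a real program the values |
| come from the header. |
| |

```c
/* Stand-in values, used only when pcre2.h has not been included.
   They are assumptions made for this sketch. */
#ifndef PCRE2_MAJOR
#define PCRE2_MAJOR 10
#define PCRE2_MINOR 20
#endif

/* True if the release being compiled against is at least major.minor. */
#define DEMO_PCRE2_AT_LEAST(major, minor) \
    (PCRE2_MAJOR > (major) || \
     (PCRE2_MAJOR == (major) && PCRE2_MINOR >= (minor)))

/* Select code at compile time according to the release numbers. */
const char *demo_feature_note(void)
{
#if DEMO_PCRE2_AT_LEAST(10, 20)
    return "compiled against PCRE2 10.20 or later";
#else
    return "compiled against a PCRE2 release before 10.20";
#endif
}
```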
| |
| In a Windows environment, if you want to statically link an application |
| program against a non-dll PCRE2 library, you must define PCRE2_STATIC |
| before including pcre2.h. |
| |
| The functions pcre2_compile() and pcre2_match() are used for compiling |
| and matching regular expressions in a Perl-compatible manner. A sample |
| program that demonstrates the simplest way of using them is provided in |
| the file called pcre2demo.c in the PCRE2 source distribution. A listing |
| of this program is given in the pcre2demo documentation, and the |
| pcre2sample documentation describes how to compile and run it. |
| |
| Just-in-time compiler support is an optional feature of PCRE2 that can |
| be built in appropriate hardware environments. It greatly speeds up the |
| matching performance of many patterns. Programs can request that it be |
| used if available, by calling pcre2_jit_compile() after a pattern has |
| been successfully compiled by pcre2_compile(). This does nothing if JIT |
| support is not available. |
| |
| More complicated programs might need to make use of the specialist |
| functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and |
| pcre2_jit_stack_assign() in order to control the JIT code's memory |
| usage. |
| |
| JIT matching is automatically used by pcre2_match() if it is available. |
| There is also a direct interface for JIT matching, which gives improved |
| performance. The JIT-specific functions are discussed in the pcre2jit |
| documentation. |
| |
| A second matching function, pcre2_dfa_match(), which is not Perl-com- |
| patible, is also provided. This uses a different algorithm for the |
| matching. The alternative algorithm finds all possible matches (at a |
| given point in the subject), and scans the subject just once (unless |
| there are lookbehind assertions). However, this algorithm does not |
| return captured substrings. A description of the two matching algo- |
| rithms and their advantages and disadvantages is given in the |
| pcre2matching documentation. There is no JIT support for |
| pcre2_dfa_match(). |
| |
| In addition to the main compiling and matching functions, there are |
| convenience functions for extracting captured substrings from a subject |
| string that has been matched by pcre2_match(). They are: |
| |
| pcre2_substring_copy_byname() |
| pcre2_substring_copy_bynumber() |
| pcre2_substring_get_byname() |
| pcre2_substring_get_bynumber() |
| pcre2_substring_list_get() |
| pcre2_substring_length_byname() |
| pcre2_substring_length_bynumber() |
| pcre2_substring_nametable_scan() |
| pcre2_substring_number_from_name() |
| |
| pcre2_substring_free() and pcre2_substring_list_free() are also pro- |
| vided, to free the memory used for extracted strings. |
| |
| The function pcre2_substitute() can be called to match a pattern and |
| return a copy of the subject string with substitutions for parts that |
| were matched. |
| |
| Finally, there are functions for finding out information about a com- |
| piled pattern (pcre2_pattern_info()) and about the configuration with |
| which PCRE2 was built (pcre2_config()). |
| |
| |
| STRING LENGTHS AND OFFSETS |
| |
| The PCRE2 API uses string lengths and offsets into strings of code |
| units in several places. These values are always of type PCRE2_SIZE, |
| which is an unsigned integer type, currently always defined as size_t. |
| The largest value that can be stored in such a type (that is |
| ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated |
| strings and unset offsets. Therefore, the longest string that can be |
| handled is one less than this maximum. |
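| |
| The following sketch, which is not PCRE2 code, shows how a function |
| might interpret a length argument under this convention. DEMO_SIZE, |
| DEMO_ZERO_TERMINATED, and demo_resolve_length() are hypothetical names, |
| and the typedef assumes that PCRE2_SIZE is size_t, as stated above. |
| |

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCRE2_SIZE (assumed here to be size_t). */
typedef size_t DEMO_SIZE;

/* The largest value of the type, reserved as a special indicator. */
#define DEMO_ZERO_TERMINATED (~(DEMO_SIZE)0)

/* Return the length, in code units, of an 8-bit subject string. If the
   caller passes the reserved value, the string is taken to be
   zero-terminated and its length is measured. */
DEMO_SIZE demo_resolve_length(const unsigned char *subject, DEMO_SIZE length)
{
    if (length == DEMO_ZERO_TERMINATED)
        return (DEMO_SIZE)strlen((const char *)subject);
    return length;
}
```

| |
| Because the reserved value is also the largest possible length, a |
| string of exactly that many code units cannot be described, which is |
| why the limit is one less than the maximum. |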
| |
| |
| NEWLINES |
| |
| PCRE2 supports five different conventions for indicating line breaks in |
| strings: a single CR (carriage return) character, a single LF (line- |
| feed) character, the two-character sequence CRLF, any of the three pre- |
| ceding, or any Unicode newline sequence. The Unicode newline sequences |
| are the three just mentioned, plus the single characters VT (vertical |
| tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line |
| separator, U+2028), and PS (paragraph separator, U+2029). |
| |
| Each of the first three conventions is used by at least one operating |
| system as its standard newline sequence. When PCRE2 is built, a default |
| can be specified; if none is, the default is LF, which is the Unix |
| standard. However, the newline convention can be changed by an |
| application when calling pcre2_compile(), or it can be specified by |
| special text at the start of the pattern itself; this overrides any |
| other settings. See the pcre2pattern page for details of the special |
| character sequences. |
| |
| In the PCRE2 documentation the word "newline" is used to mean "the |
| character or pair of characters that indicate a line break". The choice |
| of newline convention affects the handling of the dot, circumflex, and |
| dollar metacharacters, the handling of #-comments in /x mode, and, when |
| CRLF is a recognized line ending sequence, the match position advance- |
| ment for a non-anchored pattern. There is more detail about this in the |
| section on pcre2_match() options below. |
| |
| The choice of newline convention does not affect the interpretation of |
| the \n or \r escape sequences, nor does it affect what \R matches; this |
| has its own separate convention. |
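| |
| The five conventions can be sketched as a function that reports the |
| length of the newline, if any, at a given position. This is an illus- |
| tration with hypothetical names (demo_newline and demo_new- |
| line_length()), not PCRE2's implementation, and it works on an array |
| of code points so as to avoid UTF decoding. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the five newline conventions. */
typedef enum {
    DEMO_NL_CR,       /* a single CR */
    DEMO_NL_LF,       /* a single LF */
    DEMO_NL_CRLF,     /* the two-character sequence CRLF */
    DEMO_NL_ANYCRLF,  /* any of the three above */
    DEMO_NL_ANY       /* any Unicode newline sequence */
} demo_newline;

/* Return the number of characters in the newline that starts at s[pos]
   under the given convention, or 0 if there is no newline there. */
size_t demo_newline_length(const uint32_t *s, size_t len, size_t pos,
                           demo_newline conv)
{
    if (pos >= len) return 0;

    uint32_t c = s[pos];
    int crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (conv) {
    case DEMO_NL_CR:
        return c == 0x0D ? 1 : 0;
    case DEMO_NL_LF:
        return c == 0x0A ? 1 : 0;
    case DEMO_NL_CRLF:
        return crlf ? 2 : 0;
    case DEMO_NL_ANYCRLF:
        if (crlf) return 2;
        return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case DEMO_NL_ANY:
        if (crlf) return 2;
        switch (c) {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:     /* LF, VT, FF, CR */
        case 0x85: case 0x2028: case 0x2029:            /* NEL, LS, PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```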
| |
| |
| MULTITHREADING |
| |
| In a multithreaded application it is important to keep thread-specific |
| data separate from data that can be shared between threads. The PCRE2 |
| library code itself is thread-safe: it contains no static or global |
| variables. The API is designed to be fairly simple for non-threaded |
| applications while at the same time ensuring that multithreaded appli- |
| cations can use it. |
| |
| There are several different blocks of data that are used to pass infor- |
| mation between the application and the PCRE2 libraries. |
| |
| (1) A pointer to the compiled form of a pattern is returned to the user |
| when pcre2_compile() is successful. The data in the compiled pattern is |
| fixed, and does not change when the pattern is matched. Therefore, it |
| is thread-safe, that is, the same compiled pattern can be used by more |
| than one thread simultaneously. An application can compile all its pat- |
| terns at the start, before forking off multiple threads that use them. |
| However, if the just-in-time optimization feature is being used, it |
| needs separate memory stack areas for each thread. See the pcre2jit |
| documentation for more details. |
| |
| (2) The next section below introduces the idea of "contexts" in which |
| PCRE2 functions are called. A context is nothing more than a collection |
| of parameters that control the way PCRE2 operates. Grouping a number of |
| parameters together in a context is a convenient way of passing them to |
| a PCRE2 function without using lots of arguments. The parameters that |
| are stored in contexts are in some sense "advanced features" of the |
| API. Many straightforward applications will not need to use contexts. |
| |
| In a multithreaded application, if the parameters in a context are val- |
| ues that are never changed, the same context can be used by all the |
| threads. However, if any thread needs to change any value in a context, |
| it must make its own thread-specific copy. |
| |
| (3) The matching functions need a block of memory for working space and |
| for storing the results of a match. This includes details of what was |
| matched, as well as additional information such as the name of a |
| (*MARK) setting. Each thread must provide its own version of this mem- |
| ory. |
| |
| |
| PCRE2 CONTEXTS |
| |
| Some PCRE2 functions have a lot of parameters, many of which are used |
| only by specialist applications, for example, those that use custom |
| memory management or non-standard character tables. To keep function |
| argument lists at a reasonable size, and at the same time to keep the |
| API extensible, "uncommon" parameters are passed to certain functions |
| in a context instead of directly. A context is just a block of memory |
| that holds the parameter values. Applications that do not need to |
| adjust any of the context parameters can pass NULL when a context |
| pointer is required. |
| |
| There are three different types of context: a general context that is |
| relevant for several PCRE2 operations, a compile-time context, and a |
| match-time context. |
| |
| The general context |
| |
| At present, this context just contains pointers to (and data for) |
| external memory management functions that are called from several |
| places in the PCRE2 library. The context is named `general' rather than |
| specifically `memory' because in future other fields may be added. If |
| you do not want to supply your own custom memory management functions, |
| you do not need to bother with a general context. A general context is |
| created by: |
| |
| pcre2_general_context *pcre2_general_context_create( |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| The two function pointers specify custom memory management functions, |
| whose prototypes are: |
| |
| void *private_malloc(PCRE2_SIZE, void *); |
| void private_free(void *, void *); |
| |
| Whenever code in PCRE2 calls these functions, the final argument is the |
| value of memory_data. Either of the first two arguments of the creation |
| function may be NULL, in which case the system memory management func- |
| tions malloc() and free() are used. (This is not currently useful, as |
| there are no other fields in a general context, but in future there |
| might be.) The private_malloc() function is used (if supplied) to |
| obtain memory for storing the context, and all three values are saved |
| as part of the context. |
| |
| Whenever PCRE2 creates a data block of any kind, the block contains a |
| pointer to the free() function that matches the malloc() function that |
| was used. When the time comes to free the block, this function is |
| called. |
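| |
| As a sketch, here is a pair of functions with the prototypes shown |
| above that count allocations and frees. The names demo_mem_stats, |
| demo_malloc(), and demo_free() are hypothetical, and size_t stands in |
| for PCRE2_SIZE so that the code compiles without pcre2.h. |
| |

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping block. A pointer to one of these plays the
   role of memory_data. */
typedef struct {
    size_t allocations;
    size_t frees;
} demo_mem_stats;

/* Same shape as: void *private_malloc(PCRE2_SIZE, void *); */
void *demo_malloc(size_t size, void *memory_data)
{
    demo_mem_stats *stats = memory_data;
    void *block = malloc(size);
    if (block != NULL && stats != NULL) stats->allocations++;
    return block;
}

/* Same shape as: void private_free(void *, void *); */
void demo_free(void *block, void *memory_data)
{
    demo_mem_stats *stats = memory_data;
    if (block == NULL) return;          /* nothing to free or count */
    if (stats != NULL) stats->frees++;
    free(block);
}
```

| |
| In a program that includes pcre2.h, these two functions and the |
| address of a demo_mem_stats block could be passed as the three argu- |
| ments of pcre2_general_context_create(). |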
| |
| A general context can be copied by calling: |
| |
| pcre2_general_context *pcre2_general_context_copy( |
| pcre2_general_context *gcontext); |
| |
| The memory used for a general context should be freed by calling: |
| |
| void pcre2_general_context_free(pcre2_general_context *gcontext); |
| |
| |
| The compile context |
| |
| A compile context is required if you want to change the default values |
| of any of the following compile-time parameters: |
| |
| What \R matches (Unicode newlines or CR, LF, CRLF only) |
| PCRE2's character tables |
| The newline character sequence |
| The compile time nested parentheses limit |
| The maximum length of the pattern string |
| An external function for stack checking |
| |
| A compile context is also required if you are using custom memory man- |
| agement. If none of these apply, just pass NULL as the context argu- |
| ment of pcre2_compile(). |
| |
| A compile context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_compile_context *pcre2_compile_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_compile_context *pcre2_compile_context_copy( |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_compile_context_free(pcre2_compile_context *ccontext); |
| |
| A compile context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_bsr(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only |
| CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any |
| Unicode line ending sequence. The value is used by the JIT compiler and |
| by the two interpreted matching functions, pcre2_match() and |
| pcre2_dfa_match(). |
| |
| int pcre2_set_character_tables(pcre2_compile_context *ccontext, |
| const unsigned char *tables); |
| |
| The value must be the result of a call to pcre2_maketables(), whose |
| only argument is a general context. This function builds a set of char- |
| acter tables in the current locale. |
| |
| int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, |
| PCRE2_SIZE value); |
| |
| This sets a maximum length, in code units, for the pattern string that |
| is to be compiled. If the pattern is longer, an error is generated. |
| This facility is provided so that applications that accept patterns |
| from external sources can limit their size. The default is the largest |
| number that a PCRE2_SIZE variable can hold, which is effectively unlim- |
| ited. |
| |
| int pcre2_set_newline(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This specifies which characters or character sequences are to be recog- |
| nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage |
| return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the |
| two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any |
| of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence). |
| |
| When a pattern is compiled with the PCRE2_EXTENDED option, the value of |
| this parameter affects the recognition of white space and the end of |
| internal comments starting with #. The value is saved with the compiled |
| pattern for subsequent use by the JIT compiler and by the two inter- |
| preted matching functions, pcre2_match() and pcre2_dfa_match(). |
| |
| int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, |
| uint32_t value); |
| |
| This parameter adjusts the limit, set when PCRE2 is built (default 250), |
| on the depth of parenthesis nesting in a pattern. This limit stops |
| rogue patterns using up too much system stack when being compiled. |
| |
| int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, |
| int (*guard_function)(uint32_t, void *), void *user_data); |
| |
| There is at least one application that runs PCRE2 in threads with very |
| limited system stack, where running out of stack is to be avoided at |
| all costs. The parenthesis limit above cannot take account of how much |
| stack is actually available. For a finer control, you can supply a |
| function that is called whenever pcre2_compile() starts to compile a |
| parenthesized part of a pattern. This function can check the actual |
| stack size (or anything else that it wants to, of course). |
| |
| The first argument to the guard function gives the current depth of |
| nesting, and the second is user data that is set up by the last argu- |
| ment of pcre2_set_compile_recursion_guard(). The guard function should |
| return zero if all is well, or non-zero to force an error. |
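| |
| As an illustration only, a guard function could look like the sketch |
| below. The names guard_config and depth_guard are hypothetical and are |
| not part of PCRE2; the sketch merely matches the function-pointer type |
| given above, and it checks a caller-chosen depth instead of measuring |
| the real stack. |
| |

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data for the guard: the deepest nesting we allow. */
typedef struct guard_config {
  uint32_t max_depth;
} guard_config;

/* Same shape as the guard_function parameter shown above: returns zero
   to let compilation continue, non-zero to force an error. */
int depth_guard(uint32_t depth, void *user_data)
{
  const guard_config *cfg = (const guard_config *)user_data;
  if (cfg == NULL) return 0;          /* no configuration: allow */
  return depth > cfg->max_depth;      /* 1 when too deep, else 0 */
}
```

| |
| Such a function would be registered by passing depth_guard and a |
| pointer to a guard_config as the last two arguments of |
| pcre2_set_compile_recursion_guard(). The guard_config object has to |
| remain valid for every pcre2_compile() call that uses the context. |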
| |
| The match context |
| |
| A match context is required if you want to change the default values of |
| any of the following match-time parameters: |
| |
| A callout function |
| The offset limit for matching an unanchored pattern |
| The limit for calling match() (see below) |
| The limit for calling match() recursively |
| |
| A match context is also required if you are using custom memory manage- |
| ment. If none of these apply, just pass NULL as the context argument |
| of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match(). |
| |
| A match context is created, copied, and freed by the following func- |
| tions: |
| |
| pcre2_match_context *pcre2_match_context_create( |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_context *pcre2_match_context_copy( |
| pcre2_match_context *mcontext); |
| |
| void pcre2_match_context_free(pcre2_match_context *mcontext); |
| |
| A match context is created with default values for its parameters. |
| These can be changed by calling the following functions, which return 0 |
| on success, or PCRE2_ERROR_BADDATA if invalid data is detected. |
| |
| int pcre2_set_callout(pcre2_match_context *mcontext, |
| int (*callout_function)(pcre2_callout_block *, void *), |
| void *callout_data); |
| |
| This sets up a "callout" function, which PCRE2 will call at specified |
| points during a matching operation. Details are given in the pcre2call- |
| out documentation. |
| |
| int pcre2_set_offset_limit(pcre2_match_context *mcontext, |
| PCRE2_SIZE value); |
| |
| The offset_limit parameter limits how far an unanchored search can |
| advance in the subject string. The default value is PCRE2_UNSET. The |
| pcre2_match() and pcre2_dfa_match() functions return |
| PCRE2_ERROR_NOMATCH if a match with a starting point before or at the |
| given offset is not found. For example, if the pattern /abc/ is matched |
| against "123abc" with an offset limit less than 3, the result is |
| PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset |
| argument of pcre2_match() or pcre2_dfa_match() is greater than the off- |
| set limit. |
| |
| When using this facility, you must set PCRE2_USE_OFFSET_LIMIT when |
| calling pcre2_compile() so that when JIT is in use, different code can |
| be compiled. If a match is started with a non-default offset limit when |
| PCRE2_USE_OFFSET_LIMIT is not set, an error is generated. |
| |
| The offset limit facility can be used to track progress when searching |
| large subject strings. See also the PCRE2_FIRSTLINE option, which |
| requires a match to start within the first line of the subject. If this |
| is set with an offset limit, a match must occur in the first line and |
| also within the offset limit. In other words, whichever limit comes |
| first is used. |
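| |
| The effect of the limit can be modelled without PCRE2. The following |
| sketch is not library code: it searches for a literal string instead |
| of a pattern, and the names find_with_offset_limit, NO_LIMIT, and |
| NOT_FOUND are made up for the illustration. It only shows which |
| starting positions the limit allows. |
| |

```c
#include <stddef.h>
#include <string.h>

#define NO_LIMIT  ((size_t)-1)   /* stand-in for PCRE2_UNSET in this model */
#define NOT_FOUND ((size_t)-1)   /* stand-in for "no match" */

/* Model: a match may start at any position from startoffset up to and
   including offset_limit, but never beyond the limit. */
size_t find_with_offset_limit(const char *subject, const char *needle,
                              size_t startoffset, size_t offset_limit)
{
  size_t slen = strlen(subject);
  size_t nlen = strlen(needle);
  for (size_t pos = startoffset; pos + nlen <= slen; pos++) {
    if (offset_limit != NO_LIMIT && pos > offset_limit) break;
    if (memcmp(subject + pos, needle, nlen) == 0) return pos;
  }
  return NOT_FOUND;
}
```

| |
| With the subject "123abc" and the needle "abc", a limit of 2 gives |
| NOT_FOUND, while a limit of 3 finds the match at offset 3. |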
| |
| int pcre2_set_match_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The match_limit parameter provides a means of preventing PCRE2 from |
| using up too many resources when processing patterns that are not going |
| to match, but which have a very large number of possibilities in their |
| search trees. The classic example is a pattern that uses nested unlim- |
| ited repeats. |
| |
| Internally, pcre2_match() uses a function called match(), which it |
| calls repeatedly (sometimes recursively). The limit set by match_limit |
| is imposed on the number of times this function is called during a |
| match, which has the effect of limiting the amount of backtracking that |
| can take place. For patterns that are not anchored, the count restarts |
| from zero for each position in the subject string. This limit is not |
| relevant to pcre2_dfa_match(), which ignores it. |
| |
| When pcre2_match() is called with a pattern that was successfully pro- |
| cessed by pcre2_jit_compile(), the way in which matching is executed is |
| entirely different. However, there is still the possibility of runaway |
| matching that goes on for a very long time, and so the match_limit |
| value is also used in this case (but in a different way) to limit how |
| long the matching can continue. |
| |
| The default value for the limit can be set when PCRE2 is built; if it |
| is not changed, it is 10 million, which handles all but the most extreme |
| cases. If the limit is exceeded, pcre2_match() returns |
| PCRE2_ERROR_MATCHLIMIT. A value for the match limit may also be sup- |
| plied by an item at the start of a pattern of the form |
| |
| (*LIMIT_MATCH=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored |
| unless ddd is less than the limit set by the caller of pcre2_match() |
| or, if no such limit is set, less than the default. |
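| |
| Put another way, a limit given in the pattern can lower the limit that |
| is in force, but never raise it. A minimal model of that rule follows; |
| effective_match_limit is a hypothetical name, and limit_in_force |
| stands for the caller's limit if one was set, or the default if not. |
| |

```c
#include <stdint.h>

/* Model of the rule: an in-pattern (*LIMIT_MATCH=ddd) value is used only
   when it is smaller than the limit already in force. */
uint32_t effective_match_limit(uint32_t limit_in_force,
                               uint32_t pattern_limit)
{
  return (pattern_limit < limit_in_force) ? pattern_limit : limit_in_force;
}
```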
| |
| int pcre2_set_recursion_limit(pcre2_match_context *mcontext, |
| uint32_t value); |
| |
| The recursion_limit parameter is similar to match_limit, but instead of |
| limiting the total number of times that match() is called, it limits |
| the depth of recursion. The recursion depth is a smaller number than |
| the total number of calls, because not all calls to match() are recur- |
| sive. This limit is of use only if it is set smaller than match_limit. |
| |
| Limiting the recursion depth limits the amount of system stack that can |
| be used, or, when PCRE2 has been compiled to use memory on the heap |
| instead of the stack, the amount of heap memory that can be used. This |
| limit is not relevant, and is ignored, when matching is done using JIT |
| compiled code or by the pcre2_dfa_match() function. |
| |
| The default value for recursion_limit can be set when PCRE2 is built; |
| if it is not changed, it is the same value as the default for match_limit. |
| If the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION- |
| LIMIT. A value for the recursion limit may also be supplied by an item |
| at the start of a pattern of the form |
| |
| (*LIMIT_RECURSION=ddd) |
| |
| where ddd is a decimal number. However, such a setting is ignored |
| unless ddd is less than the limit set by the caller of pcre2_match() |
| or, if no such limit is set, less than the default. |
| |
| int pcre2_set_recursion_memory_management( |
| pcre2_match_context *mcontext, |
| void *(*private_malloc)(PCRE2_SIZE, void *), |
| void (*private_free)(void *, void *), void *memory_data); |
| |
| This function sets up two additional custom memory management functions |
| for use by pcre2_match() when PCRE2 is compiled to use the heap for |
| remembering backtracking data, instead of recursive function calls that |
| use the system stack. There is a discussion about PCRE2's stack usage |
| in the pcre2stack documentation. See the pcre2build documentation for |
| details of how to build PCRE2. |
| |
| Using the heap for recursion is a non-standard way of building PCRE2, |
| for use in environments that have limited stacks. Because of the |
| greater use of memory management, pcre2_match() runs more slowly. Func- |
| tions that are different to the general custom memory functions are |
| provided so that special-purpose external code can be used for this |
| case, because the memory blocks are all the same size. The blocks are |
| retained by pcre2_match() until it is about to exit so that they can be |
| re-used when possible during the match. In the absence of these func- |
| tions, the normal custom memory management functions are used, if sup- |
| plied, otherwise the system functions. |
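| |
| Because the blocks are all the same size, a simple free list is one |
| possible special-purpose scheme. The sketch below is an assumption- |
| laden illustration, not PCRE2 code: block_pool, pool_malloc, |
| pool_free, and pool_destroy are hypothetical names, and PCRE2_SIZE is |
| assumed to be size_t so that the sketch compiles without pcre2.h. |
| |

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool: freed blocks are kept on a list for re-use. */
typedef struct block_pool {
  size_t block_size;   /* capacity of every block; 0 until first request */
  void  *free_list;    /* singly linked list threaded through free blocks */
} block_pool;

/* Candidate for private_malloc. The first request fixes the block size;
   a later, larger request breaks the "same size" assumption and fails. */
void *pool_malloc(size_t size, void *memory_data)
{
  block_pool *pool = (block_pool *)memory_data;
  if (size < sizeof(void *)) size = sizeof(void *);   /* room for a link */
  if (pool->block_size == 0) pool->block_size = size;
  if (size > pool->block_size) return NULL;
  if (pool->free_list != NULL) {
    void *block = pool->free_list;
    pool->free_list = *(void **)block;                /* pop */
    return block;
  }
  return malloc(pool->block_size);
}

/* Candidate for private_free: the block goes back to the pool. */
void pool_free(void *block, void *memory_data)
{
  block_pool *pool = (block_pool *)memory_data;
  if (block == NULL) return;
  *(void **)block = pool->free_list;                  /* push */
  pool->free_list = block;
}

/* Returns all pooled blocks to the system once matching is finished. */
void pool_destroy(block_pool *pool)
{
  while (pool->free_list != NULL) {
    void *next = *(void **)pool->free_list;
    free(pool->free_list);
    pool->free_list = next;
  }
}
```

| |
| A zero-initialized block_pool would be passed as memory_data, with |
| pool_malloc and pool_free as the two function arguments. |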
| |
| |
| CHECKING BUILD-TIME OPTIONS |
| |
| int pcre2_config(uint32_t what, void *where); |
| |
| The function pcre2_config() makes it possible for a PCRE2 client to |
| discover which optional features have been compiled into the PCRE2 |
| library. The pcre2build documentation has more details about these |
| optional features. |
| |
| The first argument for pcre2_config() specifies which information is |
| required. The second argument is a pointer to memory into which the |
| information is placed. If NULL is passed, the function returns the |
| amount of memory that is needed for the requested information. For |
| calls that return numerical values, the value is in bytes; when |
| requesting these values, where should point to appropriately aligned |
| memory. For calls that return strings, the required length is given in |
| code units, not counting the terminating zero. |
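| |
| The two-call idiom for string values might be sketched as follows. |
| This is an illustration under stated assumptions: it targets the 8-bit |
| library, where one code unit is one byte; config_string and |
| fake_config are hypothetical names; and the function pointer exists |
| only so that the sketch can be compiled and run without PCRE2. |
| |

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as pcre2_config(). */
typedef int (*config_fn)(uint32_t what, void *where);

/* Returns a malloc'd, zero-terminated string, or NULL on any failure. */
char *config_string(config_fn config, uint32_t what)
{
  int needed = config(what, NULL);            /* first call: ask for size */
  if (needed < 0) return NULL;                /* request not recognized */
  char *buffer = calloc((size_t)needed + 1, 1);   /* one spare unit */
  if (buffer == NULL) return NULL;
  if (config(what, buffer) < 0) {             /* second call: fetch it */
    free(buffer);
    return NULL;
  }
  return buffer;
}

/* Stand-in for pcre2_config(): it knows one made-up request code (0)
   and reports a made-up string. */
int fake_config(uint32_t what, void *where)
{
  static const char text[] = "0.0 stand-in";
  if (what != 0) return -1;                   /* plays the error return */
  if (where != NULL) memcpy(where, text, sizeof text);
  return (int)sizeof text;                    /* length plus the zero */
}
```

| |
| With the real library, pcre2_config and a request code such as |
| PCRE2_CONFIG_VERSION would take the place of fake_config and 0. The |
| caller frees the returned string. |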
| |
| When requesting information, the returned value from pcre2_config() is |
| non-negative on success, or the negative error code PCRE2_ERROR_BADOP- |
| TION if the value in the first argument is not recognized. The follow- |
| ing information is available: |
| |
| PCRE2_CONFIG_BSR |
| |
| The output is a uint32_t integer whose value indicates what character |
| sequences the \R escape sequence matches by default. A value of |
| PCRE2_BSR_UNICODE means that \R matches any Unicode line ending |
| sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, |
| LF, or CRLF. The default can be overridden when a pattern is compiled. |
| |
| PCRE2_CONFIG_JIT |
| |
| The output is a uint32_t integer that is set to one if support for |
| just-in-time compiling is available; otherwise it is set to zero. |
| |
| PCRE2_CONFIG_JITTARGET |
| |
| The where argument should point to a buffer that is at least 48 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with a |
| string that contains the name of the architecture for which the JIT |
| compiler is configured, for example "x86 32bit (little endian + |
| unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is |
| returned, otherwise the number of code units used is returned. This is |
| the length of the string, plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_LINKSIZE |
| |
| The output is a uint32_t integer that contains the number of bytes used |
| for internal linkage in compiled regular expressions. When PCRE2 is |
| configured, the value can be set to 2, 3, or 4, with the default being |
| 2. This is the value that is returned by pcre2_config(). However, when |
| the 16-bit library is compiled, a value of 3 is rounded up to 4, and |
| when the 32-bit library is compiled, internal linkages always use 4 |
| bytes, so the configured value is not relevant. |
| |
| The default value of 2 for the 8-bit and 16-bit libraries is sufficient |
| for all but the most massive patterns, since it allows the size of the |
| compiled pattern to be up to 64K code units. Larger values allow larger |
| regular expressions to be compiled by those two libraries, but at the |
| expense of slower matching. |
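| |
| The relationship between the configured value and the link size that |
| is actually used can be summarized by a small model. The function name |
| effective_link_bytes is hypothetical; the logic only restates the two |
| paragraphs above. |
| |

```c
#include <stdint.h>

/* configured is the build-time value (2, 3, or 4); width is the code
   unit width of the library in bits (8, 16, or 32). Returns the number
   of bytes used for internal links according to the text above. */
uint32_t effective_link_bytes(uint32_t configured, uint32_t width)
{
  if (width == 32) return 4;                      /* always 4 bytes */
  if (width == 16 && configured == 3) return 4;   /* 3 is rounded up */
  return configured;
}
```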
| |
| PCRE2_CONFIG_MATCHLIMIT |
| |
| The output is a uint32_t integer that gives the default limit for the |
| number of internal matching function calls in a pcre2_match() execu- |
| tion. Further details are given with pcre2_match() below. |
| |
| PCRE2_CONFIG_NEWLINE |
| |
| The output is a uint32_t integer whose value specifies the default |
| character sequence that is recognized as meaning "newline". The values |
| are: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| |
| The default should normally correspond to the standard sequence for |
| your operating system. |
| |
| PCRE2_CONFIG_PARENSLIMIT |
| |
| The output is a uint32_t integer that gives the maximum depth of nest- |
| ing of parentheses (of any kind) in a pattern. This limit is imposed to |
| cap the amount of system stack used when a pattern is compiled. It is |
| specified when PCRE2 is built; the default is 250. This limit does not |
| take into account the stack that may already be used by the calling |
| application. For finer control over compilation stack usage, see |
| pcre2_set_compile_recursion_guard(). |
| |
| PCRE2_CONFIG_RECURSIONLIMIT |
| |
| The output is a uint32_t integer that gives the default limit for the |
| depth of recursion when calling the internal matching function in a |
| pcre2_match() execution. Further details are given with pcre2_match() |
| below. |
| |
| PCRE2_CONFIG_STACKRECURSE |
| |
| The output is a uint32_t integer that is set to one if internal recur- |
| sion when running pcre2_match() is implemented by recursive function |
| calls that use the system stack to remember their state. This is the |
| usual way that PCRE2 is compiled. The output is zero if PCRE2 was com- |
| piled to use blocks of data on the heap instead of recursive function |
| calls. |
| |
| PCRE2_CONFIG_UNICODE_VERSION |
| |
| The where argument should point to a buffer that is at least 24 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) If PCRE2 has been compiled |
| without Unicode support, the buffer is filled with the text "Unicode |
| not supported". Otherwise, the Unicode version string (for example, |
| "8.0.0") is inserted. The number of code units used is returned. This |
| is the length of the string plus one unit for the terminating zero. |
| |
| PCRE2_CONFIG_UNICODE |
| |
| The output is a uint32_t integer that is set to one if Unicode support |
| is available; otherwise it is set to zero. Unicode support implies UTF |
| support. |
| |
| PCRE2_CONFIG_VERSION |
| |
| The where argument should point to a buffer that is at least 12 code |
| units long. (The exact length required can be found by calling |
| pcre2_config() with where set to NULL.) The buffer is filled with the |
| PCRE2 version string, zero-terminated. The number of code units used is |
| returned. This is the length of the string plus one unit for the termi- |
| nating zero. |
| |
| |
| COMPILING A PATTERN |
| |
| pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, |
| uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, |
| pcre2_compile_context *ccontext); |
| |
| void pcre2_code_free(pcre2_code *code); |
| |
| The pcre2_compile() function compiles a pattern into an internal form. |
| The pattern is defined by a pointer to a string of code units and a |
| length. If the pattern is zero-terminated, the length can be specified |
| as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of |
| memory that contains the compiled pattern and related data. The caller |
| must free the memory by calling pcre2_code_free() when it is no longer |
| needed. |
| |
| NOTE: When one of the matching functions is called, pointers to the |
| compiled pattern and the subject string are set in the match data block |
| so that they can be referenced by the extraction functions. After run- |
| ning a match, you must not free a compiled pattern (or a subject |
| string) until after all operations on the match data block have taken |
| place. |
| |
| If the compile context argument ccontext is NULL, memory for the com- |
| piled pattern is obtained by calling malloc(). Otherwise, it is |
| obtained from the same memory function that was used for the compile |
| context. |
| |
| The options argument contains various bit settings that affect the com- |
| pilation. It should be zero if no options are required. The available |
| options are described below. Some of them (in particular, those that |
| are compatible with Perl, but some others as well) can also be set and |
| unset from within the pattern (see the detailed description in the |
| pcre2pattern documentation). |
| |
| For those options that can be different in different parts of the pat- |
| tern, the contents of the options argument specifies their settings at |
| the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK |
| options can be set at the time of matching as well as at compile time. |
| |
| Other, less frequently required compile-time parameters (for example, |
| the newline setting) can be provided in a compile context (as described |
| above). |
| |
| If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- |
| diately. Otherwise, if compilation of a pattern fails, pcre2_compile() |
| returns NULL, having set these variables to an error code and an offset |
| (number of code units) within the pattern, respectively. The |
| pcre2_get_error_message() function provides a textual message for each |
| error code. Compilation errors are positive numbers, but UTF formatting |
| errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the |
| offset is that of the first code unit of the failing character. |
| |
| Some errors are not detected until the whole pattern has been scanned; |
| in these cases, the offset passed back is the length of the pattern. |
| Note that the offset is in code units, not characters, even in a UTF |
| mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- |
| acter. |
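| |
| An application that wants to show the error position to a user may |
| therefore need to adjust the offset before displaying it. The sketch |
| below does this for UTF-8 only; utf8_char_start is a hypothetical |
| helper, not a PCRE2 function. |
| |

```c
#include <stddef.h>

/* Step back from offset to the first code unit of the UTF-8 character
   that contains it. Continuation bytes have the bit pattern 10xxxxxx.
   An offset at or past the end of the pattern is returned as length. */
size_t utf8_char_start(const unsigned char *pattern, size_t length,
                       size_t offset)
{
  if (offset >= length) return length;
  while (offset > 0 && (pattern[offset] & 0xC0) == 0x80) offset--;
  return offset;
}
```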
| |
| This code fragment shows a typical straightforward call to pcre2_com- |
| pile(): |
| |
| pcre2_code *re; |
| PCRE2_SIZE erroffset; |
| int errorcode; |
| re = pcre2_compile( |
| "^A.*Z", /* the pattern */ |
| PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */ |
| 0, /* default options */ |
| &errorcode, /* for error code */ |
| &erroffset, /* for error offset */ |
| NULL); /* no compile context */ |
| |
| The following names for option bits are defined in the pcre2.h header |
| file: |
| |
| PCRE2_ANCHORED |
| |
| If this bit is set, the pattern is forced to be "anchored", that is, it |
| is constrained to match only at the first matching point in the string |
| that is being searched (the "subject string"). This effect can also be |
| achieved by appropriate constructs in the pattern itself, which is the |
| only way to do it in Perl. |
| |
| PCRE2_ALLOW_EMPTY_CLASS |
| |
| By default, for compatibility with Perl, a closing square bracket that |
| immediately follows an opening one is treated as a data character for |
| the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the |
| class, which therefore contains no characters and so can never match. |
| |
| PCRE2_ALT_BSUX |
| |
| This option requests alternative handling of three escape sequences, |
| which makes PCRE2's behaviour more like ECMAScript (aka JavaScript). |
| When it is set: |
| |
| (1) \U matches an upper case "U" character; by default \U causes a com- |
| pile time error (Perl uses \U to upper case subsequent characters). |
| |
| (2) \u matches a lower case "u" character unless it is followed by four |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, \u causes a compile time error (Perl |
| uses it to upper case the following character). |
| |
| (3) \x matches a lower case "x" character unless it is followed by two |
| hexadecimal digits, in which case the hexadecimal number defines the |
| code point to match. By default, as in Perl, a hexadecimal number is |
| always expected after \x, but it may have zero, one, or two digits (so, |
| for example, \xz matches a binary zero character followed by z). |
| |
| PCRE2_ALT_CIRCUMFLEX |
| |
| In multiline mode (when PCRE2_MULTILINE is set), the circumflex |
| metacharacter matches at the start of the subject (unless PCRE2_NOTBOL |
| is set), and also after any internal newline. However, it does not |
| match after a newline at the end of the subject, for compatibility with |
| Perl. If you want a multiline circumflex also to match after a termi- |
| nating newline, you must set PCRE2_ALT_CIRCUMFLEX. |
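| |
| The positions at which a circumflex can match may be modelled as |
| below. This is a simplified sketch, not PCRE2 code: it assumes that LF |
| is the only newline and that PCRE2_NOTBOL is not set, and the name |
| circumflex_matches is hypothetical. |
| |

```c
#include <stddef.h>

/* Can ^ match at position pos of a subject of length len? The last two
   arguments stand for PCRE2_MULTILINE and PCRE2_ALT_CIRCUMFLEX. */
int circumflex_matches(const char *subject, size_t len, size_t pos,
                       int multiline, int alt_circumflex)
{
  if (pos == 0) return 1;                       /* start of subject */
  if (!multiline || pos > len) return 0;
  if (subject[pos - 1] != '\n') return 0;       /* not after a newline */
  if (pos == len) return alt_circumflex != 0;   /* after final newline */
  return 1;                                     /* after internal newline */
}
```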
| |
| PCRE2_ALT_VERBNAMES |
| |
| By default, for compatibility with Perl, the name in any verb sequence |
| such as (*MARK:NAME) is any sequence of characters that does not |
| include a closing parenthesis. The name is not processed in any way, |
| and it is not possible to include a closing parenthesis in the name. |
| However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash |
| processing is applied to verb names and only an unescaped closing |
| parenthesis terminates the name. A closing parenthesis can be included |
| in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED |
| option is set, unescaped whitespace in verb names is skipped and #-com- |
| ments are recognized, exactly as in the rest of the pattern. |
| |
| PCRE2_AUTO_CALLOUT |
| |
| If this bit is set, pcre2_compile() automatically inserts callout |
| items, all with number 255, before each pattern item. For discussion of |
| the callout facility, see the pcre2callout documentation. |
| |
| PCRE2_CASELESS |
| |
| If this bit is set, letters in the pattern match both upper and lower |
| case letters in the subject. It is equivalent to Perl's /i option, and |
| it can be changed within a pattern by a (?i) option setting. |
| |
| PCRE2_DOLLAR_ENDONLY |
| |
| If this bit is set, a dollar metacharacter in the pattern matches only |
| at the end of the subject string. Without this option, a dollar also |
| matches immediately before a newline at the end of the string (but not |
| before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored |
| if PCRE2_MULTILINE is set. There is no equivalent to this option in |
| Perl, and no way to set it within a pattern. |
| |
| PCRE2_DOTALL |
| |
| If this bit is set, a dot metacharacter in the pattern matches any |
| character, including one that indicates a newline. However, it only |
| ever matches one character, even if newlines are coded as CRLF. Without |
| this option, a dot does not match when the current position in the sub- |
| ject is at a newline. This option is equivalent to Perl's /s option, |
| and it can be changed within a pattern by a (?s) option setting. A neg- |
| ative class such as [^a] always matches newline characters, independent |
| of the setting of this option. |
| |
| PCRE2_DUPNAMES |
| |
| If this bit is set, names used to identify capturing subpatterns need |
| not be unique. This can be helpful for certain types of pattern when it |
| is known that only one instance of the named subpattern can ever be |
| matched. There are more details of named subpatterns below; see also |
| the pcre2pattern documentation. |
| |
| PCRE2_EXTENDED |
| |
| If this bit is set, most white space characters in the pattern are |
| totally ignored except when escaped or inside a character class. How- |
| ever, white space is not allowed within sequences such as (?> that |
| introduce various parenthesized subpatterns, nor within numerical quan- |
| tifiers such as {1,3}. Ignorable white space is permitted between an |
| item and a following quantifier and between a quantifier and a follow- |
| ing + that indicates possessiveness. |
| |
| PCRE2_EXTENDED also causes characters between an unescaped # outside a |
| character class and the next newline, inclusive, to be ignored, which |
| makes it possible to include comments inside complicated patterns. Note |
| that the end of this type of comment is a literal newline sequence in |
| the pattern; escape sequences that happen to represent a newline do not |
| count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be |
| changed within a pattern by a (?x) option setting. |
| |
| Which characters are interpreted as newlines can be specified by a set- |
| ting in the compile context that is passed to pcre2_compile() or by a |
| special sequence at the start of the pattern, as described in the sec- |
| tion entitled "Newline conventions" in the pcre2pattern documentation. |
| A default is defined when PCRE2 is built. |
| |
| PCRE2_FIRSTLINE |
| |
| If this option is set, an unanchored pattern is required to match |
| before or at the first newline in the subject string, though the |
| matched text may continue over the newline. See also PCRE2_USE_OFF- |
| SET_LIMIT, which provides a more general limiting facility. If |
| PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the |
| first line and also within the offset limit. In other words, whichever |
| limit comes first is used. |
| |
| PCRE2_MATCH_UNSET_BACKREF |
| |
| If this option is set, a back reference to an unset subpattern group |
| matches an empty string (by default this causes the current matching |
| alternative to fail). A pattern such as (\1)(a) succeeds when this |
| option is set (assuming it can find an "a" in the subject), whereas it |
| fails by default, for Perl compatibility. Setting this option makes |
| PCRE2 behave more like ECMAScript (aka JavaScript). |
| |
| PCRE2_MULTILINE |
| |
| By default, for the purposes of matching "start of line" and "end of |
| line", PCRE2 treats the subject string as consisting of a single line |
| of characters, even if it actually contains newlines. The "start of |
| line" metacharacter (^) matches only at the start of the string, and |
| the "end of line" metacharacter ($) matches only at the end of the |
| string, or before a terminating newline (except when PCRE2_DOL- |
| LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set, |
| the "any character" metacharacter (.) does not match at a newline. This |
| behaviour (for ^, $, and dot) is the same as Perl. |
| |
| When PCRE2_MULTILINE is set, the "start of line" and "end of line" |
| constructs match immediately following or immediately before internal |
| newlines in the subject string, respectively, as well as at the very |
| start and end. This is equivalent to Perl's /m option, and it can be |
| changed within a pattern by a (?m) option setting. Note that the "start |
| of line" metacharacter does not match after a newline at the end of the |
| subject, for compatibility with Perl. However, you can change this by |
| setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a |
| subject string, or no occurrences of ^ or $ in a pattern, setting |
| PCRE2_MULTILINE has no effect. |
| |
| PCRE2_NEVER_BACKSLASH_C |
| |
| This option locks out the use of \C in the pattern that is being com- |
| piled. This escape can cause unpredictable behaviour in UTF-8 or |
| UTF-16 modes, because it may leave the current matching point in the |
| middle of a multi-code-unit character. This option may be useful in |
| applications that process patterns from external sources. Note that |
| there is also a build-time option that permanently locks out the use of |
| \C. |
| |
| PCRE2_NEVER_UCP |
| |
| This option locks out the use of Unicode properties for handling \B, |
| \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as |
| described for the PCRE2_UCP option below. In particular, it prevents |
| the creator of the pattern from enabling this facility by starting the |
| pattern with (*UCP). This option may be useful in applications that |
| process patterns from external sources. The option combination |
| PCRE2_UCP and PCRE2_NEVER_UCP causes an error. |
| |
| PCRE2_NEVER_UTF |
| |
| This option locks out interpretation of the pattern as UTF-8, UTF-16, |
| or UTF-32, depending on which library is in use. In particular, it pre- |
| vents the creator of the pattern from switching to UTF interpretation |
| by starting the pattern with (*UTF). This option may be useful in |
| applications that process patterns from external sources. The combina- |
| tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error. |
| |
| PCRE2_NO_AUTO_CAPTURE |
| |
| If this option is set, it disables the use of numbered capturing paren- |
| theses in the pattern. Any opening parenthesis that is not followed by |
| ? behaves as if it were followed by ?: but named parentheses can still |
| be used for capturing (and they acquire numbers in the usual way). |
| There is no equivalent of this option in Perl. |
| |
| PCRE2_NO_AUTO_POSSESS |
| |
| If this option is set, it disables "auto-possessification", which is an |
| optimization that, for example, turns a+b into a++b in order to avoid |
| backtracks into a+ that can never be successful. However, if callouts |
| are in use, auto-possessification means that some callouts are never |
| taken. You can set this option if you want the matching functions to do |
| a full unoptimized search and run all the callouts, but it is mainly |
| provided for testing purposes. |
| |
| PCRE2_NO_DOTSTAR_ANCHOR |
| |
| If this option is set, it disables an optimization that is applied when |
| .* is the first significant item in a top-level branch of a pattern, |
| and all the other branches also start with .* or with \A or \G or ^. |
| The optimization is automatically disabled for .* if it is inside an |
| atomic group or a capturing group that is the subject of a back refer- |
| ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti- |
| mization is not disabled, such a pattern is automatically anchored if |
| PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set |
| for any ^ items. Otherwise, the fact that any match must start either |
| at the start of the subject or following a newline is remembered. Like |
| other optimizations, this can cause callouts to be skipped. |
| |
| PCRE2_NO_START_OPTIMIZE |
| |
| This is an option whose main effect is at matching time. It does not |
| change what pcre2_compile() generates, but it does affect the output of |
| the JIT compiler. |
| |
| There are a number of optimizations that may occur at the start of a |
| match, in order to speed up the process. For example, if it is known |
| that an unanchored match must start with a specific character, the |
| matching code searches the subject for that character, and fails imme- |
| diately if it cannot find it, without actually running the main match- |
| ing function. This means that a special item such as (*COMMIT) at the |
| start of a pattern is not considered until after a suitable starting |
| point for the match has been found. Also, when callouts or (*MARK) |
| items are in use, these "start-up" optimizations can cause them to be |
| skipped if the pattern is never actually used. The start-up optimiza- |
| tions are in effect a pre-scan of the subject that takes place before |
| the pattern is run. |
| |
| The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, |
| possibly causing performance to suffer, but ensuring that in cases |
| where the result is "no match", the callouts do occur, and that items |
| such as (*COMMIT) and (*MARK) are considered at every possible starting |
| position in the subject string. |
| |
| Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching |
| operation. Consider the pattern |
| |
| (*COMMIT)ABC |
| |
| When this is compiled, PCRE2 records the fact that a match must start |
| with the character "A". Suppose the subject string is "DEFABC". The |
| start-up optimization scans along the subject, finds "A" and runs the |
| first match attempt from there. The (*COMMIT) item means that the |
| pattern must match at the current starting position, which in this case |
| it does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE |
| set, the initial scan along the subject string does not happen. The |
| first match attempt is run starting from "D" and when this fails, |
| (*COMMIT) prevents any further matches being tried, so the overall |
| result is "no match". There are also other start-up optimizations. For |
| example, a minimum length for the subject may be recorded. Consider the |
| pattern |
| |
| (*MARK:A)(X|Y) |
| |
| The minimum length for a match is one character. If the subject is |
| "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt |
| to match an empty string at the end of the subject does not take place, |
| because PCRE2 knows that the subject is now too short, and so the |
| (*MARK) is never encountered. In this case, the optimization does not |
| affect the overall match result, which is still "no match", but it does |
| affect the auxiliary information that is returned. |
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set, the validity of the pattern as a UTF string is |
| automatically checked. There are discussions about the validity of |
| UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode |
| document. If an invalid UTF sequence is found, pcre2_compile() returns |
| a negative error code. |
| |
| If you know that your pattern is valid, and you want to skip this check |
| for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. |
| When it is set, the effect of passing an invalid UTF string as a pat- |
| tern is undefined. It may cause your program to crash or loop. Note |
| that this option can also be passed to pcre2_match() and |
| pcre2_dfa_match(), to suppress validity checking of the subject string. |
| |
| PCRE2_UCP |
| |
| This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W, |
| \w, and some of the POSIX character classes. By default, only ASCII |
| characters are recognized, but if PCRE2_UCP is set, Unicode properties |
| are used instead to classify characters. More details are given in the |
| section on generic character types in the pcre2pattern page. If you set |
| PCRE2_UCP, matching one of the items it affects takes much longer. The |
| option is available only if PCRE2 has been compiled with Unicode sup- |
| port. |
| |
| PCRE2_UNGREEDY |
| |
| This option inverts the "greediness" of the quantifiers so that they |
| are not greedy by default, but become greedy if followed by "?". It is |
| not compatible with Perl. It can also be set by a (?U) option setting |
| within the pattern. |
| |
| PCRE2_USE_OFFSET_LIMIT |
| |
| This option must be set for pcre2_compile() if pcre2_set_offset_limit() |
| is going to be used to set a non-default offset limit in a match con- |
| text for matches that use this pattern. An error is generated if an |
| offset limit is set without this option. For more details, see the |
| description of pcre2_set_offset_limit() in the section that describes |
| match contexts. See also the PCRE2_FIRSTLINE option above. |
| |
| PCRE2_UTF |
| |
| This option causes PCRE2 to regard both the pattern and the subject |
| strings that are subsequently processed as strings of UTF characters |
| instead of single-code-unit strings. It is available when PCRE2 is |
| built to include Unicode support (which is the default). If Unicode |
| support is not available, the use of this option provokes an error. |
| Details of how this option changes the behaviour of PCRE2 are given in |
| the pcre2unicode page. |
| |
| |
| COMPILATION ERROR CODES |
| |
| There are over 80 positive error codes that pcre2_compile() may return |
| if it finds an error in the pattern. There are also some negative error |
| codes that are used for invalid UTF strings. These are the same as |
| given by pcre2_match() and pcre2_dfa_match(), and are described in the |
| pcre2unicode page. The pcre2_get_error_message() function can be called |
| to obtain a textual error message from any error code. |
| |
| |
| JUST-IN-TIME (JIT) COMPILATION |
| |
| int pcre2_jit_compile(pcre2_code *code, uint32_t options); |
| |
| int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); |
| |
| pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize, |
| PCRE2_SIZE maxsize, pcre2_general_context *gcontext); |
| |
| void pcre2_jit_stack_assign(pcre2_match_context *mcontext, |
| pcre2_jit_callback callback_function, void *callback_data); |
| |
| void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); |
| |
| These functions provide support for JIT compilation, which, if the |
| just-in-time compiler is available, further processes a compiled pat- |
| tern into machine code that executes much faster than the pcre2_match() |
| interpretive matching function. Full details are given in the pcre2jit |
| documentation. |
| |
| JIT compilation is a heavyweight optimization. It can take some time |
| for patterns to be analyzed, and for one-off matches and simple pat- |
| terns the benefit of faster execution might be offset by a much slower |
| compilation time. Most, but not all, patterns can be optimized by the |
| JIT compiler. |
| |
| |
| LOCALE SUPPORT |
| |
| PCRE2 handles caseless matching, and determines whether characters are |
| letters, digits, or whatever, by reference to a set of tables, indexed |
| by character code point. This applies only to characters whose code |
| points are less than 256. By default, higher-valued code points never |
| match escapes such as \w or \d. However, if PCRE2 is built with UTF |
| support, all characters can be tested with \p and \P, or, alterna- |
| tively, the PCRE2_UCP option can be set when a pattern is compiled; |
| this causes \w and friends to use Unicode property support instead of |
| the built-in tables. |
| |
| The use of locales with Unicode is discouraged. If you are handling |
| characters with code points greater than 127, you should either use |
| Unicode support, or use locales, but not try to mix the two. |
| |
| PCRE2 contains an internal set of character tables that are used by |
| default. These are sufficient for many applications. Normally, the |
| internal tables recognize only ASCII characters. However, when PCRE2 is |
| built, it is possible to cause the internal tables to be rebuilt in the |
| default "C" locale of the local system, which may cause them to be dif- |
| ferent. |
| |
| The internal tables can be overridden by tables supplied by the appli- |
| cation that calls PCRE2. These may be created in a different locale |
| from the default. As more and more applications change to using Uni- |
| code, the need for this locale support is expected to die away. |
| |
| External tables are built by calling the pcre2_maketables() function, |
| in the relevant locale. The result can be passed to pcre2_compile() as |
| often as necessary, by creating a compile context and calling |
| pcre2_set_character_tables() to set the tables pointer therein. For |
| example, to build and use tables that are appropriate for the French |
| locale (where accented characters with values greater than 127 are |
| treated as letters), the following code could be used: |
| |
| setlocale(LC_CTYPE, "fr_FR"); |
| tables = pcre2_maketables(NULL); |
| ccontext = pcre2_compile_context_create(NULL); |
| pcre2_set_character_tables(ccontext, tables); |
| re = pcre2_compile(..., ccontext); |
| |
| The locale name "fr_FR" is used on Linux and other Unix-like systems; |
| if you are using Windows, the name for the French locale is "french". |
| It is the caller's responsibility to ensure that the memory containing |
| the tables remains available for as long as it is needed. |
| |
| The pointer that is passed (via the compile context) to pcre2_compile() |
| is saved with the compiled pattern, and the same tables are used by |
| pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, |
| compilation and matching both happen in the same locale, but different |
| patterns can be processed in different locales. |
| |
| |
| INFORMATION ABOUT A COMPILED PATTERN |
| |
| int pcre2_pattern_info(const pcre2_code *code, uint32_t what, |
| void *where); |
| |
| The pcre2_pattern_info() function returns general information about a |
| compiled pattern. For information about callouts, see the next section. |
| The first argument for pcre2_pattern_info() is a pointer to the com- |
| piled pattern. The second argument specifies which piece of information |
| is required, and the third argument is a pointer to a variable to |
| receive the data. If the third argument is NULL, the first argument is |
| ignored, and the function returns the size in bytes of the variable |
| that is required for the information requested. Otherwise, the yield of |
| the function is zero for success, or one of the following negative num- |
| bers: |
| |
| PCRE2_ERROR_NULL the argument code was NULL |
| PCRE2_ERROR_BADMAGIC the "magic number" was not found |
| PCRE2_ERROR_BADOPTION the value of what was invalid |
| PCRE2_ERROR_UNSET the requested field is not set |
| |
| The "magic number" is placed at the start of each compiled pattern as |
| a simple check against passing an arbitrary memory pointer. Here is a |
| typical call of pcre2_pattern_info(), to obtain the length of the com- |
| piled pattern: |
| |
| int rc; |
| size_t length; |
| rc = pcre2_pattern_info( |
| re, /* result of pcre2_compile() */ |
| PCRE2_INFO_SIZE, /* what is required */ |
| &length); /* where to put the data */ |
| |
| The possible values for the second argument are defined in pcre2.h, and |
| are as follows: |
| |
| PCRE2_INFO_ALLOPTIONS |
| PCRE2_INFO_ARGOPTIONS |
| |
| Return a copy of the pattern's options. The third argument should point |
| to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the |
| options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP- |
| TIONS returns the compile options as modified by any top-level option |
| settings such as (*UTF) at the start of the pattern itself. For exam- |
| ple, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED |
| option, the result is PCRE2_EXTENDED and PCRE2_UTF. |
| |
| A pattern compiled without PCRE2_ANCHORED is automatically anchored by |
| PCRE2 if the first significant item in every top-level branch is one of |
| the following: |
| |
| ^ unless PCRE2_MULTILINE is set |
| \A always |
| \G always |
| .* sometimes - see below |
| |
| When .* is the first significant item, anchoring is possible only when |
| all the following are true: |
| |
| .* is not in an atomic group |
| .* is not in a capturing group that is the subject |
| of a back reference |
| PCRE2_DOTALL is in force for .* |
| Neither (*PRUNE) nor (*SKIP) appears in the pattern. |
| PCRE2_NO_DOTSTAR_ANCHOR is not set. |
| |
| For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in |
| the options returned for PCRE2_INFO_ALLOPTIONS. |
| |
| PCRE2_INFO_BACKREFMAX |
| |
| Return the number of the highest back reference in the pattern. The |
| third argument should point to a uint32_t variable. Named subpatterns |
| acquire numbers as well as names, and these count towards the highest |
| back reference. Back references such as \4 or \g{12} match the cap- |
| tured characters of the given group, but in addition, the check that a |
| capturing group is set in a conditional subpattern such as (?(3)a|b) is |
| also a back reference. Zero is returned if there are no back refer- |
| ences. |
| |
| PCRE2_INFO_BSR |
| |
| The output is a uint32_t whose value indicates what character sequences |
| the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that |
| \R matches any Unicode line ending sequence; a value of PCRE2_BSR_ANY- |
| CRLF means that \R matches only CR, LF, or CRLF. |
| |
| PCRE2_INFO_CAPTURECOUNT |
| |
| Return the highest capturing subpattern number in the pattern. In pat- |
| terns where (?| is not used, this is also the total number of capturing |
| subpatterns. The third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_FIRSTBITMAP |
| |
| In the absence of a single first code unit for a non-anchored pattern, |
| pcre2_compile() may construct a 256-bit table that defines a fixed set |
| of values for the first code unit in any match. For example, a pattern |
| that starts with [abc] results in a table with three bits set. When |
| code unit values greater than 255 are supported, the flag bit for 255 |
| means "any code unit of value 255 or above". If such a table was con- |
| structed, a pointer to it is returned. Otherwise NULL is returned. The |
| third argument should point to a const uint8_t * variable. |
| |
| PCRE2_INFO_FIRSTCODETYPE |
| |
| Return information about the first code unit of any matched string, for |
| a non-anchored pattern. The third argument should point to a uint32_t |
| variable. If there is a fixed first value, for example, the letter "c" |
| from a pattern such as (cat|cow|coyote), 1 is returned, and the charac- |
| ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is |
| no fixed first value, but it is known that a match can occur only at |
| the start of the subject or following a newline in the subject, 2 is |
| returned. Otherwise, and for anchored patterns, 0 is returned. |
| |
| PCRE2_INFO_FIRSTCODEUNIT |
| |
| Return the value of the first code unit of any matched string in the |
| situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. |
| The third argument should point to a uint32_t variable. In the 8-bit |
| library, the value is always less than 256. In the 16-bit library the |
| value can be up to 0xffff. In the 32-bit library in UTF-32 mode the |
| value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 |
| mode. |
| |
| PCRE2_INFO_HASBACKSLASHC |
| |
| Return 1 if the pattern contains any instances of \C, otherwise 0. The |
| third argument should point to a uint32_t variable. |
| |
| PCRE2_INFO_HASCRORLF |
| |
| Return 1 if the pattern contains any explicit matches for CR or LF |
| characters, otherwise 0. The third argument should point to a uint32_t |
| variable. An explicit match is either a literal CR or LF character, or |
| \r or \n. |
| |
| PCRE2_INFO_JCHANGED |
| |
| Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
| otherwise 0. The third argument should point to a uint32_t variable. |
| (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec- |
| tively. |
| |
| PCRE2_INFO_JITSIZE |
| |
| If the compiled pattern was successfully processed by pcre2_jit_com- |
| pile(), return the size of the JIT compiled code, otherwise return |
| zero. The third argument should point to a size_t variable. |
| |
| PCRE2_INFO_LASTCODETYPE |
| |
| Return 1 if there is a rightmost literal code unit that must exist in |
| any matched string, other than at its start. The third argument should |
| point to a uint32_t variable. If there is no such value, 0 is |
| returned. When 1 is returned, the code unit value itself can be |
| retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last |
| literal value is recorded only if it follows something of variable |
| length. For example, for the pattern /^a\d+z\d+/ the returned value is |
| 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ |
| the returned value is 0. |
| |
| PCRE2_INFO_LASTCODEUNIT |
| |
| Return the value of the rightmost literal code unit that must exist in |
| any matched string, other than at its start, if such a value has been |
| recorded. The third argument should point to a uint32_t variable. If |
| there is no such value, 0 is returned. |
| |
| PCRE2_INFO_MATCHEMPTY |
| |
| Return 1 if the pattern might match an empty string, otherwise 0. The |
| third argument should point to a uint32_t variable. When a pattern |
| contains recursive subroutine calls it is not always possible to deter- |
| mine whether or not it can match an empty string. PCRE2 takes a cau- |
| tious approach and returns 1 in such cases. |
| |
| PCRE2_INFO_MATCHLIMIT |
| |
| If the pattern set a match limit by including an item of the form |
| (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third |
| argument should point to an unsigned 32-bit integer. If no such value |
| has been set, the call to pcre2_pattern_info() returns the error |
| PCRE2_ERROR_UNSET. |
| |
| PCRE2_INFO_MAXLOOKBEHIND |
| |
| Return the number of characters (not code units) in the longest lookbe- |
| hind assertion in the pattern. The third argument should point to an |
| unsigned 32-bit integer. This information is useful when doing multi- |
| segment matching using the partial matching facilities. Note that the |
| simple assertions \b and \B require a one-character lookbehind. \A also |
| registers a one-character lookbehind, though it does not actually |
| inspect the previous character. This is to ensure that at least one |
| character from the old segment is retained when a new segment is pro- |
| cessed. Otherwise, if there are no lookbehinds in the pattern, \A might |
| match incorrectly at the start of a new segment. |
| |
| PCRE2_INFO_MINLENGTH |
| |
| If a minimum length for matching subject strings was computed, its |
| value is returned. Otherwise the returned value is 0. The value is a |
| number of characters, which in UTF mode may be different from the |
| number of code units. The third argument should point to a uint32_t |
| variable. The value is a lower bound to the length of any matching |
| string. There may not be any strings of that length that do actually |
| match, but every string that does match is at least that long. |
| |
| PCRE2_INFO_NAMECOUNT |
| PCRE2_INFO_NAMEENTRYSIZE |
| PCRE2_INFO_NAMETABLE |
| |
| PCRE2 supports the use of named as well as numbered capturing parenthe- |
| ses. The names are just an additional way of identifying the parenthe- |
| ses, which still acquire numbers. Several convenience functions such as |
| pcre2_substring_get_byname() are provided for extracting captured sub- |
| strings by name. It is also possible to extract the data directly, by |
| first converting the name to a number in order to access the correct |
| pointers in the output vector (described with pcre2_match() below). To |
| do the conversion, you need to use the name-to-number map, which is |
| described by these three values. |
| |
| The map consists of a number of fixed-size entries. PCRE2_INFO_NAME- |
| COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives |
| the size of each entry in code units; both of these return a uint32_t |
| value. The entry size depends on the length of the longest name. |
| |
| PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. |
| This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit |
| library, the first two bytes of each entry are the number of the cap- |
| turing parenthesis, most significant byte first. In the 16-bit library, |
| the pointer points to 16-bit code units, the first of which contains |
| the parenthesis number. In the 32-bit library, the pointer points to |
| 32-bit code units, the first of which contains the parenthesis number. |
| The rest of the entry is the corresponding name, zero terminated. |
| |
| The names are in alphabetical order. If (?| is used to create multiple |
| groups with the same number, as described in the section on duplicate |
| subpattern numbers in the pcre2pattern page, the groups may be given |
| the same name, but there is only one entry in the table. Different |
| names for groups of the same number are not permitted. |
| |
| Duplicate names for subpatterns with different numbers are permitted, |
| but only if PCRE2_DUPNAMES is set. They appear in the table in the |
| order in which they were found in the pattern. In the absence of (?| |
| this is the order of increasing number; when (?| is used this is not |
| necessarily the case because later subpatterns may have lower numbers. |
| |
| As a simple example of the name/number table, consider the following |
| pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED |
| is set, so white space - including newlines - is ignored): |
| |
| (?<date> (?<year>(\d\d)?\d\d) - |
| (?<month>\d\d) - (?<day>\d\d) ) |
| |
| There are four named subpatterns, so the table has four entries, and |
| each entry in the table is eight bytes long. The table is as follows, |
| with non-printing bytes shown in hexadecimal, and undefined bytes shown |
| as ??: |
| |
| 00 01 d a t e 00 ?? |
| 00 05 d a y 00 ?? ?? |
| 00 04 m o n t h 00 |
| 00 02 y e a r 00 ?? |
| |
| When writing code to extract data from named subpatterns using the |
| name-to-number map, remember that the length of the entries is likely |
| to be different for each compiled pattern. |
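| |
| The 8-bit table layout described above can be walked with a few lines |
| of C. This is a minimal sketch: the function name name_to_number is |
| hypothetical, and in a real program the table pointer, entry count, and |
| entry size would come from pcre2_pattern_info() with |
| PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and |
| PCRE2_INFO_NAMEENTRYSIZE. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit name table. Each entry is entry_size
   bytes long: the group number in two bytes (most significant first),
   then the zero-terminated name. Returns the group number, or 0 if the
   name is not in the table (named groups are never numbered 0). */
uint32_t name_to_number(const uint8_t *table, uint32_t name_count,
  uint32_t entry_size, const char *name)
{
  assert(table != NULL || name_count == 0);
  for (uint32_t i = 0; i < name_count; i++)
    {
    const uint8_t *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return ((uint32_t)entry[0] << 8) | entry[1];
    }
  return 0;
}
```
| |
| When PCRE2_DUPNAMES is in use, a name can occur in several entries; this |
| sketch returns only the first one it finds. |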
| |
| PCRE2_INFO_NEWLINE |
| |
| The output is a uint32_t with one of the following values: |
| |
| PCRE2_NEWLINE_CR Carriage return (CR) |
| PCRE2_NEWLINE_LF Linefeed (LF) |
| PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF) |
| PCRE2_NEWLINE_ANY Any Unicode line ending |
| PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF |
| |
| This specifies the default character sequence that will be recognized |
| as meaning "newline" while matching. |
| |
| PCRE2_INFO_RECURSIONLIMIT |
| |
| If the pattern set a recursion limit by including an item of the form |
| (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The third |
| argument should point to an unsigned 32-bit integer. If no such value |
| has been set, the call to pcre2_pattern_info() returns the error |
| PCRE2_ERROR_UNSET. |
| |
| PCRE2_INFO_SIZE |
| |
| Return the size of the compiled pattern in bytes (for all three |
| libraries). The third argument should point to a size_t variable. This |
| value includes the size of the general data block that precedes the |
| code units of the compiled pattern itself. The value that is used when |
| pcre2_compile() is getting memory in which to place the compiled pat- |
| tern may be slightly larger than the value returned by this option, |
| because there are cases where the code that calculates the size has to |
| over-estimate. Processing a pattern with the JIT compiler does not |
| alter the value returned by this option. |
| |
| |
| INFORMATION ABOUT A PATTERN'S CALLOUTS |
| |
| int pcre2_callout_enumerate(const pcre2_code *code, |
| int (*callback)(pcre2_callout_enumerate_block *, void *), |
| void *user_data); |
| |
| A script language that supports the use of string arguments in callouts |
| might like to scan all the callouts in a pattern before running the |
| match. This can be done by calling pcre2_callout_enumerate(). The first |
| argument is a pointer to a compiled pattern, the second points to a |
| callback function, and the third is arbitrary user data. The callback |
| function is called for every callout in the pattern in the order in |
| which they appear. Its first argument is a pointer to a callout enumer- |
| ation block, and its second argument is the user_data value that was |
| passed to pcre2_callout_enumerate(). The contents of the callout enu- |
| meration block are described in the pcre2callout documentation, which |
| also gives further details about callouts. |
| |
| |
| SERIALIZATION AND PRECOMPILING |
| |
| It is possible to save compiled patterns on disc or elsewhere, and |
| reload them later, subject to a number of restrictions. The functions |
| whose names begin with pcre2_serialize_ are used for this purpose. They |
| are described in the pcre2serialize documentation. |
| |
| |
| THE MATCH DATA BLOCK |
| |
| pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, |
| pcre2_general_context *gcontext); |
| |
| pcre2_match_data *pcre2_match_data_create_from_pattern( |
| const pcre2_code *code, pcre2_general_context *gcontext); |
| |
| void pcre2_match_data_free(pcre2_match_data *match_data); |
| |
| Information about a successful or unsuccessful match is placed in a |
| match data block, which is an opaque structure that is accessed by |
| function calls. In particular, the match data block contains a vector |
| of offsets into the subject string that define the matched part of the |
| subject and any substrings that were captured. This is known as the |
| ovector. |
| |
| Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match() |
| you must create a match data block by calling one of the creation func- |
| tions above. For pcre2_match_data_create(), the first argument is the |
| number of pairs of offsets in the ovector. One pair of offsets is |
| required to identify the string that matched the whole pattern, with |
| another pair for each captured substring. For example, a value of 4 |
| creates enough space to record the matched portion of the subject plus |
| three captured substrings. A minimum of 1 pair is imposed by |
| pcre2_match_data_create(), so it is always possible to return the over- |
| all matched string. |
| |
| The second argument of pcre2_match_data_create() is a pointer to a gen- |
| eral context, which can specify custom memory management for obtaining |
| the memory for the match data block. If you are not using custom memory |
| management, pass NULL, which causes malloc() to be used. |
| |
| For pcre2_match_data_create_from_pattern(), the first argument is a |
| pointer to a compiled pattern. The ovector is created to be exactly the |
| right size to hold all the substrings a pattern might capture. The sec- |
| ond argument is again a pointer to a general context, but in this case |
| if NULL is passed, the memory is obtained using the same allocator that |
| was used for the compiled pattern (custom or default). |
| |
| A match data block can be used many times, with the same or different |
| compiled patterns. You can extract information from a match data block |
| after a match operation has finished, using functions that are |
| described in the sections on matched strings and other match data |
| below. |
| |
| When a call of pcre2_match() fails, valid data is available in the |
| match block only when the error is PCRE2_ERROR_NOMATCH, |
| PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF |
| string. Exactly what is available depends on the error, and is detailed |
| below. |
| |
| When one of the matching functions is called, pointers to the compiled |
| pattern and the subject string are set in the match data block so that |
| they can be referenced by the extraction functions. After running a |
| match, you must not free a compiled pattern or a subject string until |
| after all operations on the match data block (for that match) have |
| taken place. |
| |
| When a match data block itself is no longer needed, it should be freed |
| by calling pcre2_match_data_free(). |
| |
| |
| MATCHING A PATTERN: THE TRADITIONAL FUNCTION |
| |
| int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext); |
| |
| The function pcre2_match() is called to match a subject string against |
| a compiled pattern, which is passed in the code argument. You can call |
| pcre2_match() with the same code argument as many times as you like, in |
| order to find multiple matches in the subject string or to match dif- |
| ferent subject strings with the same pattern. |
| |
| This function is the main matching facility of the library, and it |
| operates in a Perl-like manner. For specialist use there is also an |
| alternative matching function, which is described below in the section |
| about the pcre2_dfa_match() function. |
| |
| Here is an example of a simple call to pcre2_match(): |
| |
| pcre2_match_data *match_data = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_match( |
| re, /* result of pcre2_compile() */ |
| (PCRE2_SPTR)"some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| match_data, /* the match data block */ |
| NULL); /* a match context; NULL means use defaults */ |
| |
| If the subject string is zero-terminated, the length can be given as |
| PCRE2_ZERO_TERMINATED. A match context must be provided if certain less |
| common matching parameters are to be changed. For details, see the sec- |
| tion on the match context above. |
| |
| The string to be matched by pcre2_match() |
| |
| The subject string is passed to pcre2_match() as a pointer in subject, |
| a length in length, and a starting offset in startoffset. The length |
| and offset are in code units, not characters. That is, they are in |
| bytes for the 8-bit library, 16-bit code units for the 16-bit library, |
| and 32-bit code units for the 32-bit library, whether or not UTF pro- |
| cessing is enabled. |
| |
| If startoffset is greater than the length of the subject, pcre2_match() |
| returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the |
| search for a match starts at the beginning of the subject, and this is |
| by far the most common case. In UTF-8 or UTF-16 mode, the starting off- |
| set must point to the start of a character, or to the end of the sub- |
| ject (in UTF-32 mode, one code unit equals one character, so all off- |
| sets are valid). Like the pattern string, the subject may contain |
| binary zeroes. |
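| |
| In UTF-8 mode, an application can check a candidate starting offset |
| before calling pcre2_match(). The following is a minimal sketch with a |
| hypothetical function name; it assumes the subject is valid UTF-8 and |
| relies on the fact that continuation bytes have the form 10xxxxxx. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if offset is usable as a starting offset in a UTF-8 subject
   of the given length in bytes: it must be either the end of the subject
   or the first byte of a character. Return 0 otherwise. */
int utf8_offset_is_char_start(const uint8_t *subject, size_t length,
  size_t offset)
{
  assert(subject != NULL || length == 0);
  if (offset > length) return 0;       /* past the end of the subject */
  if (offset == length) return 1;      /* the end of the subject is allowed */
  return (subject[offset] & 0xC0) != 0x80;   /* not a continuation byte */
}
```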
| |
| A non-zero starting offset is useful when searching for another match |
| in the same subject by calling pcre2_match() again after a previous |
| success. Setting startoffset differs from passing over a shortened |
| string and setting PCRE2_NOTBOL in the case of a pattern that begins |
| with any kind of lookbehind. For example, consider the pattern |
| |
| \Biss\B |
| |
| which finds occurrences of "iss" in the middle of words. (\B matches |
| only if the current position in the subject is not a word boundary.) |
| When applied to the string "Mississippi" the first call to |
| pcre2_match() finds the first occurrence. If pcre2_match() is called |
| again with just the remainder of the subject, namely "issippi", it does |
| not match, because \B is always false at the start of the subject, |
| which is deemed to be a word boundary. However, if pcre2_match() is |
| passed the entire string again, but with startoffset set to 4, it finds |
| the second occurrence of "iss" because it is able to look behind the |
| starting point to discover that it is preceded by a letter. |
| |
| Finding all the matches in a subject is tricky when the pattern can |
| match an empty string. It is possible to emulate Perl's /g behaviour by |
| first trying the match again at the same offset, with the |
| PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that |
| fails, advancing the starting offset and trying an ordinary match |
| again. There is some code that demonstrates how to do this in the |
| pcre2demo sample program. In the most general case, you have to check |
| to see if the newline convention recognizes CRLF as a newline, and if |
| so, and the current character is CR followed by LF, advance the start- |
| ing offset by two characters instead of one. |
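| |
| The last step, deciding how far to advance once the retry at the same |
| offset has failed, might be sketched as follows. This is plain C with |
| no PCRE2 calls. The function name is hypothetical, and the |
| crlf_is_newline and utf8 flags are assumptions that the caller would |
| have to determine from the compiled pattern. The offset must be less |
| than the subject length. |
| |
```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Returns the next starting
   offset after an empty match at offset, where offset < length. */
size_t advance_one_character(const unsigned char *subject, size_t length,
                             size_t offset, int crlf_is_newline, int utf8)
{
    size_t next = offset + 1;

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next += 1;                       /* step over CR and LF together */
    } else if (utf8) {
        /* Skip continuation bytes so that next is a character start. */
        while (next < length && (subject[next] & 0xC0) == 0x80)
            next += 1;
    }
    return next;
}
```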
| |
| If a non-zero starting offset is passed when the pattern is anchored, |
| one attempt to match at the given offset is made. This can only succeed |
| if the pattern does not require the match to be at the start of the |
| subject. |
| |
| Option bits for pcre2_match() |
| |
| The unused bits of the options argument for pcre2_match() must be zero. |
| The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, |
| PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Their |
| action is described below. |
| |
| Setting PCRE2_ANCHORED at match time is not supported by the just-in- |
| time (JIT) compiler. If it is set, JIT matching is disabled and the |
| normal interpretive code in pcre2_match() is run. The remaining options |
| are supported for JIT matching. |
| |
| PCRE2_ANCHORED |
| |
| The PCRE2_ANCHORED option limits pcre2_match() to matching at the first |
| matching position. If a pattern was compiled with PCRE2_ANCHORED, or |
| turned out to be anchored by virtue of its contents, it cannot be made |
| unanchored at matching time. Note that setting the option at match time |
| disables JIT matching. |
| |
| PCRE2_NOTBOL |
| |
| This option specifies that the first character of the subject string is |
| not the beginning of a line, so the circumflex metacharacter should not |
| match before it. Setting this without having set PCRE2_MULTILINE at |
| compile time causes circumflex never to match. This option affects only |
| the behaviour of the circumflex metacharacter. It does not affect \A. |
| |
| PCRE2_NOTEOL |
| |
| This option specifies that the end of the subject string is not the end |
| of a line, so the dollar metacharacter should not match it nor (except |
| in multiline mode) a newline immediately before it. Setting this with- |
| out having set PCRE2_MULTILINE at compile time causes dollar never to |
| match. This option affects only the behaviour of the dollar metacharac- |
| ter. It does not affect \Z or \z. |
| |
| PCRE2_NOTEMPTY |
| |
| An empty string is not considered to be a valid match if this option is |
| set. If there are alternatives in the pattern, they are tried. If all |
| the alternatives match the empty string, the entire match fails. For |
| example, if the pattern |
| |
| a?b? |
| |
| is applied to a string not beginning with "a" or "b", it matches an |
| empty string at the start of the subject. With PCRE2_NOTEMPTY set, this |
| match is not valid, so pcre2_match() searches further into the string |
| for occurrences of "a" or "b". |
| |
| PCRE2_NOTEMPTY_ATSTART |
| |
| This is like PCRE2_NOTEMPTY, except that it locks out an empty string |
| match only at the first matching position, that is, at the start of the |
| subject plus the starting offset. An empty string match later in the |
| subject is permitted. If the pattern is anchored, such a match can |
| occur only if the pattern contains \K. |
| |
| PCRE2_NO_UTF_CHECK |
| |
| When PCRE2_UTF is set at compile time, the validity of the subject as a |
| UTF string is checked by default when pcre2_match() is subsequently |
| called. If a non-zero starting offset is given, the check is applied |
| only to that part of the subject that could be inspected during match- |
| ing, and there is a check that the starting offset points to the first |
| code unit of a character or to the end of the subject. If there are no |
| lookbehind assertions in the pattern, the check starts at the starting |
| offset. Otherwise, it starts at the length of the longest lookbehind |
| before the starting offset, or at the start of the subject if there are |
| not that many characters before the starting offset. Note that the |
| sequences \b and \B are one-character lookbehinds. |
| |
| The check is carried out before any other processing takes place, and a |
| negative error code is returned if the check fails. There are several |
| UTF error codes for each code unit width, corresponding to different |
| problems with the code unit sequence. There are discussions about the |
| validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the |
| pcre2unicode page. |
| |
| If you know that your subject is valid, and you want to skip these |
| checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK |
| option when calling pcre2_match(). You might want to do this for the |
| second and subsequent calls to pcre2_match() if you are making repeated |
| calls to find all the matches in a single subject string. |
| |
| NOTE: When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid |
| string as a subject, or an invalid value of startoffset, is undefined. |
| Your program may crash or loop indefinitely. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These options turn on the partial matching feature. A partial match |
| occurs if the end of the subject string is reached successfully, but |
| there are not enough subject characters to complete the match. If this |
| happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set, |
| matching continues by testing any remaining alternatives. Only if no |
| complete match can be found is PCRE2_ERROR_PARTIAL returned instead of |
| PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that |
| the caller is prepared to handle a partial match, but only if no com- |
| plete match can be found. |
| |
| If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this |
| case, if a partial match is found, pcre2_match() immediately returns |
| PCRE2_ERROR_PARTIAL, without considering any other alternatives. In |
| other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid- |
| ered to be more important than an alternative complete match. |
| |
| There is a more detailed discussion of partial and multi-segment match- |
| ing, with examples, in the pcre2partial documentation. |
| |
| |
| NEWLINE HANDLING WHEN MATCHING |
| |
| When PCRE2 is built, a default newline convention is set; this is usu- |
| ally the standard convention for the operating system. The default can |
| be overridden in a compile context by calling pcre2_set_newline(). It |
| can also be overridden by starting a pattern string with, for example, |
| (*CRLF), as described in the section on newline conventions in the |
| pcre2pattern page. During matching, the newline choice affects the be- |
| haviour of the dot, circumflex, and dollar metacharacters. It may also |
| alter the way the match starting position is advanced after a match |
| failure for an unanchored pattern. |
| |
| When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is |
| set as the newline convention, and a match attempt for an unanchored |
| pattern fails when the current starting position is at a CRLF sequence, |
| and the pattern contains no explicit matches for CR or LF characters, |
| the match position is advanced by two characters instead of one, in |
| other words, to after the CRLF. |
| |
| The above rule is a compromise that makes the most common cases work as |
| expected. For example, if the pattern is .+A (and the PCRE2_DOTALL |
| option is not set), it does not match the string "\r\nA" because, after |
| failing at the start, it skips both the CR and the LF before retrying. |
| However, the pattern [\r\n]A does match that string, because it con- |
| tains an explicit CR or LF reference, and so advances only by one char- |
| acter after the first failure. |
| |
| An explicit match for CR or LF is either a literal appearance of one of |
| those characters in the pattern, or one of the \r or \n escape |
| sequences. Implicit matches such as [^X] do not count, nor does \s, |
| even though it includes CR and LF in the characters that it matches. |
| |
| Notwithstanding the above, anomalous effects may still occur when CRLF |
| is a valid newline sequence and explicit \r or \n escapes appear in the |
| pattern. |
| |
| |
| HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS |
| |
| uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); |
| |
| In general, a pattern matches a certain portion of the subject, and in |
| addition, further substrings from the subject may be picked out by |
| parenthesized parts of the pattern. Following the usage in Jeffrey |
| Friedl's book, this is called "capturing" in what follows, and the |
| phrase "capturing subpattern" or "capturing group" is used for a frag- |
| ment of a pattern that picks out a substring. PCRE2 supports several |
| other kinds of parenthesized subpattern that do not cause substrings to |
| be captured. The pcre2_pattern_info() function can be used to find out |
| how many capturing subpatterns there are in a compiled pattern. |
| |
| You can use auxiliary functions for accessing captured substrings by |
| number or by name, as described in sections below. |
| |
| Alternatively, you can make direct use of the vector of PCRE2_SIZE val- |
| ues, called the ovector, which contains the offsets of captured |
| strings. It is part of the match data block. The function |
| pcre2_get_ovector_pointer() returns the address of the ovector, and |
| pcre2_get_ovector_count() returns the number of pairs of values it con- |
| tains. |
| |
| Within the ovector, the first in each pair of values is set to the off- |
| set of the first code unit of a substring, and the second is set to the |
| offset of the first code unit after the end of a substring. These val- |
| ues are always code unit offsets, not character offsets. That is, they |
| are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit |
| library, and 32-bit offsets in the 32-bit library. |
| |
| After a partial match (error return PCRE2_ERROR_PARTIAL), only the |
| first pair of offsets (that is, ovector[0] and ovector[1]) are set. |
| They identify the part of the subject that was partially matched. See |
| the pcre2partial documentation for details of partial matching. |
| |
| After a successful match, the first pair of offsets identifies the por- |
| tion of the subject string that was matched by the entire pattern. The |
| next pair is used for the first capturing subpattern, and so on. The |
| value returned by pcre2_match() is one more than the highest numbered |
| pair that has been set. For example, if two substrings have been cap- |
| tured, the returned value is 3. If there are no capturing subpatterns, |
| the return value from a successful match is 1, indicating that just the |
| first pair of offsets has been set. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. |
| |
| If a capturing subpattern group is matched repeatedly within a single |
| match operation, it is the last portion of the subject that it matched |
| that is returned. |
| |
| If the ovector is too small to hold all the captured substring offsets, |
| as much as possible is filled in, and the function returns a value of |
| zero. If captured substrings are not of interest, pcre2_match() may be |
| called with a match data block whose ovector is of minimum length (that |
| is, one pair). However, if the pattern contains back references and the |
| ovector is not big enough to remember the related substrings, PCRE2 has |
| to get additional memory for use during matching. Thus it is usually |
| advisable to set up a match data block containing an ovector of reason- |
| able size. |
| |
| It is possible for capturing subpattern number n+1 to match some part |
| of the subject when subpattern n has not been used at all. For example, |
| if the string "abc" is matched against the pattern (a|(z))(bc) the |
| return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 2 is not. When this happens, both values in the offset pairs corre- |
| sponding to unused subpatterns are set to PCRE2_UNSET. |
| |
| Offset values that correspond to unused subpatterns at the end of the |
| expression are also set to PCRE2_UNSET. For example, if the string |
| "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 |
| are not matched. The return from the function is 2, because the high- |
| est used capturing subpattern number is 1. The offsets for the second |
| and third capturing subpatterns (assuming the vector is large enough, |
| of course) are set to PCRE2_UNSET. |
| |
| Elements in the ovector that do not correspond to capturing parentheses |
| in the pattern are never changed. That is, if a pattern contains n cap- |
| turing parentheses, no more than ovector[0] to ovector[2n+1] are set by |
| pcre2_match(). The other elements retain whatever values they previ- |
| ously had. |
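| |
| The rules in this section for reading the ovector can be sketched in |
| plain C. The helper names and the SKETCH_UNSET macro are hypothetical |
| stand-ins (the latter for PCRE2_UNSET, with size_t standing in for |
| PCRE2_SIZE), so that the fragment does not depend on the PCRE2 header. |
| A return value of zero (ovector too small) is not handled here. |
| |
```c
#include <stddef.h>

#define SKETCH_UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* rc is the positive return value of a successful match: one more than the
   highest numbered pair that was set. Group n is usable only if n < rc and
   its start offset is not the unset marker. */
int group_is_set(const size_t *ovector, int rc, int n)
{
    return n >= 0 && n < rc && ovector[2 * n] != SKETCH_UNSET;
}

/* Length in code units of a set group. A start beyond the end, which \K in
   a lookahead can produce for group 0, is treated here as an empty string. */
size_t group_length(const size_t *ovector, int n)
{
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    return end > start ? end - start : 0;
}
```
| |
| For the (a|(z))(bc) example above, where the return value is 4, |
| group_is_set() reports groups 0, 1, and 3 as set and group 2 as unset. |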
| |
| |
| OTHER INFORMATION ABOUT A MATCH |
| |
| PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); |
| |
| PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); |
| |
| As well as the offsets in the ovector, other information about a match |
| is retained in the match data block and can be retrieved by the above |
| functions in appropriate circumstances. If they are called at other |
| times, the result is undefined. |
| |
| After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a |
| failure to match (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail- |
| able, and pcre2_get_mark() can be called. It returns a pointer to the |
| zero-terminated name, which is within the compiled pattern. Otherwise |
| NULL is returned. The length of the (*MARK) name (excluding the termi- |
| nating zero) is stored in the code unit that precedes the name. You |
| should use this instead of relying on the terminating zero if the |
| (*MARK) name might contain a binary zero. |
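| |
| A minimal sketch of reading that length in the 8-bit case follows. The |
| function name is hypothetical. The argument is assumed to be a non-NULL |
| pointer of the kind pcre2_get_mark() returns, with unsigned char stand- |
| ing in for the 8-bit code unit type. |
| |
```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. mark points at the first code
   unit of a (*MARK) name; its length is held in the preceding code unit. */
size_t mark_name_length(const unsigned char *mark)
{
    return (size_t)mark[-1];
}
```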
| |
| After a successful match, the (*MARK) name that is returned is the last |
| one encountered on the matching path through the pattern. After a "no |
| match" or a partial match, the last encountered (*MARK) name is |
| returned. For example, consider this pattern: |
| |
| ^(*MARK:A)((*MARK:B)a|b)c |
| |
| When it matches "bc", the returned mark is A. The B mark is "seen" in |
| the first branch of the group, but it is not on the matching path. On |
| the other hand, when this pattern fails to match "bx", the returned |
| mark is B. |
| |
| After a successful match, a partial match, or one of the invalid UTF |
| errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can |
| be called. After a successful or partial match it returns the code unit |
| offset of the character at which the match started. For a non-partial |
| match, this can be different to the value of ovector[0] if the pattern |
| contains the \K escape sequence. After a partial match, however, this |
| value is always the same as ovector[0] because \K does not affect the |
| result of a partial match. |
| |
| After a UTF check failure, pcre2_get_startchar() can be used to obtain |
| the code unit offset of the invalid UTF character. Details are given in |
| the pcre2unicode page. |
| |
| |
| ERROR RETURNS FROM pcre2_match() |
| |
| If pcre2_match() fails, it returns a negative number. This can be con- |
| verted to a text string by calling pcre2_get_error_message(). Negative |
| error codes are also returned by other functions, and are documented |
| with them. The codes are given names in the header file. If UTF check- |
| ing is in force and an invalid UTF subject string is detected, one of a |
| number of UTF-specific negative error codes is returned. Details are |
| given in the pcre2unicode page. The following are the other errors that |
| may be returned by pcre2_match(): |
| |
| PCRE2_ERROR_NOMATCH |
| |
| The subject string did not match the pattern. |
| |
| PCRE2_ERROR_PARTIAL |
| |
| The subject string did not match, but it did match partially. See the |
| pcre2partial documentation for details of partial matching. |
| |
| PCRE2_ERROR_BADMAGIC |
| |
| PCRE2 stores a 4-byte "magic number" at the start of the compiled code, |
| to catch the case when it is passed a junk pointer. This is the error |
| that is returned when the magic number is not present. |
| |
| PCRE2_ERROR_BADMODE |
| |
| This error is given when a pattern that was compiled by the 8-bit |
| library is passed to a 16-bit or 32-bit library function, or vice |
| versa. |
| |
| PCRE2_ERROR_BADOFFSET |
| |
| The value of startoffset was greater than the length of the subject. |
| |
| PCRE2_ERROR_BADOPTION |
| |
| An unrecognized bit was set in the options argument. |
| |
| PCRE2_ERROR_BADUTFOFFSET |
| |
| The UTF code unit sequence that was passed as a subject was checked and |
| found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the |
| value of startoffset did not point to the beginning of a UTF character |
| or the end of the subject. |
| |
| PCRE2_ERROR_CALLOUT |
| |
| This error is never generated by pcre2_match() itself. It is provided |
| for use by callout functions that want to cause pcre2_match() or |
| pcre2_callout_enumerate() to return a distinctive error code. See the |
| pcre2callout documentation for details. |
| |
| PCRE2_ERROR_INTERNAL |
| |
| An unexpected internal error has occurred. This error could be caused |
| by a bug in PCRE2 or by overwriting of the compiled pattern. |
| |
| PCRE2_ERROR_JIT_BADOPTION |
| |
| This error is returned when a pattern that was successfully studied |
| using JIT is being matched, but the matching mode (partial or complete |
| match) does not correspond to any JIT compilation mode. When the JIT |
| fast path function is used, this error may be also given for invalid |
| options. See the pcre2jit documentation for more details. |
| |
| PCRE2_ERROR_JIT_STACKLIMIT |
| |
| This error is returned when a pattern that was successfully studied |
| using JIT is being matched, but the memory available for the just-in- |
| time processing stack is not large enough. See the pcre2jit documenta- |
| tion for more details. |
| |
| PCRE2_ERROR_MATCHLIMIT |
| |
| The backtracking limit was reached. |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| If a pattern contains back references, but the ovector is not big |
| enough to remember the referenced substrings, PCRE2 gets a block of |
| memory at the start of matching to use for this purpose. There are some |
| other special cases where extra memory is needed during matching. This |
| error is given when memory cannot be obtained. |
| |
| PCRE2_ERROR_NULL |
| |
| Either the code, subject, or match_data argument was passed as NULL. |
| |
| PCRE2_ERROR_RECURSELOOP |
| |
| This error is returned when pcre2_match() detects a recursion loop |
| within the pattern. Specifically, it means that either the whole pat- |
| tern or a subpattern has been called recursively for the second time at |
| the same position in the subject string. Some simple patterns that |
| might do this are detected and faulted at compile time, but more com- |
| plicated cases, in particular mutual recursions between two different |
| subpatterns, cannot be detected until matching is attempted. |
| |
| PCRE2_ERROR_RECURSIONLIMIT |
| |
| The internal recursion limit was reached. |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| |
| int pcre2_substring_length_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR *buffer, |
| PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_bynumber(pcre2_match_data *match_data, |
| uint32_t number, PCRE2_UCHAR **bufferptr, |
| PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| Captured substrings can be accessed directly by using the ovector as |
| described above. For convenience, auxiliary functions are provided for |
| extracting captured substrings as new, separate, zero-terminated |
| strings. A substring that contains a binary zero is correctly extracted |
| and has a further zero added on the end, but the result is not, of |
| course, a C string. |
| |
| The functions in this section identify substrings by number. The number |
| zero refers to the entire matched substring, with higher numbers refer- |
| ring to substrings captured by parenthesized groups. After a partial |
| match, only substring zero is available. An attempt to extract any |
| other substring gives the error PCRE2_ERROR_PARTIAL. The next section |
| describes similar functions for extracting captured substrings by name. |
| |
| If a pattern uses the \K escape sequence within a positive assertion, |
| the reported start of a successful match can be greater than the end of |
| the match. For example, if the pattern (?=ab\K) is matched against |
| "ab", the start and end offset values for the match are 2 and 0. In |
| this situation, calling these functions with a zero substring number |
| extracts a zero-length empty string. |
| |
| You can find the length in code units of a captured substring without |
| extracting it by calling pcre2_substring_length_bynumber(). The first |
| argument is a pointer to the match data block, the second is the group |
| number, and the third is a pointer to a variable into which the length |
| is placed. If you just want to know whether or not the substring has |
| been captured, you can pass the third argument as NULL. |
| |
| The pcre2_substring_copy_bynumber() function copies a captured sub- |
| string into a supplied buffer, whereas pcre2_substring_get_bynumber() |
| copies it into new memory, obtained using the same memory allocation |
| function that was used for the match data block. The first two argu- |
| ments of these functions are a pointer to the match data block and a |
| capturing group number. |
| |
| The final arguments of pcre2_substring_copy_bynumber() are a pointer to |
| the buffer and a pointer to a variable that contains its length in code |
| units. This is updated to contain the actual number of code units used |
| for the extracted substring, excluding the terminating zero. |
| |
| For pcre2_substring_get_bynumber() the third and fourth arguments point |
| to variables that are updated with a pointer to the new memory and the |
| number of code units that comprise the substring, again excluding the |
| terminating zero. When the substring is no longer needed, the memory |
| should be freed by calling pcre2_substring_free(). |
| |
| The return value from all these functions is zero for success, or a |
| negative error code. If the pattern match failed, the match failure |
| code is returned. If a substring number greater than zero is used |
| after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible |
| error codes are: |
| |
| PCRE2_ERROR_NOMEMORY |
| |
| The buffer was too small for pcre2_substring_copy_bynumber(), or the |
| attempt to get memory failed for pcre2_substring_get_bynumber(). |
| |
| PCRE2_ERROR_NOSUBSTRING |
| |
| There is no substring with that number in the pattern, that is, the |
| number is greater than the number of capturing parentheses. |
| |
| PCRE2_ERROR_UNAVAILABLE |
| |
| The substring number, though not greater than the number of captures in |
| the pattern, is greater than the number of slots in the ovector, so the |
| substring could not be captured. |
| |
| PCRE2_ERROR_UNSET |
| |
| The substring did not participate in the match. For example, if the |
| pattern is (abc)|(def) and the subject is "def", and the ovector con- |
| tains at least two capturing slots, substring number 1 is unset. |
| |
| |
| EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS |
| |
| int pcre2_substring_list_get(pcre2_match_data *match_data, |
| PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); |
| |
| void pcre2_substring_list_free(PCRE2_SPTR *list); |
| |
| The pcre2_substring_list_get() function extracts all available sub- |
| strings and builds a list of pointers to them. It also (optionally) |
| builds a second list that contains their lengths (in code units), |
| excluding a terminating zero that is added to each of them. All this is |
| done in a single block of memory that is obtained using the same memory |
| allocation function that was used to get the match data block. |
| |
| This function must be called only after a successful match. If called |
| after a partial match, the error code PCRE2_ERROR_PARTIAL is returned. |
| |
| The address of the memory block is returned via listptr, which is also |
| the start of the list of string pointers. The end of the list is marked |
| by a NULL pointer. The address of the list of lengths is returned via |
| lengthsptr. If your strings do not contain binary zeros and you do not |
| therefore need the lengths, you may supply NULL as the lengthsptr argu- |
| ment to disable the creation of a list of lengths. The yield of the |
| function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- |
| ory block could not be obtained. When the list is no longer needed, it |
| should be freed by calling pcre2_substring_list_free(). |
| |
| If this function encounters a substring that is unset, which can happen |
| when capturing subpattern number n+1 matches some part of the subject, |
| but subpattern n has not been used at all, it returns an empty string. |
| This can be distinguished from a genuine zero-length substring by |
| inspecting the appropriate offset in the ovector, which contains |
| PCRE2_UNSET for unset substrings, or by calling pcre2_sub- |
| string_length_bynumber(). |
| |
| |
| EXTRACTING CAPTURED SUBSTRINGS BY NAME |
| |
| int pcre2_substring_number_from_name(const pcre2_code *code, |
| PCRE2_SPTR name); |
| |
| int pcre2_substring_length_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_SIZE *length); |
| |
| int pcre2_substring_copy_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); |
| |
| int pcre2_substring_get_byname(pcre2_match_data *match_data, |
| PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); |
| |
| void pcre2_substring_free(PCRE2_UCHAR *buffer); |
| |
| To extract a substring by name, you first have to find the associated |
| number. For example, for this pattern: |
| |
| (a+)b(?<xxx>\d+)... |
| |
| the number of the subpattern called "xxx" is 2. If the name is known to |
| be unique (PCRE2_DUPNAMES was not set), you can find the number from |
| the name by calling pcre2_substring_number_from_name(). The first argu- |
| ment is the compiled pattern, and the second is the name. The yield of |
| the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there |
| is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if |
| there is more than one subpattern of that name. Given the number, you |
| can extract the substring directly, or use one of the functions |
| described above. |
| |
| For convenience, there are also "byname" functions that correspond to |
| the "bynumber" functions, the only difference being that the second |
| argument is a name instead of a number. If PCRE2_DUPNAMES is set and |
| there are duplicate names, these functions scan all the groups with the |
| given name, and return the first named string that is set. |
| |
| If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is |
| returned. If all groups with the name have numbers that are greater |
| than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is |
| returned. If there is at least one group with a slot in the ovector, |
| but no group is found to be set, PCRE2_ERROR_UNSET is returned. |
| |
| Warning: If the pattern uses the (?| feature to set up multiple subpat- |
| terns with the same number, as described in the section on duplicate |
| subpattern numbers in the pcre2pattern page, you cannot use names to |
| distinguish the different subpatterns, because names are not included |
| in the compiled code. The matching process uses only numbers. For this |
| reason, the use of different names for subpatterns of the same number |
| causes an error at compile time. |
| |
| |
| CREATING A NEW STRING WITH SUBSTITUTIONS |
| |
| int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, PCRE2_SPTR replacement, |
| PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, |
| PCRE2_SIZE *outlengthptr); |
| |
| This function calls pcre2_match() and then makes a copy of the subject |
| string in outputbuffer, replacing the part that was matched with the |
| replacement string, whose length is supplied in rlength. This can be |
| given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in |
| which a \K item in a lookahead in the pattern causes the match to end |
| before it starts are not supported, and give rise to an error return. |
| |
| The first seven arguments of pcre2_substitute() are the same as for |
| pcre2_match(), except that the partial matching options are not permit- |
| ted, and match_data may be passed as NULL, in which case a match data |
| block is obtained and freed within this function, using memory manage- |
| ment functions from the match context, if provided, or else those that |
| were used to allocate memory for the compiled code. |
| |
| The outlengthptr argument must point to a variable that contains the |
| length, in code units, of the output buffer. If the function is suc- |
| cessful, the value is updated to contain the length of the new string, |
| excluding the trailing zero that is automatically added. |
| |
| If the function is not successful, the value set via outlengthptr |
| depends on the type of error. For syntax errors in the replacement |
| string, the value is the offset in the replacement string where the |
| error was detected. For other errors, the value is PCRE2_UNSET by |
| default. This includes the case of the output buffer being too small, |
| unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which |
| case the value is the minimum length needed, including space for the |
| trailing zero. Note that in order to compute the required length, |
| pcre2_substitute() has to simulate all the matching and copying, |
| instead of giving an error return as soon as the buffer overflows. Note |
| also that the length is in code units, not bytes. |
| |
| In the replacement string, which is interpreted as a UTF string in UTF |
| mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK |
| option is set, a dollar character is an escape character that can spec- |
| ify the insertion of characters from capturing groups or (*MARK) items |
| in the pattern. The following forms are always recognized: |
| |
| $$ insert a dollar character |
| $<n> or ${<n>} insert the contents of group <n> |
| $*MARK or ${*MARK} insert the name of the last (*MARK) encountered |
| |
| Either a group number or a group name can be given for <n>. Curly |
| brackets are required only if the following character would be inter- |
| preted as part of the number or name. The number may be zero to include |
| the entire matched string. For example, if the pattern a(b)c is |
| matched with "=abc=" and the replacement string "+$1$0$1+", the result |
| is "=+babcb+=". |
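| |
| The simple insertion forms can be modelled in a few lines of standalone |
| C. This is only a sketch under stated assumptions, not the code that |
| PCRE2 uses: expand_simple() is a hypothetical helper, it handles only |
| $$, $<number> and ${<number>} (no group names and no (*MARK)), and it |
| takes the captured groups as an array of C strings instead of reading |
| them from a match data block. |

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: expands $$, $<n> and ${<n>} in
   "replacement", where groups[n] is the text captured by group n. Returns
   the expanded length, or -1 on a syntax error, an unknown group, or if
   "out" (capacity "outsize", including the trailing zero) is too small. */
long expand_simple(const char *replacement, const char *const *groups,
                   size_t ngroups, char *out, size_t outsize)
{
    size_t used = 0;
    assert(replacement != NULL && groups != NULL && out != NULL);
    for (const char *p = replacement; *p != '\0'; p++) {
        const char *text = p;          /* default: copy one literal char */
        size_t len = 1;
        if (*p == '$') {
            p++;
            if (*p == '$') {
                text = p;              /* "$$" inserts a single dollar */
            } else {
                int braced = (*p == '{');
                size_t n = 0;
                if (braced) p++;
                if (!isdigit((unsigned char)*p)) return -1;
                while (isdigit((unsigned char)*p))
                    n = n * 10 + (size_t)(*p++ - '0');
                if (braced && *p != '}') return -1;
                if (!braced) p--;      /* leave p on the last digit */
                if (n >= ngroups) return -1;
                text = groups[n];
                len = strlen(text);
            }
        }
        if (used + len + 1 > outsize) return -1;
        memcpy(out + used, text, len);
        used += len;
    }
    if (outsize == 0) return -1;
    out[used] = '\0';
    return (long)used;
}
```

| With groups[0] set to "abc" and groups[1] set to "b", expanding |
| "+$1$0$1+" yields "+babcb+", which is the text that replaces "abc" in |
| the example above. |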
| |
| The facility for inserting a (*MARK) name can be used to perform simple |
| simultaneous substitutions, as this pcre2test example shows: |
| |
| /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK} |
| apple lemon |
| 2: pear orange |
| |
| As well as the usual options for pcre2_match(), a number of additional |
| options can be set in the options argument. |
| |
| PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject |
| string, replacing every matching substring. If this is not set, only |
| the first matching substring is replaced. If any matched substring has |
| zero length, after the substitution has happened, an attempt to find a |
| non-empty match at the same position is performed. If this is not suc- |
| cessful, the current position is advanced by one character except when |
| CRLF is a valid newline sequence and the next two characters are CR, |
| LF. In this case, the current position is advanced by two characters. |
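| |
| The stepping rule for empty matches can be sketched as follows. This is |
| an illustration, not PCRE2 code: advance_after_empty_match() is a hypo- |
| thetical helper, and it assumes one code unit per character (8-bit, |
| non-UTF mode). The crlf_is_newline flag stands for "CRLF is a valid |
| newline sequence" in the newline convention that is in force. |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: after an empty match at "pos"
   and a failed retry for a non-empty match there, return the position
   from which a global substitution would resume. Assumes one code unit
   per character and pos < length. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t pos, int crlf_is_newline)
{
    assert(subject != NULL && pos < length);
    if (crlf_is_newline && pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over CR LF as a unit */
    return pos + 1;       /* otherwise step over one character */
}
```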
| |
| PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output |
| buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- |
| ORY immediately. If this option is set, however, pcre2_substitute() |
| continues to go through the motions of matching and substituting (with- |
| out, of course, writing anything) in order to compute the size of buf- |
| fer that is needed. This value is passed back via the outlengthptr |
| variable, with the result of the function still being |
| PCRE2_ERROR_NOMEMORY. |
| |
| Passing a buffer size of zero is a permitted way of finding out how |
| much memory is needed for a given substitution. However, this does mean |
| that the entire operation is carried out twice. Depending on the appli- |
| cation, it may be more efficient to allocate a large buffer and free |
| the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER- |
| FLOW_LENGTH. |
| |
| PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups |
| that do not appear in the pattern to be treated as unset groups. This |
| option should be used with care, because it means that a typo in a |
| group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING |
| error. |
| |
| PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including |
| unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be |
| treated as empty strings when inserted as described above. If this |
| option is not set, an attempt to insert an unset group causes the |
| PCRE2_ERROR_UNSET error. This option does not influence the extended |
| substitution syntax described below. |
| |
| PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the |
| replacement string. Without this option, only the dollar character is |
| special, and only the group insertion forms listed above are valid. |
| When PCRE2_SUBSTITUTE_EXTENDED is set, two things change: |
| |
| Firstly, backslash in a replacement string is interpreted as an escape |
| character. The usual forms such as \n or \x{ddd} can be used to specify |
| particular character codes, and backslash followed by any non-alphanu- |
| meric character quotes that character. Extended quoting can be coded |
| using \Q...\E, exactly as in pattern strings. |
| |
| There are also four escape sequences for forcing the case of inserted |
| letters. The insertion mechanism has three states: no case forcing, |
| force upper case, and force lower case. The escape sequences change the |
| current state: \U and \L change to upper or lower case forcing, respec- |
| tively, and \E (when not terminating a \Q quoted sequence) reverts to |
| no case forcing. The sequences \u and \l force the next character (if |
| it is a letter) to upper or lower case, respectively, and then the |
| state automatically reverts to no case forcing. Case forcing applies to |
| all inserted characters, including those from captured groups and let- |
| ters within \Q...\E quoted sequences. |
| |
| Note that case forcing sequences such as \U...\E do not nest. For exam- |
| ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final |
| \E has no effect. |
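| |
| The non-nesting behaviour follows from there being a single state vari- |
| able. The sketch below models that state machine in standalone C. It is |
| not PCRE2's implementation: apply_case_forcing() is a hypothetical |
| helper, it is ASCII-only, it interprets only \U, \L, \E, \u and \l, and |
| it assumes that the one-shot forms revert to no case forcing after the |
| next character whether or not that character is a letter. |

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum force_state { FORCE_NONE, FORCE_UPPER, FORCE_LOWER,
                   FORCE_ONE_UPPER, FORCE_ONE_LOWER };

/* Hypothetical helper, not PCRE2 code: copies "in" to "out", interpreting
   only \U, \L, \E, \u and \l. ASCII only; any other backslash sequence is
   copied unchanged. "out" must have room for strlen(in) + 1 chars. */
void apply_case_forcing(const char *in, char *out)
{
    enum force_state state = FORCE_NONE;
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] != '\0') {
            switch (in[1]) {
            case 'U': state = FORCE_UPPER;     in += 2; continue;
            case 'L': state = FORCE_LOWER;     in += 2; continue;
            case 'E': state = FORCE_NONE;      in += 2; continue;
            case 'u': state = FORCE_ONE_UPPER; in += 2; continue;
            case 'l': state = FORCE_ONE_LOWER; in += 2; continue;
            default: break;   /* not a case escape: copy it literally */
            }
        }
        unsigned char c = (unsigned char)*in++;
        if (state == FORCE_UPPER || state == FORCE_ONE_UPPER)
            c = (unsigned char)toupper(c);
        else if (state == FORCE_LOWER || state == FORCE_ONE_LOWER)
            c = (unsigned char)tolower(c);
        if (state == FORCE_ONE_UPPER || state == FORCE_ONE_LOWER)
            state = FORCE_NONE;   /* one-shot forcing is used up */
        *out++ = (char)c;
    }
    *out = '\0';
}
```

| Because \L simply overwrites the state set by \U, the first \E already |
| returns to no case forcing, and the second \E changes nothing. |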
| |
| The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more |
| flexibility to group substitution. The syntax is similar to that used |
| by Bash: |
| |
| ${<n>:-<string>} |
| ${<n>:+<string1>:<string2>} |
| |
| As before, <n> may be a group number or a name. The first form speci- |
| fies a default value. If group <n> is set, its value is inserted; if |
| not, <string> is expanded and the result inserted. The second form |
| specifies strings that are expanded and inserted when group <n> is set |
| or unset, respectively. The first form is just a convenient shorthand |
| for |
| |
| ${<n>:+${<n>}:<string>} |
| |
| Backslash can be used to escape colons and closing curly brackets in |
| the replacement strings. A change of the case forcing state within a |
| replacement string remains in force afterwards, as shown in this |
| pcre2test example: |
| |
| /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo |
| body |
| 1: hello |
| somebody |
| 1: HELLO |
| |
| The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended |
| substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause |
| unknown groups in the extended syntax forms to be treated as unset. |
| |
| If successful, pcre2_substitute() returns the number of replacements |
| that were made. This may be zero if no matches were found, and is never |
| greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set. |
| |
| In the event of an error, a negative error code is returned. Except for |
| PCRE2_ERROR_NOMATCH (which is never returned), errors from |
| pcre2_match() are passed straight back. |
| |
| PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- |
| tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. |
| |
| PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- |
| ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) |
| when the simple (non-extended) syntax is used and PCRE2_SUBSTI- |
| TUTE_UNSET_EMPTY is not set. |
| |
| PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big |
| enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size |
| of buffer that is needed is returned via outlengthptr. Note that this |
| does not happen by default. |
| |
| PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in |
| the replacement string, with more particular errors being |
| PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), |
| PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket not found), |
| PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group substitu- |
| tion), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before |
| it started, which can happen if \K is used in an assertion). |
| |
| As for all PCRE2 errors, a text message that describes the error can be |
| obtained by calling pcre2_get_error_message(). |
| |
| |
| DUPLICATE SUBPATTERN NAMES |
| |
| int pcre2_substring_nametable_scan(const pcre2_code *code, |
| PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); |
| |
| When a pattern is compiled with the PCRE2_DUPNAMES option, names for |
| subpatterns are not required to be unique. Duplicate names are always |
| allowed for subpatterns with the same number, created by using the (?| |
| feature. Indeed, if such subpatterns are named, they are required to |
| use the same names. |
| |
| Normally, patterns with duplicate names are such that in any one match, |
| only one of the named subpatterns participates. An example is shown in |
| the pcre2pattern documentation. |
| |
| When duplicates are present, pcre2_substring_copy_byname() and |
| pcre2_substring_get_byname() return the first substring corresponding |
| to the given name that is set. Only if none are set is |
| PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() |
| function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are |
| duplicate names. |
| |
| If you want to get full details of all captured substrings for a given |
| name, you must use the pcre2_substring_nametable_scan() function. The |
| first argument is the compiled pattern, and the second is the name. If |
| the third and fourth arguments are NULL, the function returns a group |
| number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. |
| |
| When the third and fourth arguments are not NULL, they must be pointers |
| to variables that are updated by the function. After it has run, they |
| point to the first and last entries in the name-to-number table for the |
| given name, and the function returns the length of each entry in code |
| units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are |
| no entries for the given name. |
| |
| The format of the name table is described above in the section entitled |
| Information about a pattern. Given all the relevant entries for the |
| name, you can extract each of their numbers, and hence the captured |
| data. |
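| |
| Walking the entries between the two returned pointers might look like |
| the sketch below. It is an illustration under stated assumptions: col- |
| lect_group_numbers() is a hypothetical helper, and the code assumes the |
| 8-bit library, where each name table entry starts with the group number |
| in two bytes (most significant byte first) followed by the zero-termi- |
| nated name. The entry_size argument is the positive value returned by |
| pcre2_substring_nametable_scan(). |

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: stores the group number of each
   name table entry in [first, last] into "numbers" (at most "max_numbers"
   of them) and returns how many entries were seen. Assumes the 8-bit
   entry layout: two bytes of group number, then the name. */
size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                             int entry_size, int *numbers,
                             size_t max_numbers)
{
    size_t count = 0;
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);
    for (const uint8_t *entry = first; entry <= last; entry += entry_size) {
        if (count < max_numbers)
            numbers[count] = (entry[0] << 8) | entry[1];
        count++;
    }
    return count;
}
```

| Each number obtained this way can then be passed to the functions that |
| extract captured substrings by number. |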
| |
| |
| FINDING ALL POSSIBLE MATCHES AT ONE POSITION |
| |
| The traditional matching function uses a similar algorithm to Perl, |
| which stops when it finds the first match at a given point in the sub- |
| ject. If you want to find all possible matches, or the longest possible |
| match at a given position, consider using the alternative matching |
| function (see below) instead. If you cannot use the alternative func- |
| tion, you can kludge it up by making use of the callout facility, which |
| is described in the pcre2callout documentation. |
| |
| What you have to do is to insert a callout right at the end of the pat- |
| tern. When your callout function is called, extract and save the cur- |
| rent matched substring. Then return 1, which forces pcre2_match() to |
| backtrack and try other alternatives. Ultimately, when it runs out of |
| matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. |
| |
| |
| MATCHING A PATTERN: THE ALTERNATIVE FUNCTION |
| |
| int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, |
| PCRE2_SIZE length, PCRE2_SIZE startoffset, |
| uint32_t options, pcre2_match_data *match_data, |
| pcre2_match_context *mcontext, |
| int *workspace, PCRE2_SIZE wscount); |
| |
| The function pcre2_dfa_match() is called to match a subject string |
| against a compiled pattern, using a matching algorithm that scans the |
| subject string just once, and does not backtrack. This has different |
| characteristics to the normal algorithm, and is not compatible with |
| Perl. Some of the features of PCRE2 patterns are not supported. Never- |
| theless, there are times when this kind of matching can be useful. For |
| a discussion of the two matching algorithms, and a list of features |
| that pcre2_dfa_match() does not support, see the pcre2matching documen- |
| tation. |
| |
| The arguments for the pcre2_dfa_match() function are the same as for |
| pcre2_match(), plus two extras. The ovector within the match data block |
| is used in a different way, and this is described below. The other com- |
| mon arguments are used in the same way as for pcre2_match(), so their |
| description is not repeated here. |
| |
| The two additional arguments provide workspace for the function. The |
| workspace vector should contain at least 20 elements. It is used for |
| keeping track of multiple paths through the pattern tree. More |
| workspace is needed for patterns and subjects where there are a lot of |
| potential matches. |
| |
| Here is an example of a simple call to pcre2_dfa_match(): |
| |
| int wspace[20]; |
| pcre2_match_data *md = pcre2_match_data_create(4, NULL); |
| int rc = pcre2_dfa_match( |
| re, /* result of pcre2_compile() */ |
| "some string", /* the subject string */ |
| 11, /* the length of the subject string */ |
| 0, /* start at offset 0 in the subject */ |
| 0, /* default options */ |
| md, /* the match data block */ |
| NULL, /* a match context; NULL means use defaults */ |
| wspace, /* working space vector */ |
| 20); /* number of elements (NOT size in bytes) */ |
| |
| Option bits for pcre2_dfa_match() |
| |
| The unused bits of the options argument for pcre2_dfa_match() must be |
| zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_NOTBOL, |
| PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, |
| PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, PCRE2_PARTIAL_SOFT, |
| PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last four of |
| these are exactly the same as for pcre2_match(), so their description |
| is not repeated here. |
| |
| PCRE2_PARTIAL_HARD |
| PCRE2_PARTIAL_SOFT |
| |
| These have the same general effect as they do for pcre2_match(), but |
| the details are slightly different. When PCRE2_PARTIAL_HARD is set for |
| pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the |
| subject is reached and there is still at least one matching possibility |
| that requires additional characters. This happens even if some complete |
| matches have already been found. When PCRE2_PARTIAL_SOFT is set, the |
| return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL |
| if the end of the subject is reached, there have been no complete |
| matches, but there is still at least one matching possibility. The por- |
| tion of the string that was inspected when the longest partial match |
| was found is set as the first matching string in both cases. There is a |
| more detailed discussion of partial and multi-segment matching, with |
| examples, in the pcre2partial documentation. |
| |
| PCRE2_DFA_SHORTEST |
| |
| Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to |
| stop as soon as it has found one match. Because of the way the alterna- |
| tive algorithm works, this is necessarily the shortest possible match |
| at the first possible matching point in the subject string. |
| |
| PCRE2_DFA_RESTART |
| |
| When pcre2_dfa_match() returns a partial match, it is possible to call |
| it again, with additional subject characters, and have it continue with |
| the same match. The PCRE2_DFA_RESTART option requests this action; when |
| it is set, the workspace and wscount arguments must reference the same |
| vector as before because data about the match so far is left in them |
| after a partial match. There is more discussion of this facility in the |
| pcre2partial documentation. |
| |
| Successful returns from pcre2_dfa_match() |
| |
| When pcre2_dfa_match() succeeds, it may have matched more than one sub- |
| string in the subject. Note, however, that all the matches from one run |
| of the function start at the same point in the subject. The shorter |
| matches are all initial substrings of the longer matches. For example, |
| if the pattern |
| |
| <.*> |
| |
| is matched against the string |
| |
| This is <something> <something else> <something further> no more |
| |
| the three matched strings are |
| |
| <something> <something else> <something further> |
| <something> <something else> |
| <something> |
| |
| On success, the yield of the function is a number greater than zero, |
| which is the number of matched substrings. The offsets of the sub- |
| strings are returned in the ovector, and can be extracted by number in |
| the same way as for pcre2_match(), but the numbers bear no relation to |
| any capturing groups that may exist in the pattern, because DFA match- |
| ing does not support group capture. |
| |
| Calls to the convenience functions that extract substrings by name |
| return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used |
| after a DFA match. The convenience functions that extract substrings by |
| number never return PCRE2_ERROR_NOSUBSTRING, and the meanings of some |
| other errors are slightly different: |
| |
| PCRE2_ERROR_UNAVAILABLE |
| |
| The ovector is not big enough to include a slot for the given substring |
| number. |
| |
| PCRE2_ERROR_UNSET |
| |
| There is a slot in the ovector for this substring, but there were |
| insufficient matches to fill it. |
| |
| The matched strings are stored in the ovector in reverse order of |
| length; that is, the longest matching string is first. If there were |
| too many matches to fit into the ovector, the yield of the function is |
| zero, and the vector is filled with the longest matches. |
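| |
| Reading the results back out of the ovector can be sketched as follows. |
| This is an illustration, not part of the library: copy_dfa_match() is a |
| hypothetical helper that works on a plain array of offset pairs (in a |
| real program this would be the vector obtained from the match data |
| block), and it treats a return value of zero as "every pair in the |
| ovector is filled". |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2: copies DFA match number "index"
   (0 is the longest) into "out". "rc" is the value returned by
   pcre2_dfa_match() and "pairs" is the number of offset pairs the
   ovector can hold. Returns the match length, or -1 if unavailable. */
long copy_dfa_match(const char *subject, const size_t *ovector, int rc,
                    size_t pairs, size_t index, char *out, size_t outsize)
{
    size_t filled, start, len;
    assert(subject != NULL && ovector != NULL && out != NULL);
    if (rc < 0) return -1;                     /* matching failed */
    filled = (rc == 0) ? pairs : (size_t)rc;   /* 0: ovector was too small */
    if (index >= filled) return -1;
    start = ovector[2 * index];
    len = ovector[2 * index + 1] - start;
    if (len + 1 > outsize) return -1;          /* no room for the string */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```

| For the <.*> example above, the three offset pairs would be (8,56), |
| (8,36) and (8,19), so index 0 gives the longest string and index 2 |
| gives "<something>". |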
| |
| NOTE: PCRE2's "auto-possessification" optimization usually applies to |
| character repeats at the end of a pattern (as well as internally). For |
| example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA |
| matching, this means that only one possible match is found. If you |
| really do want multiple matches in such cases, either use an ungreedy |
| repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when |
| compiling. |
| |
| Error returns from pcre2_dfa_match() |
| |
| The pcre2_dfa_match() function returns a negative number when it fails. |
| Many of the errors are the same as for pcre2_match(), as described |
| above. There are in addition the following errors that are specific to |
| pcre2_dfa_match(): |
| |
| PCRE2_ERROR_DFA_UITEM |
| |
| This return is given if pcre2_dfa_match() encounters an item in the |
| pattern that it does not support, for instance, the use of \C in a UTF |
| mode or a back reference. |
| |
| PCRE2_ERROR_DFA_UCOND |
| |
| This return is given if pcre2_dfa_match() encounters a condition item |
| that uses a back reference for the condition, or a test for recursion |
| in a specific group. These are not supported. |
| |
| PCRE2_ERROR_DFA_WSSIZE |
| |
| This return is given if pcre2_dfa_match() runs out of space in the |
| workspace vector. |
| |
| PCRE2_ERROR_DFA_RECURSE |
| |
| When a recursive subpattern is processed, the matching function calls |
| itself recursively, using private memory for the ovector and workspace. |
| This error is given if the internal ovector is not large enough. This |
| should be extremely rare, as a vector of size 1000 is used. |
| |
| PCRE2_ERROR_DFA_BADRESTART |
| |
| When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option, |
| some plausibility checks are made on the contents of the workspace, |
| which should contain data about the previous partial match. If any of |
| these checks fail, this error is given. |
| |
| |
| SEE ALSO |
| |
| pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3), |
| pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2stack(3), |
| pcre2unicode(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 December 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3) |
| |
| |
| |
| NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| |
| BUILDING PCRE2 |
| |
| PCRE2 is distributed with a configure script that can be used to build |
| the library in Unix-like environments using the applications known as |
| Autotools. Also in the distribution are files to support building using |
| CMake instead of configure. The text file README contains general |
| information about building with Autotools (some of which is repeated |
| below), and also has some comments about building on various operating |
| systems. There is a lot more information about building PCRE2 without |
| using Autotools (including information about using CMake and building |
| "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should |
| consult this file as well as the README file if you are building in a |
| non-Unix-like environment. |
| |
| |
| PCRE2 BUILD-TIME OPTIONS |
| |
| The rest of this document describes the optional features of PCRE2 that |
| can be selected when the library is compiled. It assumes use of the |
| configure script, where the optional features are selected or dese- |
| lected by providing options to configure before running the make com- |
| mand. However, the same options can be selected in both Unix-like and |
| non-Unix-like environments if you are using CMake instead of configure |
| to build PCRE2. |
| |
| If you are not using Autotools or CMake, option selection can be done |
| by editing the config.h file, or by passing parameter settings to the |
| compiler, as described in NON-AUTOTOOLS-BUILD. |
| |
| The complete list of options for configure (which includes the standard |
| ones such as the selection of the installation directory) can be |
| obtained by running |
| |
| ./configure --help |
| |
| The following sections include descriptions of options whose names |
| begin with --enable or --disable. These settings specify changes to the |
| defaults for the configure command. Because of the way that configure |
| works, --enable and --disable always come in pairs, so the complemen- |
| tary option always exists as well, but as it specifies the default, it |
| is not described. |
| |
| |
| BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES |
| |
| By default, a library called libpcre2-8 is built, containing functions |
| that take string arguments contained in vectors of bytes, interpreted |
| either as single-byte characters, or UTF-8 strings. You can also build |
| two other libraries, called libpcre2-16 and libpcre2-32, which process |
| strings that are contained in vectors of 16-bit and 32-bit code units, |
| respectively. These can be interpreted either as single-unit characters |
| or UTF-16/UTF-32 strings. To build these additional libraries, add one |
| or both of the following to the configure command: |
| |
| --enable-pcre2-16 |
| --enable-pcre2-32 |
| |
| If you do not want the 8-bit library, add |
| |
| --disable-pcre2-8 |
| |
| as well. At least one of the three libraries must be built. Note that |
| the POSIX wrapper is for the 8-bit library only, and that pcre2grep is |
| an 8-bit program. Neither of these is built if you select only the |
| 16-bit or 32-bit libraries. |
| |
| |
| BUILDING SHARED AND STATIC LIBRARIES |
| |
| The Autotools PCRE2 building process uses libtool to build both shared |
| and static libraries by default. You can suppress an unwanted library |
| by adding one of |
| |
| --disable-shared |
| --disable-static |
| |
| to the configure command. |
| |
| |
| UNICODE AND UTF SUPPORT |
| |
| By default, PCRE2 is built with support for Unicode and UTF character |
| strings. To build it without Unicode support, add |
| |
| --disable-unicode |
| |
| to the configure command. This setting applies to all three libraries. |
| It is not possible to build one library with Unicode support, and |
| another without, in the same configuration. |
| |
| Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, |
| UTF-16 or UTF-32. To do that, applications that use the library can set |
| the PCRE2_UTF option when they call pcre2_compile() to compile a pat- |
| tern. Alternatively, patterns may be started with (*UTF) unless the |
| application has locked this out by setting PCRE2_NEVER_UTF. |
| |
| UTF support allows the libraries to process character code points up to |
| 0x10ffff in the strings that they handle. It also provides support for |
| accessing the Unicode properties of such characters, using pattern |
| escapes such as \P, \p, and \X. Only the general category properties |
| such as Lu and Nd are supported. Details are given in the pcre2pattern |
| documentation. |
| |
| Pattern escapes such as \d and \w do not by default make use of Unicode |
| properties. The application can request that they do by setting the |
| PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a |
| pattern may also request this by starting with (*UCP). |
| |
| |
| DISABLING THE USE OF \C |
| |
| The \C escape sequence, which matches a single code unit, even in a UTF |
| mode, can cause unpredictable behaviour because it may leave the cur- |
| rent matching point in the middle of a multi-code-unit character. The |
| application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C |
| option when calling pcre2_compile(). There is also a build-time option |
| |
| --enable-never-backslash-C |
| |
| (note the upper case C) which locks out the use of \C entirely. |
| |
| |
| JUST-IN-TIME COMPILER SUPPORT |
| |
| Just-in-time compiler support is included in the build by specifying |
| |
| --enable-jit |
| |
| This support is available only for certain hardware architectures. If |
| this option is set for an unsupported architecture, a building error |
| occurs. See the pcre2jit documentation for a discussion of JIT usage. |
| When JIT support is enabled, pcre2grep automatically makes use of it, |
| unless you add |
| |
| --disable-pcre2grep-jit |
| |
| to the "configure" command. |
| |
| |
| NEWLINE RECOGNITION |
| |
| By default, PCRE2 interprets the linefeed (LF) character as indicating |
| the end of a line. This is the normal newline character on Unix-like |
| systems. You can compile PCRE2 to use carriage return (CR) instead, by |
| adding |
| |
| --enable-newline-is-cr |
| |
| to the configure command. There is also an --enable-newline-is-lf |
| option, which explicitly specifies linefeed as the newline character. |
| |
| Alternatively, you can specify that line endings are to be indicated by |
| the two-character sequence CRLF (CR immediately followed by LF). If you |
| want this, add |
| |
| --enable-newline-is-crlf |
| |
| to the configure command. There is a fourth option, specified by |
| |
| --enable-newline-is-anycrlf |
| |
| which causes PCRE2 to recognize any of the three sequences CR, LF, or |
| CRLF as indicating a line ending. Finally, a fifth option, specified by |
| |
| --enable-newline-is-any |
| |
| causes PCRE2 to recognize any Unicode newline sequence. The Unicode |
| newline sequences are the three just mentioned, plus the single charac- |
| ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, |
| U+0085), LS (line separator, U+2028), and PS (paragraph separator, |
| U+2029). |
| |
| Whatever default line ending convention is selected when PCRE2 is built |
| can be overridden by applications that use the library. At build time |
| it is conventional to use the standard for your operating system. |
| |
| |
| WHAT \R MATCHES |
| |
| By default, the sequence \R in a pattern matches any Unicode newline |
| sequence, independently of what has been selected as the line ending |
| sequence. If you specify |
| |
| --enable-bsr-anycrlf |
| |
| the default is changed so that \R matches only CR, LF, or CRLF. What- |
| ever is selected when PCRE2 is built can be overridden by applications |
| that use the library. |
| |
| |
| HANDLING VERY LARGE PATTERNS |
| |
| Within a compiled pattern, offset values are used to point from one |
| part to another (for example, from an opening parenthesis to an alter- |
| nation metacharacter). By default, in the 8-bit and 16-bit libraries, |
| two-byte values are used for these offsets, leading to a maximum size |
| for a compiled pattern of around 64K code units. This is sufficient to |
| handle all but the most gigantic patterns. Nevertheless, some people do |
| want to process truly enormous patterns, so it is possible to compile |
| PCRE2 to use three-byte or four-byte offsets by adding a setting such |
| as |
| |
| --with-link-size=3 |
| |
| to the configure command. The value given must be 2, 3, or 4. For the |
| 16-bit library, a value of 3 is rounded up to 4. In these libraries, |
| using longer offsets slows down the operation of PCRE2 because it has |
| to load additional data when handling them. For the 32-bit library the |
| value is always 4 and cannot be overridden; the value of --with-link- |
| size is ignored. |
| |
| |
| AVOIDING EXCESSIVE STACK USAGE |
| |
| When matching with the pcre2_match() function, PCRE2 implements back- |
| tracking by making recursive calls to an internal function called |
| match(). In environments where the size of the stack is limited, this |
| can severely limit PCRE2's operation. (The Unix environment does not |
| usually suffer from this problem, but it may sometimes be necessary to |
| increase the maximum stack size. There is a discussion in the |
| pcre2stack documentation.) An alternative approach to recursion that |
| uses memory from the heap to remember data, instead of using recursive |
| function calls, has been implemented to work round the problem of lim- |
| ited stack size. If you want to build a version of PCRE2 that works |
| this way, add |
| |
| --disable-stack-for-recursion |
| |
| to the configure command. By default, the system functions malloc() and |
| free() are called to manage the heap memory that is required, but cus- |
| tom memory management functions can be called instead. PCRE2 runs |
| noticeably more slowly when built in this way. This option affects only |
| the pcre2_match() function; it is not relevant for pcre2_dfa_match(). |
| |
| |
| LIMITING PCRE2 RESOURCE USAGE |
| |
| Internally, PCRE2 has a function called match(), which it calls repeat- |
| edly (sometimes recursively) when matching a pattern with the |
| pcre2_match() function. By controlling the maximum number of times this |
| function may be called during a single matching operation, a limit can |
| be placed on the resources used by a single call to pcre2_match(). The |
| limit can be changed at run time, as described in the pcre2api documen- |
| tation. The default is 10 million, but this can be changed by adding a |
| setting such as |
| |
| --with-match-limit=500000 |
| |
| to the configure command. This setting has no effect on the |
| pcre2_dfa_match() matching function. |
| |
| In some environments it is desirable to limit the depth of recursive |
| calls of match() more strictly than the total number of calls, in order |
| to restrict the maximum amount of stack (or heap, if --disable-stack- |
| for-recursion is specified) that is used. A second limit controls this; |
| it defaults to the value that is set for --with-match-limit, which |
| imposes no additional constraints. However, you can set a lower limit |
| by adding, for example, |
| |
| --with-match-limit-recursion=10000 |
| |
| to the configure command. This value can also be overridden at run |
| time. |
| |
| |
| CREATING CHARACTER TABLES AT BUILD TIME |
| |
| PCRE2 uses fixed tables for processing characters whose code points are |
| less than 256. By default, PCRE2 is built with a set of tables that are |
| distributed in the file src/pcre2_chartables.c.dist. These tables are |
| for ASCII codes only. If you add |
| |
| --enable-rebuild-chartables |
| |
| to the configure command, the distributed tables are no longer used. |
| Instead, a program called dftables is compiled and run. This outputs |
| the source for a new set of tables, created in the default locale of your |
| C run-time system. (This method of replacing the tables does not work |
| if you are cross compiling, because dftables is run on the local host. |
| If you need to create alternative tables when cross compiling, you will |
| have to do so "by hand".) |
| |
| |
| USING EBCDIC CODE |
| |
| PCRE2 assumes by default that it will run in an environment where the |
| character code is ASCII or Unicode, which is a superset of ASCII. This |
| is the case for most computer operating systems. PCRE2 can, however, be |
| compiled to run in an 8-bit EBCDIC environment by adding |
| |
| --enable-ebcdic --disable-unicode |
| |
| to the configure command. This setting implies --enable-rebuild-charta- |
| bles. You should only use it if you know that you are in an EBCDIC |
| environment (for example, an IBM mainframe operating system). |
| |
| It is not possible to support both EBCDIC and UTF-8 codes in the same |
| version of the library. Consequently, --enable-unicode and --enable- |
| ebcdic are mutually exclusive. |
| |
| The EBCDIC character that corresponds to an ASCII LF is assumed to have |
| the value 0x15 by default. However, in some EBCDIC environments, 0x25 |
| is used. In such an environment you should use |
| |
| --enable-ebcdic-nl25 |
| |
| as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR |
| has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and |
| 0x25 is not chosen as LF is made to correspond to the Unicode NEL char- |
| acter (which, in Unicode, is 0x85). |
| |
| The options that select newline behaviour, such as |
| --enable-newline-is-cr, and equivalent run-time options, refer to these |
| character values in an EBCDIC environment. |
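| |
| Putting the options from this section together, a configure run for an |
| EBCDIC environment in which LF is 0x25 might look like the following |
| sketch. It uses only the options named above: |
| |
```shell
# Sketch: 8-bit EBCDIC build with 0x25 as the LF character.
# Unicode support is disabled because it cannot be combined with EBCDIC.
./configure --enable-ebcdic --enable-ebcdic-nl25 --disable-unicode
```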
| |
| |
| PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT |
| |
| By default, pcre2grep reads all files as plain text. You can build it |
| so that it recognizes files whose names end in .gz or .bz2, and reads |
| them with libz or libbz2, respectively, by adding one or both of |
| |
| --enable-pcre2grep-libz |
| --enable-pcre2grep-libbz2 |
| |
| to the configure command. These options naturally require that the rel- |
| evant libraries are installed on your system. Configuration will fail |
| if they are not. |
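| |
| For example, a build with both kinds of compressed-file support could |
| be configured as sketched below. The pattern and file names in the |
| comments are hypothetical: |
| |
```shell
# Requires libz and libbz2 to be installed; configuration fails otherwise.
./configure --enable-pcre2grep-libz --enable-pcre2grep-libbz2
make

# After the build, compressed files can be searched directly, for example:
#   ./pcre2grep 'error' logfile.gz
#   ./pcre2grep 'error' logfile.bz2
```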
| |
| |
| PCRE2GREP BUFFER SIZE |
| |
| pcre2grep uses an internal buffer to hold a "window" on the file it is |
| scanning, in order to be able to output "before" and "after" lines when |
| it finds a match. The size of the buffer is controlled by a parameter |
| whose default value is 20K. The buffer itself is three times this size, |
| but because of the way it is used for holding "before" lines, the long- |
| est line that is guaranteed to be processable is the parameter size. |
| You can change the default parameter value by adding, for example, |
| |
| --with-pcre2grep-bufsize=50K |
| |
| to the configure command. The caller of pcre2grep can override this |
| value by using --buffer-size on the command line. |
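| |
| The relationship between the parameter and the memory actually used can |
| be sketched with a small shell function. The function name is |
| hypothetical, and the sketch assumes that "K" means units of 1024 |
| bytes: |
| |
```shell
# Hypothetical helper: given the buffer parameter in K (assumed to be
# units of 1024 bytes), print the size in bytes of the whole buffer,
# which is three times the parameter size.
pcre2grep_buffer_bytes() {
  echo $(( 3 * $1 * 1024 ))
}

# The default parameter of 20K therefore corresponds to a 60K buffer:
#   pcre2grep_buffer_bytes 20
# A run-time override with an explicit byte count might look like this
# (pattern and file name are placeholders):
#   pcre2grep --buffer-size=51200 'pattern' file
```
| |
| Note that the longest line guaranteed to be processable is still the |
| parameter size, not the size of the whole buffer. |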
| |
| |
| PCRE2TEST OPTION FOR LIBREADLINE SUPPORT |
| |
| If you add one of |
| |
| --enable-pcre2test-libreadline |
| --enable-pcre2test-libedit |
| |
| to the configure command, pcre2test is linked with the libreadline |
| or libedit library, respectively, and when its input is from a terminal, |
| it reads it using the readline() function. This provides line-editing |
| and history facilities. Note that libreadline is GPL-licensed, so if |
| you distribute a binary of pcre2test linked in this way, there may be |
| licensing issues. These can be avoided by linking instead with libedit, |
| which has a BSD licence. |
| |
| Setting --enable-pcre2test-libreadline causes the -lreadline option to |
| be added to the pcre2test build. In many operating environments with a |
| system-installed readline library this is sufficient. However, in some |
| environments (e.g. if an unmodified distribution version of readline is |
| in use), some extra configuration may be necessary. The INSTALL file |
| for libreadline says this: |
| |
| "Readline uses the termcap functions, but does not link with |
| the termcap or curses library itself, allowing applications |
| which link with readline the to choose an appropriate library." |
| |
| If your environment has not been set up so that an appropriate library |
| is automatically included, you may need to add something like |
| |
| LIBS="-lncurses" |
| |
| immediately before the configure command. |
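| |
| Written out in full, such a command might look like the sketch below. |
| It assumes that ncurses is the appropriate library on your system; the |
| -l prefix is the usual linker syntax for naming a library: |
| |
```shell
# Sketch: pass the extra library to configure through the LIBS variable.
# Replace -lncurses with -ltermcap or similar if that is what your system uses.
LIBS="-lncurses" ./configure --enable-pcre2test-libreadline
```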
| |
| |
| INCLUDING DEBUGGING CODE |
| |
| If you add |
| |
| --enable-debug |
| |
| to the configure command, additional debugging code is included in the |
| build. This feature is intended for use by the PCRE2 maintainers. |
| |
| |
| DEBUGGING WITH VALGRIND SUPPORT |
| |
| If you add |
| |
| --enable-valgrind |
| |
| to the configure command, PCRE2 will use valgrind annotations to mark |
| certain memory regions as unaddressable. This allows valgrind to detect |
| invalid memory accesses, and is mostly useful for debugging PCRE2 |
| itself. |
| |
| |
| CODE COVERAGE REPORTING |
| |
| If your C compiler is gcc, you can build a version of PCRE2 that can |
| generate a code coverage report for its test suite. To enable this, you |
| must install lcov version 1.6 or above. Then specify |
| |
| --enable-coverage |
| |
| to the configure command and build PCRE2 in the usual way. |
| |
| Note that using ccache (a caching C compiler) is incompatible with code |
| coverage reporting. If you have configured ccache to run automatically |
| on your system, you must set the environment variable |
| |
| CCACHE_DISABLE=1 |
| |
| before running make to build PCRE2, so that ccache is not used. |
| |
| When --enable-coverage is used, the following additional targets are |
| added to the Makefile: |
| |
| make coverage |
| |
| This creates a fresh coverage report for the PCRE2 test suite. It is |
| equivalent to running "make coverage-reset", "make coverage-baseline", |
| "make check", and then "make coverage-report". |
| |
| make coverage-reset |
| |
| This zeroes the coverage counters, but does nothing else. |
| |
| make coverage-baseline |
| |
| This captures baseline coverage information. |
| |
| make coverage-report |
| |
| This creates the coverage report. |
| |
| make coverage-clean-report |
| |
| This removes the generated coverage report without cleaning the cover- |
| age data itself. |
| |
| make coverage-clean-data |
| |
| This removes the captured coverage data without removing the coverage |
| files created at compile time (*.gcno). |
| |
| make coverage-clean |
| |
| This cleans all coverage data including the generated coverage report. |
| For more information about code coverage, see the gcov and lcov docu- |
| mentation. |
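| |
| Taken together, a typical coverage session might look like the |
| following sketch. It uses only the options and targets described above |
| and assumes that gcc and lcov are already installed: |
| |
```shell
# Keep ccache out of the way for the whole session.
export CCACHE_DISABLE=1

# Configure and build with coverage instrumentation.
./configure --enable-coverage
make

# Reset counters, capture a baseline, run the tests, and build the report.
make coverage

# Later, remove the report and the captured coverage data.
make coverage-clean
```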
| |
| |
| SEE ALSO |
| |
| pcre2api(3), pcre2-config(3). |
| |
| |
| AUTHOR |
| |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| |
| |
| REVISION |
| |
| Last updated: 16 October 2015 |
| Copyright (c) 1997-2015 University of Cambridge. |
| ------------------------------------------------------------------------------ |
| |
| |
| PCRE2CALLOUT(3) Library Functions Manual PCRE2CALL
|