5 Lexical conventions [lex]

5.1 Separate translation [lex.separate]

The text of the program is kept in units called source files in this document.
A source file together with all the headers and source files included via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion preprocessing directives, is called a translation unit.
[Note 1:
A C++ program need not all be translated at the same time.
— end note]
[Note 2:
Previously translated translation units and instantiation units can be preserved individually or in libraries.
The separate translation units of a program communicate ([basic.link]) by (for example) calls to functions whose identifiers have external or module linkage, manipulation of objects whose identifiers have external or module linkage, or manipulation of data files.
Translation units can be separately translated and then later linked to produce an executable program.
— end note]

5.2 Phases of translation [lex.phases]

The precedence among the syntax rules of translation is specified by the following phases.9
1.
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary.
The set of physical source file characters accepted is implementation-defined.
Any source file character not in the basic source character set is replaced by the universal-character-name that designates that character.
An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted ([lex.pptoken]) in a raw string literal.
2.
Each sequence of a backslash character (\) immediately followed by zero or more whitespace characters other than new-line followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part of such a splice.
Except for splices reverted in a raw string literal, if a splice results in a character sequence that matches the syntax of a universal-character-name, the behavior is undefined.
A source file that is not empty and that does not end in a new-line character, or that ends in a splice, shall be processed as if an additional new-line character were appended to the file.
3.
The source file is decomposed into preprocessing tokens ([lex.pptoken]) and sequences of whitespace characters (including comments).
A source file shall not end in a partial preprocessing token or in a partial comment.10
Each comment is replaced by one space character.
New-line characters are retained.
Whether each nonempty sequence of whitespace characters other than new-line is retained or replaced by one space character is unspecified.
The process of dividing a source file's characters into preprocessing tokens is context-dependent.
[Example 1:
See the handling of < within a #include preprocessing directive.
— end example]
4.
Preprocessing directives are executed, macro invocations are expanded, and _­Pragma unary operator expressions are executed.
If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation, the behavior is undefined.
A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
All preprocessing directives are then deleted.
5.
Each basic-c-char, basic-s-char, and r-char in a character-literal or a string-literal, as well as each escape-sequence and universal-character-name in a character-literal or a non-raw string literal, is encoded in the literal's associated character encoding as specified in [lex.ccon] and [lex.string].
6.
Adjacent string-literals are concatenated and a null character is appended to the result as specified in [lex.string].
7.
Whitespace characters separating tokens are no longer significant.
Each preprocessing token is converted into a token ([lex.token]).
The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
[Note 1:
The process of analyzing and translating the tokens can occasionally result in one token being replaced by a sequence of other tokens ([temp.names]).
— end note]
It is implementation-defined whether the sources for module units and header units on which the current translation unit has an interface dependency ([module.unit], [module.import]) are required to be available.
[Note 2:
Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation.
The description is conceptual only, and does not specify any particular implementation.
— end note]
8.
Translated translation units and instantiation units are combined as follows:
[Note 3:
Some or all of these can be supplied from a library.
— end note]
Each translated translation unit is examined to produce a list of required instantiations.
[Note 4:
This can include instantiations which have been explicitly requested ([temp.explicit]).
— end note]
The definitions of the required templates are located.
It is implementation-defined whether the source of the translation units containing these definitions is required to be available.
[Note 5:
An implementation can choose to encode sufficient information into the translated translation unit so as to ensure the source is not required here.
— end note]
All the required instantiations are performed to produce instantiation units.
[Note 6:
These are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.
— end note]
The program is ill-formed if any instantiation fails.
9.
All external entity references are resolved.
Library components are linked to satisfy external references to entities not defined in the current translation.
All such translator output is collected into a program image which contains information needed for execution in its execution environment.
9)9)
Implementations behave as if these separate phases occur, although in practice different phases can be folded together.
10)10)
A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >.
A partial comment would arise from a source file ending with an unclosed /* comment.

5.3 Character sets [lex.charset]

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:11
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
The universal-character-name construct provides a way to name other characters.
A universal-character-name designates the character in ISO/IEC 10646 (if any) whose code point is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name.
The program is ill-formed if that number is not a code point or if it is a surrogate code point.
Noncharacter code points and reserved code points are considered to designate separate characters distinct from any ISO/IEC 10646 character.
If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds to a control character or to a character in the basic source character set, the program is ill-formed.12
[Note 1:
ISO/IEC 10646 code points are integers in the range (hexadecimal).
A surrogate code point is a value in the range (hexadecimal).
A control character is a character whose code point is in either of the ranges or (hexadecimal).
— end note]
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0.
For each basic execution character set, the values of the members shall be non-negative and distinct from one another.
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively.
The values of the members of the execution character sets and the sets of additional members are locale-specific.
11)11)
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set.
However, the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, and therefore implementations must document how the basic source characters are represented in source files.
12)12)
A sequence of characters resembling a universal-character-name in an r-char-sequence ([lex.string]) does not form a universal-character-name.

5.4 Preprocessing tokens [lex.pptoken]

preprocessing-token:
header-name
import-keyword
module-keyword
export-keyword
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each universal-character-name that cannot be one of the above
each non-whitespace character that cannot be one of the above
Each preprocessing token that is converted to a token shall have the lexical form of a keyword, an identifier, a literal, or an operator or punctuator.
A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6.
The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single universal-character-names and non-whitespace characters that do not lexically match the other preprocessing token categories.
If a single universal-character-name does not match any of the other preprocessing token categories, the program is ill-formed.
If a ' or a " character matches the last category, the behavior is undefined.
Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both.
As described in [cpp], in certain circumstances during translation phase 4, whitespace (or the absence thereof) serves as more than preprocessing token separation.
Whitespace can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
If the input stream has been parsed into preprocessing tokens up to a given character:
  • If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal.
    Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (universal-character-names and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.
    The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern
  • Otherwise, if the next three characters are <​::​ and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:.
  • Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name ([lex.header]) is only formed
[Example 1: #define R "x" const char* s = R"y"; // ill-formed raw string, not "x" "y" — end example]
The import-keyword is produced by processing an import directive ([cpp.import]), the module-keyword is produced by preprocessing a module directive ([cpp.module]), and the export-keyword is produced by preprocessing either of the previous two directives.
[Note 1:
None has any observable spelling.
— end note]
[Example 2:
The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo can produce a valid expression (for example, if foo is a macro defined as 1).
Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating-point-literal token), whether or not E is a macro name.
— end example]
[Example 3:
The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y can yield a correct expression.
— end example]

5.5 Alternative tokens [lex.digraph]

Alternative token representations are provided for some operators and punctuators.13
In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.14
The set of alternative tokens is defined in Table 1.
Table 1: Alternative tokens [tab:lex.digraph]
Alternative
Primary
Alternative
Primary
Alternative
Primary
<%
{
and
&&
and_­eq
&=
%>
}
bitor
|
or_­eq
|=
<:
[
or
||
xor_­eq
^=
:>
]
xor
^
not
!
%:
#
compl
~
not_­eq
!=
%:%:
##
bitand
&
13)13)
These include “digraphs” and additional reserved words.
The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters.
Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
14)14)
Thus the “stringized” values ([cpp.stringize]) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.

5.6 Tokens [lex.token]

There are five kinds of tokens: identifiers, keywords, literals,15 operators, and other separators.
Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.
[Note 1:
Some whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.
— end note]
15)15)
Literals include strings and character and numeric literals.

5.7 Comments [lex.comment]

The characters /* start a comment, which terminates with the characters */.
These comments do not nest.
The characters // start a comment, which terminates immediately before the next new-line character.
If there is a form-feed or a vertical-tab character in such a comment, only whitespace characters shall appear between it and the new-line that terminates the comment; no diagnostic is required.
[Note 1:
The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters.
Similarly, the comment characters // and /* have no special meaning within a /* comment.
— end note]

5.9 Preprocessing numbers [lex.ppnumber]

Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]).
A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.

5.10 Identifiers [lex.name]

nondigit: one of
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _
digit: one of
0 1 2 3 4 5 6 7 8 9
The character classes XID_Start and XID_Continue are Derived Core Properties as described by UAX #44.17
The program is ill-formed if an identifier does not conform to Normalization Form C as specified in ISO/IEC 10646.
[Note 1:
Upper- and lower-case letters are considered different for all identifiers.
— end note]
[Note 2:
In translation phase 4, identifier also includes those preprocessing-tokens ([lex.pptoken]) differentiated as keywords ([lex.key]) in the later translation phase 7 ([lex.token]).
— end note]
The identifiers in Table 2 have a special meaning when appearing in a certain context.
When referred to in the grammar, these identifiers are used explicitly rather than using the identifier grammar production.
Unless otherwise specified, any ambiguity as to whether a given identifier has a special meaning is resolved to interpret the token as a regular identifier.
Table 2: Identifiers with special meaning [tab:lex.name.special]
final
import
module
override
In addition, some identifiers are reserved for use by C++ implementations and shall not be used otherwise; no diagnostic is required.
  • Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
  • Each identifier that begins with an underscore is reserved to the implementation for use as a name in the global namespace.
17)17)
On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name can be used in forming valid external identifiers.
For example, some otherwise unused character or sequence of characters can be used to encode the \u in a universal-character-name.
Extended characters can produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers.

5.11 Keywords [lex.key]

keyword:
any identifier listed in Table 3
import-keyword
module-keyword
export-keyword
The identifiers shown in Table 3 are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7) except in an attribute-token ([dcl.attr.grammar]).
[Note 1:
The register keyword is unused but is reserved for future use.
— end note]
Table 3: Keywords [tab:lex.key]
alignas
constinit
false
public
true
alignof
const_­cast
float
register
try
asm
continue
for
reinterpret_­cast
typedef
auto
co_­await
friend
requires
typeid
bool
co_­return
goto
return
typename
break
co_­yield
if
short
union
case
decltype
inline
signed
unsigned
catch
default
int
sizeof
using
char
delete
long
static
virtual
char8_­t
do
mutable
static_­assert
void
char16_­t
double
namespace
static_­cast
volatile
char32_­t
dynamic_­cast
new
struct
wchar_­t
class
else
noexcept
switch
while
concept
enum
nullptr
template
const
explicit
operator
this
consteval
export
private
thread_­local
constexpr
extern
protected
throw
Furthermore, the alternative representations shown in Table 4 for certain operators and punctuators ([lex.digraph]) are reserved and shall not be used otherwise.
Table 4: Alternative representations [tab:lex.key.digraph]
and
and_­eq
bitand
bitor
compl
not
not_­eq
or
or_­eq
xor
xor_­eq

5.12 Operators and punctuators [lex.operators]

The lexical representation of C++ programs includes a number of preprocessing tokens that are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:
preprocessing-operator: one of
# ## %: %:%:
operator-or-punctuator: one of
{ } [ ] ( )
<: :> <% %> ; : ...
? :: . .* -> ->* ~
! + - * / % ^ & |
= += -= *= /= %= ^= &= |=
== != < > <= >= <=> && ||
<< >> <<= >>= ++ -- ,
and or xor not bitand bitor compl
and_­eq or_­eq xor_­eq not_­eq
Each operator-or-punctuator is converted to a single token in translation phase 7.

5.13 Literals [lex.literal]

5.13.1 Kinds of literals [lex.literal.kinds]

18)18)
The term “literal” generally designates, in this document, those tokens that are called “constants” in ISO C.

5.13.2 Integer literals [lex.icon]

binary-digit: one of
0 1
octal-digit: one of
0 1 2 3 4 5 6 7
nonzero-digit: one of
1 2 3 4 5 6 7 8 9
hexadecimal-prefix: one of
0x 0X
hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
unsigned-suffix: one of
u U
long-suffix: one of
l L
long-long-suffix: one of
ll LL
size-suffix: one of
z Z
In an integer-literal, the sequence of binary-digits, octal-digits, digits, or hexadecimal-digits is interpreted as a base N integer as shown in table Table 5; the lexically first digit of the sequence of digits is the most significant.
[Note 1:
The prefix and any optional separating single quotes are ignored when determining the value.
— end note]
The hexadecimal-digits a through f and A through F have decimal values ten through fifteen.
[Example 1:
The number twelve can be written 12, 014, 0XC, or 0b1100.
The integer-literals 1048576, 1'048'576, 0X100000, 0x10'0000, and 0'004'000'000 all have the same value.
— end example]
The type of an integer-literal is the first type in the list in Table 6 corresponding to its optional integer-suffix in which its value can be represented.
An integer-literal is a prvalue.
Table 6: Types of integer-literals[tab:lex.icon.type]
none
int
int
long int
unsigned int
long long int
long int
unsigned long int
long long int
unsigned long long int
u or U
unsigned int
unsigned int
unsigned long int
unsigned long int
unsigned long long int
unsigned long long int
l or L
long int
long int
long long int
unsigned long int
long long int
unsigned long long int
Both u or U
unsigned long int
unsigned long int
and l or L
unsigned long long int
unsigned long long int
ll or LL
long long int
long long int
unsigned long long int
Both u or U
unsigned long long int
unsigned long long int
and ll or LL
z or Z
the signed integer type corresponding
the signed integer type
  to std​::​size_­t ([support.types.layout])
  corresponding to std​::​size_­t
std​::​size_­t
Both u or U
std​::​size_­t
std​::​size_­t
and z or Z
If an integer-literal cannot be represented by any type in its list and an extended integer type ([basic.fundamental]) can represent its value, it may have that extended integer type.
If all of the types in the list for the integer-literal are signed, the extended integer type shall be signed.
If all of the types in the list for the integer-literal are unsigned, the extended integer type shall be unsigned.
If the list contains both signed and unsigned types, the extended integer type may be signed or unsigned.
A program is ill-formed if one of its translation units contains an integer-literal that cannot be represented by any of the allowed types.

5.13.3 Character literals [lex.ccon]

encoding-prefix: one of
u8  u  U  L
basic-c-char:
any member of the basic source character set except the single-quote ', backslash \, or new-line character
simple-escape-sequence-char: one of
' " ? \ a b f n r t v
conditional-escape-sequence-char:
any member of the basic source character set that is not an octal-digit, a simple-escape-sequence-char, or the characters u, U, or x
A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.
A multicharacter literal is a character-literal whose c-char-sequence consists of more than one c-char.
The encoding-prefix of a non-encodable character literal or a multicharacter literal shall be absent or L.
Such character-literals are conditionally-supported.
The kind of a character-literal, its type, and its associated character encoding are determined by its encoding-prefix and its c-char-sequence as defined by Table 7.
The special cases for non-encodable character literals and multicharacter literals take precedence over their respective base kinds.
[Note 1:
The associated character encoding for ordinary and wide character literals determines encodability, but does not determine the value of non-encodable ordinary or wide character literals or ordinary or wide multicharacter literals.
The examples in Table 7 for non-encodable ordinary and wide character literals assume that the specified character lacks representation in the execution character set or execution wide-character set, respectively, or that encoding it would require more than one code unit.
— end note]
Table 7: Character literals [tab:lex.ccon.literal]
Encoding
Kind
Type
Associated char-
Example
prefix
acter encoding
none
char
encoding of
'v'
non-encodable ordinary character literal
int
the execution
'\U0001F525'
ordinary multicharacter literal
int
character set
'abcd'
L
wchar_­t
encoding of
L'w'
non-encodable wide character literal
wchar_­t
the execution
L'\U0001F32A'
wide multicharacter literal
wchar_­t
wide-character set
L'abcd'
u8
char8_­t
UTF-8
u8'x'
u
char16_­t
UTF-16
u'y'
U
char32_­t
UTF-32
U'z'
In translation phase 4, the value of a character-literal is determined using the range of representable values of the character-literal's type in translation phase 7.
A non-encodable character literal or a multicharacter literal has an implementation-defined value.
The value of any other kind of character-literal is determined as follows:
The character specified by a simple-escape-sequence is specified in Table 8.
[Note 3:
Using an escape sequence for a question mark is supported for compatibility with ISO C++ 2014 and ISO C.
— end note]
Table 8: Simple escape sequences [tab:lex.ccon.esc]
new-line
NL(LF)
\n
horizontal tab
HT
\t
vertical tab
VT
\v
backspace
BS
\b
carriage return
CR
\r
form feed
FF
\f
alert
BEL
\a
backslash
\
\\
question mark
?
\?
single quote
'
\'
double quote
"
\"

5.13.4 Floating-point literals [lex.fcon]

sign: one of
+ -
floating-point-suffix: one of
f l F L
The type of a floating-point-literal is determined by its floating-point-suffix as specified in Table 9.
Table 9: Types of floating-point-literals[tab:lex.fcon.type]
type
none
double
f or F
float
l or L
long double
In the significand, the sequence of digits or hexadecimal-digits and optional period are interpreted as a base N real number s, where N is 10 for a decimal-floating-point-literal and 16 for a hexadecimal-floating-point-literal.
[Note 1:
Any optional separating single quotes are ignored when determining the value.
— end note]
If an exponent-part or binary-exponent-part is present, the exponent e of the floating-point-literal is the result of interpreting the sequence of an optional sign and the digits as a base 10 integer.
Otherwise, the exponent e is 0.
The scaled value of the literal is for a decimal-floating-point-literal and for a hexadecimal-floating-point-literal.
[Example 1:
The floating-point-literals 49.625 and 0xC.68p+2 have the same value.
The floating-point-literals 1.602'176'565e-19 and 1.602176565e-19 have the same value.
— end example]
If the scaled value is not in the range of representable values for its type, the program is ill-formed.
Otherwise, the value of a floating-point-literal is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.

5.13.5 String literals [lex.string]

basic-s-char:
any member of the basic source character set except the double-quote ", backslash \, or new-line character
r-char:
any member of the source character set, except a right parenthesis ) followed by
   the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char:
any member of the basic source character set except:
   space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
   representing horizontal tab, vertical tab, form feed, and newline.
The kind of a string-literal, its type, and its associated character encoding are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 10 where n is the number of encoded code units as described below.
Table 10: String literals [tab:lex.string.literal]
Encoding
Kind
Type
Associated
Examples
prefix
character encoding
none
array of n
const char
encoding of the execution character set
"ordinary string"
R"(ordinary raw string)"
L
array of n
const wchar_­t
encoding of the execution wide-character set
L"wide string"
LR"w(wide raw string)w"
u8
array of n
const char8_­t
UTF-8
u8"UTF-8 string"
u8R"x(UTF-8 raw string)x"
u
array of n
const char16_­t
UTF-16
u"UTF-16 string"
uR"y(UTF-16 raw string)y"
U
array of n
const char32_­t
UTF-32
U"UTF-32 string"
UR"z(UTF-32 raw string)z"
A string-literal that has an R in the prefix is a raw string literal.
The d-char-sequence serves as a delimiter.
The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.
[Note 1:
The characters '(' and ')' are permitted in a raw-string.
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
— end note]
[Note 2:
A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.
Assuming no whitespace at the beginning of lines in the following example, the assert will succeed: const char* p = R"(a\ b c)"; assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note]
[Example 1:
The raw string R"a( )\ a" )a" is equivalent to "\n)\\\na\"\n".
The raw string R"(x = "\"y\"")" is equivalent to "x = \"\\\"y\\\"\"".
— end example]
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.
If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix.
If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.
Any other concatenations are ill-formed.
[Note 3:
This concatenation is an interpretation, not a conversion.
Because the interpretation happens in translation phase 6 (after the string literal contents have been encoded in the string-literal's associated character encoding), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.
— end note]
Table 11 has some examples of valid concatenations.
Table 11: String literal concatenations [tab:lex.string.concat]
Source
Means
Source
Means
Source
Means
u"a"
u"b"
u"ab"
U"a"
U"b"
U"ab"
L"a"
L"b"
L"ab"
u"a"
"b"
u"ab"
U"a"
"b"
U"ab"
L"a"
"b"
L"ab"
"a"
u"b"
u"ab"
"a"
U"b"
U"ab"
"a"
L"b"
L"ab"
Characters in concatenated strings are kept distinct.
[Example 2:
"\xA" "B" contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
— end example]
In translation phase 6 ([lex.phases]), after adjacent string-literals are concatenated, a null character is appended to the result.
Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]).
Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
[Note 4:
The effect of attempting to modify a string literal object is undefined.
— end note]
String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (for a non-raw string literal) and r-chars (for a raw string literal) in order as follows:
  • The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding.
    If a character lacks representation in the associated character encoding, then:
    When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.
    [Note 5:
    The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.
    — end note]
  • Each numeric-escape-sequence ([lex.ccon]) that specifies an integer value v contributes a single code unit with a value as follows:
    When encoding a stateful character encoding, these sequences should have no effect on encoding state.
  • Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence.
    When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.

5.13.6 Boolean literals [lex.bool]

boolean-literal:
false
true
The Boolean literals are the keywords false and true.
Such literals are prvalues and have type bool.

5.13.7 Pointer literals [lex.nullptr]

The pointer literal is the keyword nullptr.
It is a prvalue of type std​::​nullptr_­t.
[Note 1:
std​::​nullptr_­t is a distinct type that is neither a pointer type nor a pointer-to-member type; rather, a prvalue of this type is a null pointer constant and can be converted to a null pointer value or null member pointer value.
— end note]

5.13.8 User-defined literals [lex.ext]

If a token matches both user-defined-literal and another literal kind, it is treated as the latter.
[Example 1:
123_­km is a user-defined-literal, but 12LL is an integer-literal.
— end example]
The syntactic non-terminal preceding the ud-suffix in a user-defined-literal is taken to be the longest sequence of characters that could match that non-terminal.
A user-defined-literal is treated as a call to a literal operator or literal operator template ([over.literal]).
To determine the form of this call for a given user-defined-literal L with ud-suffix X, first let S be the set of declarations found by unqualified lookup for the literal-operator-id whose literal suffix identifier is X ([basic.lookup.unqual]).
S shall not be empty.
If L is a user-defined-integer-literal, let n be the literal without its ud-suffix.
If S contains a literal operator with parameter type unsigned long long, the literal L is treated as a call of the form operator "" X(nULL)
Otherwise, S shall contain a raw literal operator or a numeric literal operator template ([over.literal]) but not both.
If S contains a raw literal operator, the literal L is treated as a call of the form operator "" X("n")
Otherwise (S contains a numeric literal operator template), L is treated as a call of the form operator "" X<'', '', ... ''>() where n is the source character sequence .
[Note 1:
The sequence can only contain characters from the basic source character set.
— end note]
If L is a user-defined-floating-point-literal, let f be the literal without its ud-suffix.
If S contains a literal operator with parameter type long double, the literal L is treated as a call of the form operator "" X(fL)
Otherwise, S shall contain a raw literal operator or a numeric literal operator template ([over.literal]) but not both.
If S contains a raw literal operator, the literal L is treated as a call of the form operator "" X("f")
Otherwise (S contains a numeric literal operator template), L is treated as a call of the form operator "" X<'', '', ... ''>() where f is the source character sequence .
[Note 2:
The sequence can only contain characters from the basic source character set.
— end note]
If L is a user-defined-string-literal, let str be the literal without its ud-suffix and let len be the number of code units in str (i.e., its length excluding the terminating null character).
If S contains a literal operator template with a non-type template parameter for which str is a well-formed template-argument, the literal L is treated as a call of the form operator "" X<str>()
Otherwise, the literal L is treated as a call of the form operator "" X(str, len)
If L is a user-defined-character-literal, let ch be the literal without its ud-suffix.
S shall contain a literal operator whose only parameter has the type of ch and the literal L is treated as a call of the form operator "" X(ch)
[Example 2: long double operator "" _w(long double); std::string operator "" _w(const char16_t*, std::size_t); unsigned operator "" _w(const char*); int main() { 1.2_w; // calls operator "" _­w(1.2L) u"one"_w; // calls operator "" _­w(u"one", 3) 12_w; // calls operator "" _­w("12") "two"_w; // error: no applicable literal operator } — end example]
In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated and user-defined-string-literals are considered string-literals for that purpose.
During concatenation, ud-suffixes are removed and ignored and the concatenation process occurs as described in [lex.string].
At the end of phase 6, if a string-literal is the result of a concatenation involving at least one user-defined-string-literal, all the participating user-defined-string-literals shall have the same ud-suffix and that suffix is applied to the result of the concatenation.
[Example 3: int main() { L"A" "B" "C"_x; // OK: same as L"ABC"_­x "P"_x "Q" "R"_y; // error: two different ud-suffixes } — end example]