5 Lexical conventions [lex]

5.13 Literals [lex.literal]

5.13.5 String literals [lex.string]

basic-s-char:
any member of the basic source character set except the double-quote ", backslash \, or new-line character
r-char:
any member of the source character set, except a right parenthesis ) followed by
   the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char:
any member of the basic source character set except:
   space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
   representing horizontal tab, vertical tab, form feed, and newline.
The kind of a string-literal, its type, and its associated character encoding are determined by its encoding prefix and sequence of s-chars or r-chars as defined by Table 12 where n is the number of encoded code units as described below.
Table 12: String literals [tab:lex.string.literal]
Encoding
Kind
Type
Associated
Examples
prefix
character encoding
none
array of n
const char
encoding of the execution character set
"ordinary string"
R"(ordinary raw string)"
L
array of n
const wchar_­t
encoding of the execution wide-character set
L"wide string"
LR"w(wide raw string)w"
u8
array of n
const char8_­t
UTF-8
u8"UTF-8 string"
u8R"x(UTF-8 raw string)x"
u
array of n
const char16_­t
UTF-16
u"UTF-16 string"
uR"y(UTF-16 raw string)y"
U
array of n
const char32_­t
UTF-32
U"UTF-32 string"
UR"z(UTF-32 raw string)z"
A string-literal that has an R in the prefix is a raw string literal.
The d-char-sequence serves as a delimiter.
The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.
[Note 1:
The characters '(' and ')' are permitted in a raw-string.
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
— end note]
[Note 2:
A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.
Assuming no whitespace at the beginning of lines in the following example, the assert will succeed: const char* p = R"(a\ b c)"; assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note]
[Example 1:
The raw string R"a( )\ a" )a" is equivalent to "\n)\\\na\"\n".
The raw string R"(x = "\"y\"")" is equivalent to "x = \"\\\"y\\\"\"".
— end example]
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.
If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix.
If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.
If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.
Any other concatenations are conditionally-supported with implementation-defined behavior.
[Note 3:
This concatenation is an interpretation, not a conversion.
Because the interpretation happens in translation phase 6 (after the string literal contents have been encoded in the string-literal's associated character encoding), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.
— end note]
Table 13 has some examples of valid concatenations.
Table 13: String literal concatenations [tab:lex.string.concat]
Source
Means
Source
Means
Source
Means
u"a"
u"b"
u"ab"
U"a"
U"b"
U"ab"
L"a"
L"b"
L"ab"
u"a"
"b"
u"ab"
U"a"
"b"
U"ab"
L"a"
"b"
L"ab"
"a"
u"b"
u"ab"
"a"
U"b"
U"ab"
"a"
L"b"
L"ab"
Characters in concatenated strings are kept distinct.
[Example 2:
"\xA" "B" contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
— end example]
In translation phase 6 ([lex.phases]), after adjacent string-literals are concatenated, a null character is appended to the result.
Evaluating a string-literal results in a string literal object with static storage duration ([basic.stc]).
Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
[Note 4:
The effect of attempting to modify a string literal object is undefined.
— end note]
String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-chars (for a non-raw string literal) and r-chars (for a raw string literal) in order as follows:
  • The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence using the string-literal's associated character encoding.
    If a character lacks representation in the associated character encoding, then:
    When encoding a stateful character encoding, implementations should encode the first such sequence beginning with the initial encoding state and encode subsequent sequences beginning with the final encoding state of the prior sequence.
    [Note 5:
    The encoded code unit sequence can differ from the sequence of code units that would be obtained by encoding each character independently.
    — end note]
  • Each numeric-escape-sequence ([lex.ccon]) that specifies an integer value v contributes a single code unit with a value as follows:
    When encoding a stateful character encoding, these sequences should have no effect on encoding state.
  • Each conditional-escape-sequence ([lex.ccon]) contributes an implementation-defined code unit sequence.
    When encoding a stateful character encoding, it is implementation-defined what effect these sequences have on encoding state.