2024-12-17 Technical Tutorial

String and Character Literals (C++)

Visual Studio 2015

C++ supports various string and character types, and provides ways to express literal values of each of these types. In your source code, you express the content of your character and string literals using a character set. Universal character names and escape characters allow you to express any string using only the basic source character set. A raw string literal enables you to avoid using escape characters, and can be used to express all types of string literals. You can also create std::string literals without having to perform extra construction or conversion steps.

C++

#include <string>
using namespace std::string_literals; // enables s-suffix for std::string literals
int main()
{
    // Character literals
    auto c0 =   'A'; // char
    auto c1 = u8'A'; // char
    auto c2 =  L'A'; // wchar_t
    auto c3 =  u'A'; // char16_t
    auto c4 =  U'A'; // char32_t
    // String literals
    auto s0 =   "hello"; // const char*
    auto s1 = u8"hello"; // const char*, encoded as UTF-8
    auto s2 =  L"hello"; // const wchar_t*
    auto s3 =  u"hello"; // const char16_t*, encoded as UTF-16
    auto s4 =  U"hello"; // const char32_t*, encoded as UTF-32
    // Raw string literals containing unescaped \ and "
    auto R0 =   R"("Hello \ world")"; // const char*
    auto R1 = u8R"("Hello \ world")"; // const char*, encoded as UTF-8
    auto R2 =  LR"("Hello \ world")"; // const wchar_t*
    auto R3 =  uR"("Hello \ world")"; // const char16_t*, encoded as UTF-16
    auto R4 =  UR"("Hello \ world")"; // const char32_t*, encoded as UTF-32
    // Combining string literals with standard s-suffix
    auto S0 =   "hello"s; // std::string
    auto S1 = u8"hello"s; // std::string
    auto S2 =  L"hello"s; // std::wstring
    auto S3 =  u"hello"s; // std::u16string
    auto S4 =  U"hello"s; // std::u32string
    // Combining raw string literals with standard s-suffix
    auto S5 =   R"("Hello \ world")"s; // std::string from a raw const char*
    auto S6 = u8R"("Hello \ world")"s; // std::string from a raw const char*, encoded as UTF-8
    auto S7 =  LR"("Hello \ world")"s; // std::wstring from a raw const wchar_t*
    auto S8 =  uR"("Hello \ world")"s; // std::u16string from a raw const char16_t*, encoded as UTF-16
    auto S9 =  UR"("Hello \ world")"s; // std::u32string from a raw const char32_t*, encoded as UTF-32
}

String literals can have no prefix, or u8, L, u, and U prefixes to denote narrow character (single-byte or multi-byte), UTF-8, wide character (UCS-2 or UTF-16), UTF-16 and UTF-32 encodings, respectively. A raw string literal can have R, u8R, LR, uR and UR prefixes for the raw version equivalents of these encodings. To create temporary or static std::string values, you can use string literals or raw string literals with an s suffix. For more information, see the String literals section below. For more information on the basic source character set, universal character names, and using characters from extended codepages in your source code, see Character Sets.
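The mapping from prefix to element type, and from prefix plus s suffix to string class, can be checked at compile time. The following is a minimal sketch, assuming a C++14 or later compiler; element_t and check_suffix_types are hypothetical helper names introduced only for this illustration. The u8 prefix is left out on purpose because its element type changed from char to char8_t in C++20.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical helper: strips the reference, array extent, and const
// qualifier from a literal's type, leaving only the element type.
template <typename Literal>
using element_t = typename std::remove_cv<
    typename std::remove_extent<
        typename std::remove_reference<Literal>::type>::type>::type;

// The prefix selects the element type of the underlying array.
static_assert(std::is_same<element_t<decltype("hello")>, char>::value, "no prefix");
static_assert(std::is_same<element_t<decltype(L"hello")>, wchar_t>::value, "L prefix");
static_assert(std::is_same<element_t<decltype(u"hello")>, char16_t>::value, "u prefix");
static_assert(std::is_same<element_t<decltype(U"hello")>, char32_t>::value, "U prefix");

// Hypothetical helper: the s suffix turns each literal into the matching
// basic_string specialization.
inline void check_suffix_types() {
    using namespace std::string_literals;
    static_assert(std::is_same<decltype("hello"s), std::string>::value, "std::string");
    static_assert(std::is_same<decltype(L"hello"s), std::wstring>::value, "std::wstring");
    static_assert(std::is_same<decltype(u"hello"s), std::u16string>::value, "std::u16string");
    static_assert(std::is_same<decltype(U"hello"s), std::u32string>::value, "std::u32string");
    assert("hello"s.size() == 5);
    assert(L"hello"s == std::wstring(L"hello"));
}
```

If the file compiles, every static_assert held; calling check_suffix_types() from main runs the two run-time checks.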

Character literals

A character literal is composed of a constant character. It is represented by the character surrounded by single quotation marks. There are four kinds of character literals:

Narrow-character literals of type char, for example 'a'

Wide-character literals of type wchar_t, for example L'a'

Wide-character literals of type char16_t, for example u'a'

Wide-character literals of type char32_t, for example U'a'

The character used for a character literal may be any character, except for the reserved characters backslash ('\'), single quotation mark ('), or new line. Reserved characters can be specified by using an escape sequence. Characters may be specified by using universal character names, as long as the type is large enough to hold the character.

Escape Sequences

There are three kinds of escape sequences: simple, octal, and hexadecimal. Escape sequences may be any of the following:

Value                 Escape sequence
newline               \n
horizontal tab        \t
vertical tab          \v
backspace             \b
carriage return       \r
form feed             \f
alert (bell)          \a
backslash             \\
question mark         ? or \?
single quote          \'
double quote          \"
the null character    \0
octal                 \ooo
hexadecimal           \xhhh

The following code shows some examples of escaped characters using narrow string literals. The same syntax is valid for the other string literal types.

C++

#include <iostream>
using namespace std;
int main() {
    char newline = '\n';
    char tab = '\t';
    char backspace = '\b';
    char backslash = '\\';
    char nullChar = '\0';
    cout << "Newline character: " << newline << "ending" << endl;     // Newline character:
                                                                      // ending
    cout << "Tab character: " << tab << "ending" << endl;             // Tab character: ending
    cout << "Backspace character: " << backspace << "ending" << endl; // Backspace character: ending
    cout << "Backslash character: " << backslash << "ending" << endl; // Backslash character: \ending
    cout << "Null character: " << nullChar << "ending" << endl;       // Null character: ending
}
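The numeric values behind the simple escape sequences can be pinned down with compile-time checks. This is a sketch that assumes an ASCII-compatible execution character set (true of mainstream compilers, but not guaranteed by the language); check_embedded_null is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>

// Simple escape sequences and their ASCII values (assumption: ASCII-compatible
// execution character set).
static_assert('\a' == 0x07, "alert (bell)");
static_assert('\b' == 0x08, "backspace");
static_assert('\t' == 0x09, "horizontal tab");
static_assert('\n' == 0x0A, "newline");
static_assert('\v' == 0x0B, "vertical tab");
static_assert('\f' == 0x0C, "form feed");
static_assert('\r' == 0x0D, "carriage return");
static_assert('\\' == 0x5C, "backslash");
static_assert('\'' == 0x27, "single quote");
static_assert('\"' == 0x22, "double quote");
static_assert('\?' == 0x3F, "question mark");
static_assert('\0' == 0, "null character");

// Octal and hexadecimal escapes name a character by its numeric value.
static_assert('\101' == 'A' && '\x41' == 'A', "both spell 'A' (0x41)");

// Hypothetical helper: an embedded null ends the string for C string
// functions, but the array still holds every character.
inline void check_embedded_null() {
    const char text[] = "Null character: \0ending";
    assert(std::strlen(text) == 16);  // stops at the \0 escape
    assert(sizeof(text) == 24);       // 16 + 1 + 6 characters plus the terminator
}
```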

Microsoft Specific

To create a value from an unprefixed character literal, the compiler converts the character or character sequence between single quotes into 8-bit values within a 32-bit integer. Multiple characters in the literal fill corresponding bytes as needed from high-order to low-order. To create a char value, the compiler takes the low-order byte. To create a wchar_t or char16_t value, the compiler takes the low-order word. The compiler warns that the result is truncated if any bits are set above the assigned byte or word.

C++

char c0    = 'abcd';    // C4305, C4309, truncates to 'd'
wchar_t w0 = 'abcd';    // C4305, C4309, truncates to '\x6364'

An octal escape sequence is a backslash followed by a sequence of up to 3 octal digits. An octal escape sequence that appears to contain more than three digits is treated as a 3-digit octal sequence, followed by the subsequent digits as characters; this can give surprising results. For example:

C++

char c1 = '\100';   // '@'
char c2 = '\1000';  // C4305, C4309, truncates to '0'

Escape sequences that appear to contain non-octal characters are evaluated as an octal sequence up to the last octal character, followed by the remaining characters. For example:

C++

char c3 = '\009';   // '9'
char c4 = '\089';   // C4305, C4309, truncates to '9'
char c5 = '\qrs';   // C4129, C4305, C4309, truncates to 's'
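The multi-character results above are specific to the Microsoft compiler, but the same "stop after three octal digits or at the first non-octal character" rule can be observed portably in string literals, where the leftover characters simply become further array elements. A small sketch, assuming an ASCII-compatible character set; the variable and function names are hypothetical.

```cpp
#include <cassert>

// "\1000" is the octal escape \100 ('@' in ASCII) followed by the character '0'.
constexpr char four_digits[] = "\1000";
static_assert(sizeof(four_digits) == 3, "'@', '0', terminating null");

// "\009" is the octal escape \00 (null) followed by '9': 9 is not an octal digit.
constexpr char stops_at_nine[] = "\009";
static_assert(sizeof(stops_at_nine) == 3, "null, '9', terminating null");

// "\089" is the octal escape \0 followed by the characters '8' and '9'.
constexpr char stops_at_eight[] = "\089";
static_assert(sizeof(stops_at_eight) == 4, "null, '8', '9', terminating null");

// Hypothetical helper: run-time view of the same arrays.
inline void check_octal_parsing() {
    assert(four_digits[0] == '@' && four_digits[1] == '0');
    assert(stops_at_nine[0] == '\0' && stops_at_nine[1] == '9');
    assert(stops_at_eight[1] == '8' && stops_at_eight[2] == '9');
}
```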

A hexadecimal escape sequence is a backslash followed by the character x, followed by a sequence of hexadecimal digits. An escape sequence that contains no hexadecimal digits causes compiler error C2153: "hex literals must have at least one hex digit". Leading zeroes are ignored. An escape sequence that appears to have hexadecimal and non-hexadecimal characters is evaluated as a hexadecimal escape sequence up to the last hexadecimal character, followed by the non-hexadecimal characters. In an unprefixed or u8-prefixed narrow character literal, the highest hexadecimal value is 0xFF. In an L-prefixed or u-prefixed wide character literal, the highest hexadecimal value is 0xFFFF. In a U-prefixed wide character literal, the highest hexadecimal value is 0xFFFFFFFF.

C++

char c6 = '\x0050'; // 'P'
char c7 = '\x0pqr'; // C4305, C4309, truncates to 'r'
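Unlike octal escapes, a hexadecimal escape has no digit limit: it absorbs every hexadecimal digit that follows it. The sketch below shows this in string literals, where the effect is portable; it assumes an ASCII-compatible character set, and the names are hypothetical.

```cpp
#include <cassert>

// Leading zeroes are ignored: \x0050 is the same value as \x50 ('P').
static_assert('\x0050' == 'P', "leading zeroes do not change the value");

// The escape ends at the first non-hexadecimal character ('g').
constexpr char stops_at_g[] = "\x41ghi";
static_assert(sizeof(stops_at_g) == 5, "'A', 'g', 'h', 'i', terminating null");

// B and C are hexadecimal digits, so they become part of the escape:
// this array holds the single code unit 0x41BC, not 'A', 'B', 'C'.
constexpr char16_t absorbed[] = u"\x41BC";
static_assert(sizeof(absorbed) / sizeof(absorbed[0]) == 2, "one code unit plus null");

// Ending the literal ends the escape; adjacent literals are then concatenated.
constexpr char16_t split[] = u"\x41" u"BC";
static_assert(sizeof(split) / sizeof(split[0]) == 4, "'A', 'B', 'C', null");

// Hypothetical helper: run-time view of the same arrays.
inline void check_hex_parsing() {
    assert(stops_at_g[0] == 'A' && stops_at_g[1] == 'g');
    assert(absorbed[0] == 0x41BC);
    assert(split[0] == u'A' && split[1] == u'B' && split[2] == u'C');
}
```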

If a wide character literal prefixed with L contains more than one character, the value is taken from the first character. Subsequent characters are ignored, unlike the behavior of the equivalent unprefixed narrow character literal.

C++

wchar_t w1 = L'\100';   // L'@'
wchar_t w2 = L'\1000';  // C4066 L'@', 0 ignored
wchar_t w3 = L'\009';   // C4066 L'\0', 9 ignored
wchar_t w4 = L'\089';   // C4066 L'\0', 89 ignored
wchar_t w5 = L'\qrs';   // C4129, C4066 L'q' escape, rs ignored
wchar_t w6 = L'\x0050'; // L'P'
wchar_t w7 = L'\x0pqr'; // C4066 L'\0', pqr ignored

END Microsoft Specific

The backslash character (\) is a line-continuation character when it is placed at the end of a line. If you want a backslash character to appear as a character literal, you must type two backslashes in a row (\\). For more information about the line continuation character, see Phases of Translation.

Universal character names

In character literals and native (non-raw) string literals, any character may be represented by a universal character name. Universal character names are formed by a prefix \U followed by an eight-digit Unicode code point, or by a prefix \u followed by a four-digit Unicode code point. All eight or four digits, respectively, must be present to make a well-formed universal character name.

C++

char u1 = 'A';          // 'A'
char u2 = '\101';       // octal, 'A'
char u3 = '\x41';       // hexadecimal, 'A'
char u4 = '\u0041';     // \u UCN 'A'
char u5 = '\U00000041'; // \U UCN 'A'
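For char16_t and char32_t literals, the value of a universal character name is the code point itself, which makes the equivalence easy to verify. A minimal sketch; halo and check_ucn are hypothetical names, and the narrow and wide comparisons assume an ASCII-compatible character set.

```cpp
#include <cassert>

// All of these spell LATIN CAPITAL LETTER A (U+0041).
static_assert('\u0041' == 'A', "narrow");
static_assert(L'\u0041' == L'A', "wide");
static_assert(u'\u0041' == u'A', "char16_t");
static_assert(U'\U00000041' == U'A', "char32_t");

// A \u name fits in a char16_t; code points above U+FFFF need \U and char32_t.
static_assert(u'\u00E9' == 0x00E9, "LATIN SMALL LETTER E WITH ACUTE");
static_assert(U'\U0001F607' == 0x1F607, "SMILING FACE WITH HALO");

constexpr char32_t halo = U'\U0001F607';

// Hypothetical helper: the code point is outside the 16-bit range.
inline void check_ucn() {
    assert(halo > 0xFFFF);
}
```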

Surrogate Pairs

Universal character names cannot encode values in the surrogate code point range D800-DFFF. For Unicode surrogate pairs, specify the universal character name by using \UNNNNNNNN, where NNNNNNNN is the eight-digit code point for the character. The compiler generates a surrogate pair if required.

In C++03, the language only allowed a subset of characters to be represented by their universal character names, and allowed some universal character names that didn’t actually represent any valid Unicode characters. This was fixed in the C++11 standard. In C++11, both character and string literals and identifiers can use universal character names. For more information on universal character names, see Character Sets. For more information about Unicode, see Unicode. For more information about surrogate pairs, see Surrogate Pairs and Supplementary Characters.
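To make "the compiler generates a surrogate pair if required" concrete, the sketch below computes the UTF-16 surrogates for U+1F607 by hand and compares them with what the compiler stores for a u-prefixed literal. SurrogatePair, to_surrogates, and check_surrogates are hypothetical names; the arithmetic is the standard UTF-16 encoding rule.

```cpp
#include <cassert>

// Hypothetical type holding the two UTF-16 code units of a supplementary character.
struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Splits a supplementary code point (U+10000..U+10FFFF) into its surrogates:
// subtract 0x10000, then put the top 10 bits in the high surrogate and the
// bottom 10 bits in the low surrogate.
constexpr SurrogatePair to_surrogates(char32_t cp) {
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}

// One character in the source, two code units (plus the null) in the array.
constexpr char16_t halo[] = u"\U0001F607";
static_assert(sizeof(halo) / sizeof(halo[0]) == 3, "surrogate pair plus null");

// Hypothetical helper: the compiler's encoding matches the manual computation.
inline void check_surrogates() {
    const SurrogatePair expected = to_surrogates(0x1F607);
    assert(halo[0] == expected.high && halo[1] == expected.low);
}
```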

String literals

A string literal represents a sequence of characters that together form a null-terminated string. The characters must be enclosed between double quotation marks. There are the following kinds of string literals:

Narrow String Literals

A narrow string literal is a non-prefixed, double-quote delimited, null-terminated array of type const char[n], where n is the length of the array in bytes. A narrow string literal may contain any graphic character except the double quotation mark ("), backslash (\), or newline character. A narrow string literal may also contain the escape sequences listed above, and universal character names that fit in a byte.

C++

const char *narrow = "abcd";
// represents the string: yes\no
const char *escaped = "yes\\no";

UTF-8 encoded strings

A UTF-8 encoded string is a u8-prefixed, double-quote delimited, null-terminated array of type const char[n], where n is the length of the encoded array in bytes. A u8-prefixed string literal may contain any graphic character except the double quotation mark ("), backslash (\), or newline character. A u8-prefixed string literal may also contain the escape sequences listed above, and any universal character name.

C++

const char* str1 = u8"Hello World";
const char* str2 = u8"\U0001F607 is O:-)";

Wide String Literals

A wide string literal is a null-terminated array of constant wchar_t that is prefixed by 'L' and contains any graphic character except the double quotation mark ("), backslash (\), or newline character. A wide string literal may contain the escape sequences listed above and any universal character name.

C++

const wchar_t* wide = L"zyxw";
const wchar_t* newline = L"hello\ngoodbye";

char16_t and char32_t (C++11)

C++11 introduces the portable char16_t (16-bit Unicode) and char32_t (32-bit Unicode) character types:

C++

auto s3 = u"hello"; // const char16_t*
auto s4 = U"hello"; // const char32_t*

Raw String Literals (C++11)

A raw string literal is a null-terminated array (of any character type) that contains any graphic character, including the double quotation mark ("), backslash (\), or newline character. Raw string literals are often used in regular expressions that use character classes, and in HTML strings and XML strings. For examples, see the following article: Bjarne Stroustrup's FAQ on C++11.

C++

// represents the string: An unescaped \ character
const char* raw_narrow = R"(An unescaped \ character)";
const wchar_t* raw_wide = LR"(An unescaped \ character)";
const char* raw_utf8 = u8R"(An unescaped \ character)";
const char16_t* raw_utf16 = uR"(An unescaped \ character)";
const char32_t* raw_utf32 = UR"(An unescaped \ character)";

A delimiter is a user-defined sequence of up to 16 characters that immediately precedes the opening parenthesis of a raw string literal and immediately follows its closing parenthesis. For example, in R"abc(Hello"\()abc" the delimiter sequence is abc and the string content is Hello"\(. You can use a delimiter to disambiguate raw strings that contain both double quotation marks and parentheses. This causes a compiler error:

C++

// meant to represent the string: )"
const char* bad_parens = R"()")";  // error C2059

But a delimiter resolves it:

C++

const char* good_parens = R"xyz()")xyz";
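One way to confirm what a raw string literal contains is to compare it with the equivalent escaped literal. The following sketch does that for the delimiter examples in this section; the variable names and check_raw_strings are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// No delimiter: the content is everything between the outer parentheses.
const char* const plain = R"(An unescaped \ character)";

// Delimiter xyz: the content is a closing parenthesis followed by a double
// quotation mark, which would end an undelimited raw string early.
const char* const parens = R"xyz()")xyz";

// Delimiter abc: the content is Hello, a double quotation mark, a backslash,
// and an opening parenthesis.
const char* const nested = R"abc(Hello"\()abc";

// Hypothetical helper: each raw literal equals its escaped spelling.
inline void check_raw_strings() {
    assert(std::strcmp(plain, "An unescaped \\ character") == 0);
    assert(std::strcmp(parens, ")\"") == 0);
    assert(std::strcmp(nested, "Hello\"\\(") == 0);
}
```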

You can construct a raw string literal in which there is a newline (not the escaped character) in the source:

C++

// represents the string: hello
// goodbye
const wchar_t* newline = LR"(hello
goodbye)";

std::string Literals (C++14)

std::string literals are Standard Library implementations of user-defined literals (see below) that are represented as "xyz"s (with an s suffix). This kind of string literal produces a temporary object of type std::string, std::wstring, std::u32string or std::u16string depending on the prefix that is specified. When no prefix is used, as above, a std::string is produced. L"xyz"s produces a std::wstring. u"xyz"s produces a std::u16string, and U"xyz"s produces a std::u32string.

C++

//#include <string>
//using namespace std::string_literals;
string str{ "hello"s };
string str2{ u8"Hello World" };
wstring str3{ L"hello"s };
u16string str4{ u"hello"s };
u32string str5{ U"hello"s };

The s suffix may also be used on raw string literals:

C++

u32string str6{ UR"(She said "hello.")"s };

std::string literals are defined in the namespace std::literals::string_literals in the <string> header file. Because std::literals::string_literals and std::literals are both declared as inline namespaces, std::literals::string_literals is automatically treated as if it belonged directly in namespace std.
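One practical difference between an s-suffixed literal and constructing a std::string from a plain literal is how embedded null characters are handled: the suffix operator receives the literal's full length, while the const char* constructor stops at the first null. A minimal sketch; from_pointer and from_literal are hypothetical names.

```cpp
#include <cassert>
#include <string>

// The const char* constructor measures the text with a null-terminated scan,
// so everything after the embedded \0 is lost.
inline std::string from_pointer() {
    return std::string("abc\0def");
}

// The s suffix passes the literal's length explicitly, so the embedded \0
// and the characters after it are kept.
inline std::string from_literal() {
    using namespace std::string_literals;
    return "abc\0def"s;
}

// Hypothetical helper comparing the two results.
inline void check_embedded_null_strings() {
    assert(from_pointer().size() == 3);
    assert(from_literal().size() == 7);
}
```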

Size of String Literals

For ANSI char* strings and other single-byte encodings (not UTF-8), the size (in bytes) of a string literal is the number of characters plus 1 for the terminating null character. For all other string types, the size is not strictly related to the number of characters. UTF-8 uses up to four char elements to encode some code points, and char16_t or wchar_t encoded as UTF-16 may use two elements (for a total of four bytes) to encode a single code point. This example shows the size of a wide string literal in bytes:

C++

const wchar_t* str = L"Hello!";
const size_t byteSize = (wcslen(str) + 1) * sizeof(wchar_t);

Notice that strlen() and wcslen() do not include the size of the terminating null character, whose size is equal to the element size of the string type: one byte on a char* string, two bytes on wchar_t* or char16_t* strings, and four bytes on char32_t* strings.
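When the literal itself is in scope (rather than a pointer to it), sizeof already reports the full byte size including the terminator, which gives a convenient cross-check for the wcslen formula. A sketch; wide_byte_size and check_sizes are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// sizeof on a literal counts every element, including the terminating null.
static_assert(sizeof("Hello!") == 7, "6 characters plus null, 1 byte each");
static_assert(sizeof(L"Hello!") == 7 * sizeof(wchar_t), "7 wide elements");
static_assert(sizeof(u"Hello!") == 7 * sizeof(char16_t), "7 UTF-16 elements");

// One character, but four UTF-8 code units plus the null.
static_assert(sizeof(u8"\U0001F607") == 5, "4 bytes of UTF-8 plus null");

// Hypothetical helper: the formula from the text, for use when only a
// pointer is available.
inline std::size_t wide_byte_size(const wchar_t* text) {
    return (std::wcslen(text) + 1) * sizeof(wchar_t);
}

inline void check_sizes() {
    assert(wide_byte_size(L"Hello!") == sizeof(L"Hello!"));
}
```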

The maximum length of a string literal is 65535 bytes. This limit applies to both narrow string literals and wide string literals.

Modifying String Literals

Because string literals (not including std::string literals) are constants, trying to modify them (for example, str[2] = 'A') causes a compiler error.
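When the text does need to change, the usual portable approach is to modify a copy rather than the literal. The sketch below shows two such copies; the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Initializing a char array from a literal copies the characters into
// writable storage owned by the array.
inline std::string modified_array_copy() {
    char buffer[] = "hello";
    buffer[2] = 'A';
    return std::string(buffer);
}

// A std::string also owns its own copy of the characters.
inline std::string modified_string_copy() {
    std::string text = "hello";
    text[2] = 'A';
    return text;
}

// Hypothetical helper: both copies were modified as expected.
inline void check_modified_copies() {
    assert(modified_array_copy() == "heAlo");
    assert(modified_string_copy() == "heAlo");
}
```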

Microsoft Specific

In Visual C++ you can use a string literal to initialize a pointer to non-const char or wchar_t. This is allowed in C99 code, but is deprecated in C++98 and removed in C++11. An attempt to modify the string causes an access violation, as in this example:

C++

wchar_t* str = L"hello";
str[2] = L'a'; // run-time error: access violation

You can cause the compiler to emit an error when a string literal is converted to a non-const character pointer when you set the /Zc:strictStrings (Disable string literal type conversion) compiler option. We recommend it for standards-compliant portable code. It is also a good practice to use the auto keyword to declare string literal-initialized pointers, because it resolves to the correct (const) type. For example, this code example catches an attempt to write to a string literal at compile time:

C++

auto str = L"hello";
str[2] = L'a'; // C3892: you cannot assign to a variable that is const.

In some cases, identical string literals may be pooled to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal. To enable string pooling, use the /GF compiler option.

End Microsoft Specific

Concatenating adjacent string literals

Adjacent wide or narrow string literals are concatenated. This declaration:

C++

char str[] = "12" "34";

is identical to this declaration:

C++

char str[] = "1234";

and to this declaration:

C++

char str[] = "12\
34";
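The equivalence of the three spellings can be checked directly. In this sketch the array names are hypothetical, and the continuation line deliberately starts in the first column, because any leading whitespace on it would become part of the string.

```cpp
#include <cassert>
#include <cstring>

// Adjacent literals are joined by the compiler.
const char adjacent[] = "12" "34";

// The same text written as a single literal.
const char single[] = "1234";

// A backslash at the end of a line splices the next line onto it before
// the literal is parsed.
const char continued[] = "12\
34";

static_assert(sizeof(adjacent) == 5, "4 characters plus null");
static_assert(sizeof(single) == 5, "4 characters plus null");
static_assert(sizeof(continued) == 5, "4 characters plus null");

// Hypothetical helper: all three arrays hold the same text.
inline void check_concatenation() {
    assert(std::strcmp(adjacent, single) == 0);
    assert(std::strcmp(single, continued) == 0);
}
```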

Using embedded hexadecimal escape codes to specify string literals can cause unexpected results. The following example seeks to create a string literal that contains the ASCII 5 character, followed by the characters f, i, v, and e:

C++

"\x05five"

The actual result is a hexadecimal 5F, which is the ASCII code for an underscore, followed by the characters i, v, and e. To get the correct result, you can use one of these:

C++

"\005five"     // Use octal literal.
"\x05" "five"  // Use string splicing.

std::string literals, because they are std::string types, can be concatenated with the + operator that is defined for basic_string types. They can also be concatenated in the same way as adjacent string literals. In both cases, the string encoding and the suffix must match:

C++

auto x1 = "hello" " " " world"; // OK
auto x2 = U"hello" " " L"world"; // C2308: disagree on prefix
auto x3 = u8"hello" " "s u8"world"s; // OK, agree on prefixes and suffixes
auto x4 = u8"hello" " "s u8"world"z; // C3688, disagree on suffixes

String literals with universal character names

Native (non-raw) string literals may use universal character names to represent any character, as long as the universal character name can be encoded as one or more characters in the string type. For example, a universal character name representing an extended character cannot be encoded in a narrow string using the ANSI code page, but it can be encoded in narrow strings in some multi-byte code pages, or in UTF-8 strings, or in a wide string. In C++11, Unicode support is extended by the char16_t* and char32_t* string types:

C++

// ASCII smiling face
const char*     s1 = ":-)";
// UTF-16 (on Windows) encoded WINKING FACE (U+1F609)
const wchar_t*  s2 = L"😉 = \U0001F609 is ;-)";
// UTF-8 encoded SMILING FACE WITH HALO (U+1F607)
const char*     s3 = u8"😇 = \U0001F607 is O:-)";
// UTF-16 encoded SMILING FACE WITH OPEN MOUTH (U+1F603)
const char16_t* s4 = u"😃 = \U0001F603 is :-D";
// UTF-32 encoded SMILING FACE WITH SUNGLASSES (U+1F60E)
const char32_t* s5 = U"😎 = \U0001F60E is B-)";
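The phrase "encoded as one or more characters" can be made concrete by counting how many elements the same kind of universal character name occupies in each encoding. The sketch below is illustrative only: code_units and count_code_points_utf8 are hypothetical helpers, templated on the element type because u8 literals are arrays of char before C++20 and of char8_t from C++20 on.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of elements in a literal, excluding the null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// One supplementary-plane character per literal, three different lengths.
static_assert(code_units(u8"\U0001F607") == 4, "UTF-8: four code units");
static_assert(code_units(u"\U0001F603") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U0001F60E") == 1, "UTF-32: one code unit");

// Hypothetical helper: counts code points in a UTF-8 array by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
template <typename Byte, std::size_t N>
std::size_t count_code_points_utf8(const Byte (&text)[N]) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // i + 1 < N skips the null
        const unsigned char byte = static_cast<unsigned char>(text[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

inline void check_code_points() {
    // One emoji plus the eight ASCII characters of " is O:-)".
    assert(count_code_points_utf8(u8"\U0001F607 is O:-)") == 9);
    assert(code_units(u8"\U0001F607 is O:-)") == 12);
}
```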