字符串(Strings)的内容通常是人可读(human-readable)的。
在处理区块链时,正确处理人可读(human-readable)和人提供(human-provided)的数据 对于防范资金的损失、权限的错误等非常重要。
在Solidity存储一个字符串时,前缀是字符串的长度(256位,即32个字节),接着的才是字符串的内容。 这意味着即使很短的字符串也至少需要2个words(64个字节)的存储空间。
大多数情况下,我们处理的是较短的字符串,所以我们不用字符串的长度作为前缀, 而是通过null-terminate处理字符串并将其放入一个word(32个字节)中。 由于null termination只需要一个字节,因此可以在一个word中存储长度不超过31字节的字符串。
长度为31bytes的字符串可能包含少于31个characters,因为UTF-8需要多个字节来编码国际字符。
返回对Bytes32
编码数据进行解码的字符串。
返回text的bytes32
字符串表示形式。如果text长度超过31字节,则会抛出错误。
返回text的UTF-8字节表示,也可通过UnicodeNormalizationForm form对其进行标准化。
返回text的codepoints数组,也可通过UnicodeNormalizationForm form对其进行标准化。
This function correctly splits each user-perceived character into its codepoint, accounting for surrogate pairs. This should not be confused with string.split("")
, which destroys surrogate pairs, splitting between each UTF-16 codeunit instead.
返回由aBytesLike的UTF-8字节表示的字符串。
onError 是一个自定义的UTF-8 error函数, 如果没有指定它默认为error函数,在发生UTF-8 error时会抛出一个错误。
UnicodeNormalizationForm
在标准化UTF-8数据时,有几种常用的形式, 它们能够以稳定的方式对字符串进行比较或哈希。
保持当前的 normalization form。
The Composed Normalization Form. This form uses single codepoints which represent the fully composed character.
For example, the é is a single codepoint, 0x00e9
.
The Decomposed Normalization Form. This form uses multiple codepoints (when necessary) to compose a character.
For example, the é is made up of two codepoints, "0x0065"
(which is the letter "e"
) and "0x0301"
which is a special diacritic UTF-8 codepoint which indicates the previous character should have an acute accent.
The Composed Normalization Form with Canonical Equivalence. The Canonical representation folds characters which have the same syntactic representation but different semantic meaning.
For example, the Roman Numeral I, which has a UTF-8 codepoint "0x2160"
, is folded into the capital letter I, "0x0049"
.
The Decomposed Normalization Form with Canonical Equivalence. See NFKC for more an example.
Only certain specified characters are folded in Canonical Equivalence, and thus it should not be considered a method to achieve any level of security from homoglyph attacks.
在将字符串转换为其代码点时,可能会出现无效的字节序列。 由于某些情况下可能需要特定的方法来处理UTF-8错误,可以使用自定义错误处理函数,该函数具有如下签名:
reason是一个utf-8的错误原因,offset是第一次遇到error字节的索引, output已经处理过的(并且可能被修改)的代码点列表和一些错误原因, badCodepoint表示当前计算的代码点,如果它的值是无效的该签名将会被拒绝。
这个函数返回要跳过的字节数,因为offset的值已经被使用过了。
UTF-8 错误原因
遇到一个字节,该字节开头的UTF-8字节序列是无效的。
A UTF-8 sequence was begun, but did not have enough continuation bytes for the sequence. For this error the ofset is the index at which a continuation byte was expected.
The computed codepoint is outside the range for valid UTF-8 codepoints (i.e. the codepoint is greater than 0x10ffff). This reason will pass the computed badCountpoint into the custom error function.
Due to the way UTF-8 allows variable length byte sequences to be used, it is possible to have multiple representations of the same character, which means overlong sequences allow for a non-distinguished string to be formed, which can impact security as multiple strings that are otherwise equal can have different hashes.
Generally, overlong sequences are an attempt to circumvent some part of security, but in rare cases may be produced by lazy libraries or used to encode the null terminating character in a way that is safe to include in a char*
.
This reason will pass the computed badCountpoint into the custom error function, which is actually a valid codepoint, just one that was arrived at through unsafe methods.
The string does not have enough characters remaining for the length of this sequence.
This error is similar to BAD_PREFIX, since a continuation byte cannot begin a valid sequence, but many may wish to process this differently. However, most developers would want to trap this and perform the same operation as a BAD_PREFIX.
The computed codepoint represents a value reserved for UTF-16 surrogate pairs. This reason will pass the computed surrogate half badCountpoint into the custom error function.
There are already several functions available for the most common situations.
The will throw an error on any error with a UTF-8 sequence, including invalid prefix bytes, overlong sequences, UTF-16 surrogate pairs.
This will drop all invalid sequences (by consuming invalid prefix bytes and any following continuation bytes) from the final string as well as permit overlong sequences to be converted to their equivalent string.
This will replace all invalid sequences (by consuming invalid prefix bytes and any following continuation bytes) with the UTF-8 Replacement Character, (i.e. U+FFFD).