Calculating String Length and Width – Fun with Unicode

Let's calculate string length in Rust¹! We'll look at how many characters there really are in a string, and how much space a string takes up when displayed.

¹: This article may also apply to other languages. This article will only focus on strings with the Rust default UTF-8 encoding. There's an appendix for how it works in Ruby at the end of the article. I'm simplifying this content to keep the article short.

String.len()

The first function you'll probably come across is String.len() (or str.len()), the length of a string. Given the string "abc", it returns a length of three. All looks good so far.
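
As a minimal sketch of that first experiment, the snippet below calls len() on both a string slice and an owned String. The byte_len helper is a hypothetical name introduced only for illustration; it is not part of the standard library.

```rust
/// Hypothetical helper for illustration: returns whatever `str::len()` reports.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    // `len()` is available on both `&str` and `String`.
    let owned = String::from("abc");
    println!("{}", byte_len("abc")); // 3
    println!("{}", owned.len()); // 3
}
```

For plain ASCII text like this, the reported length matches the number of characters a reader would count.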

That is, until we take a closer look at the docs for this function. It says the following:

Returns the length of this string, in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.

Source: String.len()

The Rust docs are giving us a warning here that it may not always return the number we'd expect. It will return the string length in bytes, and it sounds like not all characters are counted as one byte.

Let's try something that's not just plain "a" through "z", but something like a character with an accent.
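
A small sketch of that experiment, assuming the accented character is the precomposed "é" (U+00E9); the original example may have used a different character. The utf8_bytes helper is a hypothetical name used only here.

```rust
/// Hypothetical helper for illustration: copies out the raw UTF-8 bytes of a string.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // U+00E9 is the precomposed "é". Writing it as an escape avoids any
    // ambiguity about how the source file itself encodes the character.
    let accented = "\u{e9}";
    println!("{}", accented.len()); // 2
    println!("{:x?}", utf8_bytes(accented)); // [c3, a9]

    // Three ASCII letters (one byte each) plus "é" (two bytes).
    println!("{}", "caf\u{e9}".len()); // 5
}
```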

The result is a larger number than what we would consider the string's length to be: an accented character such as "é" (U+00E9), for example, takes two bytes in UTF-8. Many characters are made up of multiple bytes. A single eight-bit byte has only 256 possible values, and ASCII uses just 128 of them. That is nowhere near enough to represent all the characters of all the world's languages. Let's try another approach.
