Code Ranges: A Deeper Look at Ruby Strings
Contributing to any of the Ruby implementations can be a daunting task. A lot of internal functionality has evolved over the years or been ported from one implementation to another, and much of it is undocumented. This post is an informal look at what makes encoding-aware strings in Ruby functional and performant. I hope it'll help you get started digging into Ruby on your own or provide some additional insight into all the wonderful things the Ruby VM does for you.
Ruby has an incredibly flexible, if unusual, string representation. Ruby strings are generally mutable, although the core library has both non-mutating and mutating variants of many operations. There's also a mechanism for freezing strings that makes String objects immutable on a per-object or per-file basis. If a string literal is frozen, the VM will use an interned version of the string. Additionally, strings in Ruby are encoding-aware, and Ruby ships with 100+ encodings that can be applied to any string. This is in sharp contrast to languages that use one universal encoding for all their strings or prevent the construction of invalid strings.
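To make the mutability and freezing behavior concrete, here is a minimal sketch using only core String methods. The variable names are made up for illustration, and the object-identity check at the end reflects how CRuby deduplicates strings; other implementations may behave differently.

```ruby
# Non-mutating vs. mutating variants of the same operation.
original  = +"hello"             # unary plus guarantees an unfrozen string
upcased   = original.upcase      # returns a new String
unchanged = (original == "hello")
original.upcase!                 # modifies the receiver in place

# Freezing makes a single String object immutable.
frozen = "abc".freeze
frozen_error =
  begin
    frozen << "d"                # any mutation of a frozen string raises
    nil
  rescue FrozenError => e
    e
  end

# String#-@ returns a frozen, deduplicated (interned) string, so two
# equal strings resolve to one object (assumption: running on CRuby).
a = -"interned"
b = -"interned"
same_object = a.equal?(b)

puts upcased                     # HELLO
puts frozen_error.class          # FrozenError
puts same_object
```

The per-file variant mentioned above is the `# frozen_string_literal: true` magic comment, which has to appear at the top of a source file and is therefore left out of this snippet.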
Depending on the context, different encodings are applied when creating a string without an explicit encoding. By default, the three primary ones used are UTF-8, US-ASCII, and ASCII-8BIT (aliased as BINARY). The encoding associated with a string can be changed with or without validation. It is possible to create a string with an underlying byte sequence that is invalid in the associated encoding.
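The following sketch shows where each of the three default encodings tends to show up and how a string can end up with bytes that are invalid for its encoding. It assumes that "without validation" corresponds to `String#force_encoding` and "with validation" to `String#encode`; that mapping is my reading of the paragraph rather than something it states, and the variable names are illustrative.

```ruby
# Where the three default encodings commonly appear.
literal = "h\u00e9llo"           # a \u escape always yields UTF-8
number  = 42.to_s                # Integer#to_s yields US-ASCII
packed  = [0xff].pack("C")       # Array#pack("C") yields ASCII-8BIT (BINARY)

puts literal.encoding            # UTF-8
puts number.encoding             # US-ASCII
puts packed.encoding == Encoding::BINARY

# Without validation: force_encoding only relabels the bytes.
relabeled = packed.dup.force_encoding(Encoding::UTF_8)
still_one_byte = (relabeled.bytes == [0xff])
puts relabeled.valid_encoding?   # false, 0xFF is never valid in UTF-8

# With validation: encode transcodes and rejects the broken bytes.
transcode_error =
  begin
    relabeled.encode(Encoding::UTF_16LE)
    nil
  rescue Encoding::InvalidByteSequenceError => e
    e
  end
puts transcode_error.class       # Encoding::InvalidByteSequenceError
```

Note that `force_encoding` never raises here: the string simply carries an encoding its bytes do not satisfy until something checks, for example `valid_encoding?` or a transcoding call.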
...