String_Class

Namespace

Ancestors

Comparable
Data
Object

A U::String is a sequence of zero or more Unicode characters encoded as UTF-8. Its interface is an extension of that of Ruby’s built-in String class that provides better Unicode support, as it handles things such as casing, width, collation, and various other Unicode properties that Ruby’s built-in String class simply doesn’t bother itself with. It also provides “backwards compatibility” with Ruby 1.8.7 so that you can use Unicode without upgrading to Ruby 2.0 (which you probably should do, though).

It differs from Ruby’s built-in String class in one other very important way in that it doesn’t provide any way to change an existing object. That is, a U::String is a value object.

A U::String is most easily created from a String by calling String#u. Most U::String methods that return a stringy result will return a U::String, so you only have to do that once. You can get back a String by calling U::String#to_str.

Validation of a U::String’s content isn’t performed until any access to it is made, at which time an ArgumentError will be raised if it isn’t valid.

U::String has a lot of methods defined upon it, so let’s break them up into categories to get a proper overview of what’s possible to do with one. Let’s begin with the interrogators. There are three kinds of interrogators: validity-checking ones, property-checking ones, and content-matching ones.

The validity-checking interrogator is #valid_encoding?, which makes sure that the UTF-8 sequence itself is valid.

The property-checking interrogators are #alnum?, #alpha?, #ascii_only?, #assigned?, #case_ignorable?, #cased?, #cntrl?, #defined?, #digit?, #graph?, #newline?, #print?, #punct?, #soft_dotted?, #space?, #title?, #valid?, #wide?, #wide_cjk?, #xdigit?, and #zero_width?. These interrogators check the corresponding Unicode property of each character in the U::String and return true if all characters have this property.

Very close relatives to the property-checking interrogators are #folded?, #lower?, and #upper?, which check whether a string has been cased in a given way, and #normalized?, which checks whether the receiver has been normalized, optionally to a specific normalization form.

The content-matching interrogators are #==, #===, #=~, #match, #empty?, #end_with?, #eql?, #include?, #index, #rindex, and #start_with?. These interrogators check that a substring of the U::String matches another string or Regexp and return either a Boolean result, an index into the U::String where the match begins, or MatchData for full matching information.

Related to the content-matching interrogators are #<=>, #casecmp, and #collation_key, all of which compare a U::String against another for ordering.

Related to the property-checking interrogators are #canonical_combining_class, #general_category, #grapheme_break, #line_break, #script, and #word_break, which return the value of the Unicode property in question, the general category being the one most often interrogated.

There are a couple of other “interrogators” in #bytesize, #length, #size, and #width that return integer properties of the U::String as a whole, where #length and #width are probably the most useful.

Beyond interrogators there are quite a few methods for iterating over the content of a U::String, each viewing it in its own way: #each_byte, #each_char, #each_codepoint, #each_grapheme_cluster, #each_line, and #each_word. They all have respective methods (#bytes, #chars, #codepoints, #grapheme_clusters, #lines, #words). Of these, #bytes, #chars, #codepoints, and #lines return an Array instead of yielding each result, while #grapheme_clusters and #words are documented as aliases for #each_grapheme_cluster and #each_word.

Quite a few methods are devoted to extracting a substring of a U::String, namely #[], #slice, #byteslice, #chomp, #chop, #chr, #getbyte, #lstrip, #ord, #rstrip, and #strip.

There are a few methods for case-shifting: #downcase, #foldcase, #titlecase, and #upcase. Then there’s #mirror, #normalize, and #reverse that alter the string in other ways.

The methods #center, #ljust, and #rjust pad a U::String to make it a certain number of cells wide.

Then there’s a couple of methods that are more related in the arguments they take than in function: #count, #delete, #squeeze, #tr, and #tr_s. These methods all take specifications of character/code point ranges that should be counted, deleted, squeezed, and translated (plus squeezed).
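As a rough illustration of these character specifications, the sketch below uses Ruby’s built-in String, whose interface U::String is described as extending. It is an assumption, not something verified here, that U::String accepts the same specifications and returns equivalent U::String results.

```ruby
# Built-in String is used here as a stand-in for U::String (assumption).
text = "hello world"

count_lo   = text.count("lo")          # counts every 'l' and every 'o'
no_ls      = text.delete("l")          # removes every 'l'
squeezed   = "aaa  bbb".squeeze        # collapses each run of a repeated character
translated = "hello".tr("el", "ip")    # maps 'e' to 'i' and 'l' to 'p'
shifted    = "hello".tr("a-y", "b-z")  # ranges are written as "from-to"
tr_squeeze = "hello".tr_s("l", "r")    # translates, then squeezes the translated run

puts count_lo, no_ls, squeezed, translated, shifted, tr_squeeze
```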

Deconstructing a U::String can be done with #partition and #rpartition, which split it around a divider, #scan, which extracts matches to a pattern, and #split, which splits it on a divider.

Substitution of all matches to a pattern can be made with #gsub and of the first match to a pattern with #sub.

Creating larger U::Strings from smaller ones is done with #+, which concatenates two of them, and #*, which concatenates a U::String to itself a number of times.

A U::String can also be used as a specification as to how to format a number of values via #% (and its alias #format) into a new U::String, much like snprintf(3) in C.

The content of a U::String can be #dumped and #inspected to make it reader-friendly, but also debugger-friendly.

Finally, a U::String has a few methods to turn its content into other values: #hash, which turns it into a hash value to be used for hashing, #hex, #oct, #to_i, which turn it into an Integer, #to_str, #to_s, #b, which turn it into a String, and #to_sym (and its alias #intern), which turns it into a Symbol.

Note that some methods defined on String are missing. #capitalize doesn’t exist, as capitalization isn’t a Unicode concept. #sum doesn’t exist, as a U::String generally doesn’t contain content that you need a checksum of. #crypt doesn’t exist for similar reasons. #swapcase isn’t useful on a String and it certainly isn’t useful in a Unicode context. As a U::String doesn’t contain arbitrary data, #unpack is left to String. #next/#succ would perhaps be implementable, but haven’t been implemented, as a satisfactory implementation hasn’t been thought of.

Constructor

initialize(string_String^? = `nil`)#⚙

Sets up a U::String wrapping string after encoding it as UTF-8 and freezing it.

Instance Methods

%
*
+
<=>
==
===
=~
[]₁
[]₂
[]₃
[]₄
[]₅
[]₆
alnum?
alpha?
ascii_only?
assigned?
b
bytes
bytesize
byteslice₁
byteslice₂
byteslice₃
byteslice₄
canonical_combining_class
case_ignorable?
casecmp
cased?
center
chars
chomp
chop
chr
cntrl?
codepoints
collation_key
count
defined?
delete
digit?
downcase
dump
each_byte₁
each_byte₂
each_char₁
each_char₂
each_codepoint₁
each_codepoint₂
each_grapheme_cluster₁
each_grapheme_cluster₂
each_line₁
each_line₂
each_word₁
each_word₂
empty?
end_with?
eql?
foldcase
folded?
format
general_category
getbyte
graph?
grapheme_break
grapheme_clusters₁
grapheme_clusters₂
gsub₁
gsub₂
gsub₃
gsub₄
hash
hex
include?
index
inspect
intern
length
line_break
lines
ljust
lower?
lstrip
match₁
match₂
mirror
newline?
normalize
normalized?
oct
ord
partition
print?
punct?
reverse
rindex
rjust
rpartition
rstrip
scan₁
scan₂
scan₃
scan₄
script
size
slice₁
slice₂
slice₃
slice₄
slice₅
slice₆
soft_dotted?
space?
split
squeeze
start_with?
strip
sub₁
sub₂
sub₃
title?
titlecase
to_i
to_s
to_str
to_sym
tr
tr_s
u
upcase
upper?
valid?
valid_encoding?
wide?
wide_cjk?
width
word_break
words₁
words₂
xdigit?
zero_width?

u_self#⚙

Returns the receiver; mostly for completeness, but allows you to always call #u on something that’s either a String or a U::String.

valid_encoding?_Boolean#⚙

Returns true if the receiver contains only valid UTF-8 sequences.
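The check can be tried out on Ruby’s built-in String, which has a method of the same name. This is only a sketch of the idea: it assumes, without having verified it, that U::String#valid_encoding? judges byte sequences the same way.

```ruby
# Built-in String is used as a stand-in for U::String (assumption).
valid = "caf\u00E9"

# 0xC3 starts a two-byte UTF-8 sequence, but no continuation byte follows.
truncated = [0x63, 0x61, 0x66, 0xC3].pack("C*").force_encoding(Encoding::UTF_8)

valid_ok     = valid.valid_encoding?
truncated_ok = truncated.valid_encoding?

puts valid_ok, truncated_ok
```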

alnum?_Boolean#⚙

Returns true if the receiver contains only characters in the general categories Letter and Number.
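The “every character has the property” rule can be sketched with a Regexp property class on a built-in String. The helper name only_letters_and_numbers? is hypothetical and this is not the library’s implementation.

```ruby
# Hypothetical helper: true if every character is a Letter (L) or Number (N).
# Note that an empty string yields true here; what U::String returns for an
# empty receiver is not stated in this documentation.
def only_letters_and_numbers?(string)
  string.each_char.all? { |char| char.match?(/[\p{L}\p{N}]/) }
end

puts only_letters_and_numbers?("abc123")
puts only_letters_and_numbers?("abc 123")
puts only_letters_and_numbers?("\u00E5\u0663") # LATIN SMALL LETTER A WITH RING ABOVE, ARABIC-INDIC DIGIT THREE
```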

alpha?_Boolean#⚙

Returns true if the receiver contains only characters in the general category Alpha.

ascii_only?_Boolean#⚙

Returns true if the receiver contains only characters in the ASCII region, that is, U+0000 through U+007F.

assigned?_Boolean#⚙

Returns true if the receiver contains only code points that have been assigned a code value.

case_ignorable?_Boolean#⚙

Returns true if the receiver contains only “case ignorable” characters, that is, characters in the general categories

Other, format (Cf)
Letter, modifier (Lm)
Mark, enclosing (Me)
Mark, nonspacing (Mn)
Symbol, modifier (Sk)

and the characters

U+0027 APOSTROPHE
U+00AD SOFT HYPHEN
U+2019 RIGHT SINGLE QUOTATION MARK.

See Also: Unicode Standard Annex #21: Case Mappings
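The definition above can be written down as a single character class. The constant and helper names below are hypothetical, and the sketch works on a built-in String rather than a U::String.

```ruby
# Hypothetical sketch of the "case ignorable" definition given above:
# categories Cf, Lm, Me, Mn, Sk plus U+0027, U+00AD and U+2019.
CASE_IGNORABLE = /[\p{Cf}\p{Lm}\p{Me}\p{Mn}\p{Sk}\u0027\u00AD\u2019]/

def only_case_ignorable?(string)
  string.each_char.all? { |char| char.match?(CASE_IGNORABLE) }
end

puts only_case_ignorable?("'\u2019")  # both apostrophe characters
puts only_case_ignorable?("\u0301")   # COMBINING ACUTE ACCENT (Mn)
puts only_case_ignorable?("a'")       # 'a' is a cased letter
```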

cased?_Boolean#⚙

Returns true if the receiver only contains characters in the general categories

Letter, uppercase (Lu)
Letter, lowercase (Ll)
Letter, titlecase (Lt)

or that have the derived properties Other_Uppercase or Other_Lowercase.

cntrl?_Boolean#⚙

Returns true if the receiver contains only characters in the general category Other, control (Cc).

defined?_Boolean#⚙

Returns true if the receiver contains only characters not in the general categories Other, not assigned (Cn) and Other, surrogate (Cs).

digit?_Boolean#⚙

Returns true if the receiver contains only characters in the general category Number, decimal digit (Nd).

folded?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

Returns true if the receiver has been case-folded according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#foldcase(locale), where a = #normalize(:nfd).

graph?_Boolean#⚙

Returns true if the receiver contains only non-space “printable” characters.

Non-space “printable” characters are those not in the general categories Other or Space, separator (Zs):

Other, control (Cc)
Other, format (Cf)
Other, not assigned (Cn)
Other, surrogate (Cs)
Space, separator (Zs)

lower?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

Returns true if the receiver has been downcased according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#downcase(locale), where a = #normalize(:nfd).

newline?_Boolean#⚙

Returns true if the receiver contains only “newline” characters. A character is a “newline” character if it is any of the following characters:

U+000A (LINE FEED (LF))
U+000C (FORM FEED (FF))
U+000D (CARRIAGE RETURN (CR))
U+0085 (NEXT LINE)
U+2028 (LINE SEPARATOR)
U+2029 (PARAGRAPH SEPARATOR)
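The list above translates directly into a set of code points. The following is a minimal sketch on a built-in String; the names NEWLINE_CODE_POINTS and only_newlines? are hypothetical.

```ruby
# The six "newline" code points listed above.
NEWLINE_CODE_POINTS = [0x000A, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029].freeze

# Hypothetical helper: true if every character is one of those code points.
def only_newlines?(string)
  string.each_char.all? { |char| NEWLINE_CODE_POINTS.include?(char.ord) }
end

puts only_newlines?("\r\n\u2028")
puts only_newlines?("a\n")
```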

print?_Boolean#⚙

Returns true if the receiver contains only characters not in the general category Other.

punct?_Boolean#⚙

Returns true if the receiver contains only characters in the general categories Punctuation and Symbol.

soft_dotted?_Boolean#⚙

Returns true if this U::String only contains soft-dotted characters.

Note: Soft-dotted characters have the soft-dotted property and thus lose their dot if an accent is applied to them, for example, ‘i’ and ‘j’.
See Also: Unicode Public Review Issue #11

space?_Boolean#⚙

Returns true if the receiver contains only “space” characters. Space characters are those in the general category Separator:

Separator, space (Zs)
Separator, line (Zl)
Separator, paragraph (Zp)

such as ‘ ’, or a control character acting as such, namely

U+0009 CHARACTER TABULATION (HT)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)

title?_Boolean#⚙

Returns true if the receiver contains only characters in the general category Letter, Titlecase (Lt).

upper?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

Returns true if the receiver has been upcased according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, that is, if a = a#upcase(locale), where a = #normalize(:nfd).

valid?_Boolean#⚙

Returns true if the receiver contains only valid Unicode characters.

wide?_Boolean#⚙

Returns true if the receiver contains only “wide” characters. Wide characters are those that have their East_Asian_Width property set to Wide or Fullwidth.

This is mostly useful for determining how many “cells” a character will take up on a terminal or similar cell-based display.

See Also: Unicode Standard Annex #11: East Asian Width

wide_cjk?_Boolean#⚙

Returns true if the receiver contains only “wide” and “ambiguously wide” characters. Wide and ambiguously wide characters are those that have their East_Asian_Width property set to Ambiguous, Wide or Fullwidth.

This is mostly useful for determining how many “cells” a character will take up on a terminal or similar cell-based display.

See Also: Unicode Standard Annex #11: East Asian Width

xdigit?_Boolean#⚙

Returns true if the receiver contains only characters that are in the general category Number, decimal digit (Nd) or are lower- or uppercase letters between ‘a’ and ‘f’. Specifically, any character that

Belongs to the general category Number, decimal digit (Nd)
Falls in the range U+0041 (LATIN CAPITAL LETTER A) through U+0046 (LATIN CAPITAL LETTER F)
Falls in the range U+0061 (LATIN SMALL LETTER A) through U+0066 (LATIN SMALL LETTER F)
Falls in the range U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) through U+FF26 (FULLWIDTH LATIN CAPITAL LETTER F)
Falls in the range U+FF41 (FULLWIDTH LATIN SMALL LETTER A) through U+FF46 (FULLWIDTH LATIN SMALL LETTER F)

will do.
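The five conditions above fit into one character class. The sketch below is an illustration on a built-in String with hypothetical names, not the library’s implementation.

```ruby
# Nd digits, A-F, a-f, and their fullwidth counterparts, as listed above.
HEX_DIGIT = /[\p{Nd}A-Fa-f\uFF21-\uFF26\uFF41-\uFF46]/

def only_hex_digits?(string)
  string.each_char.all? { |char| char.match?(HEX_DIGIT) }
end

puts only_hex_digits?("09afAF")
puts only_hex_digits?("\uFF21\uFF46")  # FULLWIDTH A, FULLWIDTH f
puts only_hex_digits?("\u0663")        # ARABIC-INDIC DIGIT THREE is in Nd
puts only_hex_digits?("g")
```

Note that, by this definition, any decimal digit in Nd qualifies, not only ASCII ‘0’ through ‘9’.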

zero_width?_Boolean#⚙

Returns true if the receiver contains only “zero-width” characters. A zero-width character is defined as a character that is in the general categories Mark, nonspacing (Mn), Mark, enclosing (Me) or Other, format (Cf), excluding the character U+00AD (SOFT HYPHEN), or that is a Hangul character between U+1160 and U+1200 or U+200B (ZERO WIDTH SPACE).

normalized?(mode_{#to_sym} = `:default`)_Boolean#⚙

Returns true if it can be determined that the receiver is normalized according to mode.

See #normalize for a discussion on normalization and a list of the possible normalization modes.

See Also: Unicode Standard Annex #15: Unicode Normalization Forms
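To make the idea concrete, the sketch below uses the normalization support in Ruby’s built-in String (#unicode_normalize and #unicode_normalized?, available since Ruby 2.2). It assumes, without verifying, that U::String’s :nfc and :nfd forms agree with the built-in ones.

```ruby
# Built-in String is used as a stand-in for U::String (assumption).
composed   = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

composed_is_nfc   = composed.unicode_normalized?(:nfc)
composed_is_nfd   = composed.unicode_normalized?(:nfd)
decomposed_is_nfd = decomposed.unicode_normalized?(:nfd)
round_trip        = composed.unicode_normalize(:nfd) == decomposed

puts composed_is_nfc, composed_is_nfd, decomposed_is_nfd, round_trip
```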

`==`(other_{U::String, #to_str})_Boolean#⚙

Returns true if the receiver’s bytes equal those of other.

See Also

#<=>
#eql?

`===`(other_{U::String, #to_str})_Boolean#⚙

This is an alias for #==.

`=~`(other_{Regexp, #=~})_Numeric^?#⚙

Returns the result of other#=~(self), that is, the index of the first character of the match of other in the receiver, if one exists.

Raises_TypeError: If other is a U::String or String

match(pattern_{Regexp, #to_str}, index_{#to_int} = `0`)_MatchData^?#⚙

Returns the result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from pattern, if one exists, where r = pattern, if pattern is a Regexp, r = Regexp.new(pattern) otherwise.

match(pattern_{Regexp, #to_str}, index_{#to_int} = `0`){ |matchdata_MatchData| … }_Object^?#⚙

Returns the result of calling the given block with the result of r#match(self, index), that is, the match data of the first match of r in the receiver, inheriting any taint and untrust from both the receiver and from pattern, if one exists, where r = pattern, if pattern is a Regexp, r = Regexp.new(pattern) otherwise.

empty?_Boolean#⚙

Returns true if #bytesize = 0.

end_with?(*suffixes_Array)_Boolean#⚙

Returns true if any element of suffixes that responds to #to_str is a byte-level suffix of the receiver.

eql?(other_U::String)_Boolean#⚙

Returns true if the receiver’s bytes equal those of other.

See Also

#<=>
#==

include?(substring_{#to_str})_Boolean#⚙

Returns true if #index(substring) ≠ nil.

index(pattern_{Regexp, #to_str}, offset_{#to_int} = `0`)_Integer^?#⚙

Returns the minimal index of the receiver where pattern matches, equal to or greater than i, where i = offset if offset ≥ 0, i = #length - abs(offset) otherwise, or nil if there is no match.

If pattern is a Regexp, the Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

If pattern responds to #to_str, the matching is performed by byte comparison.

See Also: #rindex
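The offset rule can be sketched as a small helper and compared against Ruby’s built-in String#index and String#rindex, which follow the same convention for negative offsets. The helper effective_offset is hypothetical, and equivalence with U::String is assumed rather than verified.

```ruby
# Hypothetical helper implementing the rule above:
# i = offset if offset >= 0, i = length - abs(offset) otherwise.
def effective_offset(length, offset)
  offset >= 0 ? offset : length - offset.abs
end

text = "hello"

first_l      = text.index("l")                                   # search from the start
from_minus_2 = text.index("l", -2)                               # search from index 5 - 2 = 3
same_thing   = text.index("l", effective_offset(text.length, -2))
last_upto_2  = text.rindex("l", -3)                              # search backwards from index 2
missing      = text.index("z")

p first_l, from_minus_2, same_thing, last_upto_2, missing
```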

rindex(pattern_{Regexp, #to_str}, offset_{#to_int} = `-1`)_Integer^?#⚙

Returns the maximal index of the receiver where pattern matches, equal to or less than i, where i = offset if offset ≥ 0, i = #length - abs(offset) otherwise, or nil if there is no match.

If pattern is a Regexp, the Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

If pattern responds to #to_str, the matching is performed by a byte comparison.

See Also: #index

start_with?(*prefixes_Array)_Boolean#⚙

Returns true if any element of prefixes that responds to #to_str is a byte-level prefix of the receiver.

`<=>`(other_{U::String, #to_str}, locale_{#to_str} = `ENV['LC_COLLATE']`)_Fixnum#⚙

Returns the comparison of the receiver and other using the linguistically correct rules of locale. The locale must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

This operation is known as “collation” and you can find more information about the collation algorithm employed in the Unicode Technical Standard #10, see http://unicode.org/reports/tr10/.

Raises_{Errno::EILSEQ}

If a character in the receiver can’t be converted into the encoding of the locale

See Also

#==
#eql?

casecmp(other_{U::String, #to_str}, locale_{#to_str} = `ENV['LC_COLLATE']`)_Fixnum#⚙

Returns the comparison of #foldcase to other#foldcase using the linguistically correct rules of locale. This is, however, only an approximation of a case-insensitive comparison. The locale must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

This operation is known as “collation” and you can find more information about the collation algorithm employed in the Unicode Technical Standard #10, see http://unicode.org/reports/tr10/.

collation_key(locale)_U::String#⚙

Returns the locale-dependent collation key of the receiver in locale, inheriting any taint and untrust.

Note

Use the collation key when comparing U::Strings to each other repeatedly, as occurs when, for example, sorting a list of U::Strings.
The locale must be given as a language, region, and encoding, for example, “en_US.UTF-8”.

Raises_{Errno::EILSEQ}

If a character in the receiver can’t be converted into the encoding of the locale
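The reason for precomputing keys can be illustrated with Enumerable#sort_by, which computes one key per element and then compares only the keys. The sort_key method below is a hypothetical stand-in (lowercased NFD text), not a real collation key, and built-in Strings are used instead of U::Strings.

```ruby
# Hypothetical stand-in for a collation key; a real key would come from
# #collation_key with a locale such as "en_US.UTF-8".
def sort_key(word)
  word.unicode_normalize(:nfd).downcase
end

words = ["banana", "apple", "Cherry"]

by_bytes = words.sort                        # plain byte-wise ordering
by_key   = words.sort_by { |w| sort_key(w) } # each key is computed once

p by_bytes
p by_key
```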

canonical_combining_class_Fixnum#⚙

Returns the canonical combining class of the characters of the receiver.

The canonical combining class of a character is a number in the range [0, 254]. The canonical combining class is used when generating a canonical ordering of the characters in a string.

The empty string has a canonical combining class of 0.

Raises

ArgumentError: If the receiver contains two characters belonging to different combining classes
ArgumentError: If the receiver contains an incomplete UTF-8 sequence
ArgumentError: If the receiver contains an invalid UTF-8 sequence

general_category_Symbol#⚙

Returns the general category of the characters of the receiver.

The general category identifies what kind of symbol the character is.

Category Major, minor	Unicode Value	Ruby Value
Other, control	Cc	:other_control
Other, format	Cf	:other_format
Other, not assigned	Cn	:other_not_assigned
Other, private use	Co	:other_private_use
Other, surrogate	Cs	:other_surrogate
Letter, lowercase	Ll	:letter_lowercase
Letter, modifier	Lm	:letter_modifier
Letter, other	Lo	:letter_other
Letter, titlecase	Lt	:letter_titlecase
Letter, uppercase	Lu	:letter_uppercase
Mark, spacing combining	Mc	:mark_spacing_combining
Mark, enclosing	Me	:mark_enclosing
Mark, nonspacing	Mn	:mark_non_spacing
Number, decimal digit	Nd	:number_decimal
Number, letter	Nl	:number_letter
Number, other	No	:number_other
Punctuation, connector	Pc	:punctuation_connector
Punctuation, dash	Pd	:punctuation_dash
Punctuation, close	Pe	:punctuation_close
Punctuation, final quote	Pf	:punctuation_final_quote
Punctuation, initial quote	Pi	:punctuation_initial_quote
Punctuation, other	Po	:punctuation_other
Punctuation, open	Ps	:punctuation_open
Symbol, currency	Sc	:symbol_currency
Symbol, modifier	Sk	:symbol_modifier
Symbol, math	Sm	:symbol_math
Symbol, other	So	:symbol_other
Separator, line	Zl	:separator_line
Separator, paragraph	Zp	:separator_paragraph
Separator, space	Zs	:separator_space

Raises

ArgumentError: If the receiver contains two characters belonging to different general categories
ArgumentError: If the receiver contains an incomplete UTF-8 sequence
ArgumentError: If the receiver contains an invalid UTF-8 sequence
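The “one category for the whole string, otherwise ArgumentError” behavior can be sketched with Regexp property classes. Only five of the categories in the table are covered, the helper names and the :not_covered_by_sketch fallback are hypothetical, and a built-in String is used.

```ruby
# A small subset of the table above, keyed by the Ruby value (sketch only).
CATEGORY_PATTERNS = {
  letter_uppercase: /\p{Lu}/,
  letter_lowercase: /\p{Ll}/,
  number_decimal:   /\p{Nd}/,
  separator_space:  /\p{Zs}/,
  symbol_currency:  /\p{Sc}/
}.freeze

def category_of(char)
  found = CATEGORY_PATTERNS.find { |_name, pattern| char.match?(pattern) }
  found ? found.first : :not_covered_by_sketch
end

# Returns the category shared by all characters, or raises ArgumentError
# if two characters belong to different categories.
def shared_category(string)
  categories = string.each_char.map { |char| category_of(char) }.uniq
  raise ArgumentError, "characters belong to different categories" if categories.size > 1
  categories.first
end

p shared_category("ABC")
p shared_category("42")
p shared_category("$")
```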

grapheme_break_Symbol#⚙

Returns the grapheme break property value of the characters of the receiver.

The possible break values are

:control
:cr
:extend
:l
:lf
:lv
:lvt
:other
:prepend
:regional_indicator
:spacingmark
:t
:v

Raises_{ArgumentError}: If the string consists of more than one break type
See Also: Unicode Standard Annex #29: Unicode Text Segmentation

line_break_Symbol#⚙

Returns the line break property value of the characters of the receiver.

The possible break values are

:after
:alphabetic
:ambiguous
:before
:before_and_after
:carriage_return
:close_parenthesis
:close_punctuation
:combining_mark
:complex_context
:conditional_japanese_starter
:contingent
:exclamation
:hangul_l_jamo
:hangul_lv_syllable
:hangul_lvt_syllable
:hangul_t_jamo
:hangul_v_jamo
:hebrew_letter
:hyphen
:ideographic
:infix_separator
:inseparable
:line_feed
:mandatory
:next_line
:non_breaking_glue
:non_starter
:numeric
:open_punctuation
:postfix
:prefix
:quotation
:regional_indicator
:space
:surrogate
:symbol
:unknown
:word_joiner
:zero_width_space

Raises_{ArgumentError}: If the string consists of more than one break type
See Also: Unicode Standard Annex #14: Unicode Line Breaking Algorithm

script_Symbol#⚙

Returns the script of the characters of the receiver.

The script of a character identifies the primary writing system that uses the character.

Script	Description
:arabic	Arabic
:armenian	Armenian
:avestan	Avestan
:balinese	Balinese
:bamum	Bamum
:batak	Batak
:bengali	Bengali
:bopomofo	Bopomofo
:brahmi	Brahmi
:braille	Braille
:buginese	Buginese
:buhid	Buhid
:canadian_aboriginal	Canadian Aboriginal
:carian	Carian
:chakma	Chakma
:cham	Cham
:cherokee	Cherokee
:common	For other characters that may be used with multiple scripts
:coptic	Coptic
:cuneiform	Cuneiform
:cypriot	Cypriot
:cyrillic	Cyrillic
:deseret	Deseret
:devanagari	Devanagari
:egyptian_hieroglyphs	Egyptian Hieroglyphs
:ethiopic	Ethiopic
:georgian	Georgian
:glagolitic	Glagolitic
:gothic	Gothic
:greek	Greek
:gujarati	Gujarati
:gurmukhi	Gurmukhi
:han	Han
:hangul	Hangul
:hanunoo	Hanunoo
:hebrew	Hebrew
:hiragana	Hiragana
:imperial_aramaic	Imperial Aramaic
:inherited	For characters that may be used with multiple scripts, and that inherit their script from the preceding characters; these include nonspacing marks, enclosing marks, and the zero-width joiner/non-joiner characters
:inscriptional_pahlavi	Inscriptional Pahlavi
:inscriptional_parthian	Inscriptional Parthian
:javanese	Javanese
:kaithi	Kaithi
:kannada	Kannada
:katakana	Katakana
:kayah_li	Kayah Li
:kharoshthi	Kharoshthi
:khmer	Khmer
:lao	Lao
:latin	Latin
:lepcha	Lepcha
:limbu	Limbu
:linear_b	Linear B
:lisu	Lisu
:lycian	Lycian
:lydian	Lydian
:malayalam	Malayalam
:mandaic	Mandaic
:meetei_mayek	Meetei Mayek
:meroitic_hieroglyphs	Meroitic Hieroglyphs
:meroitic_cursive	Meroitic Cursive
:miao	Miao
:mongolian	Mongolian
:myanmar	Myanmar
:new_tai_lue	New Tai Lue
:nko	N'Ko
:ogham	Ogham
:old_italic	Old Italic
:old_persian	Old Persian
:old_south_arabian	Old South Arabian
:old_turkic	Old Turkic
:ol_chiki	Ol Chiki
:oriya	Oriya
:osmanya	Osmanya
:phags_pa	Phags-pa
:phoenician	Phoenician
:rejang	Rejang
:runic	Runic
:samaritan	Samaritan
:saurashtra	Saurashtra
:sharada	Sharada
:shavian	Shavian
:sinhala	Sinhala
:sora_sompeng	Sora Sompeng
:sundanese	Sundanese
:syloti_nagri	Syloti Nagri
:syriac	Syriac
:tagalog	Tagalog
:tagbanwa	Tagbanwa
:tai_le	Tai Le
:tai_tham	Tai Tham
:tai_viet	Tai Viet
:takri	Takri
:tamil	Tamil
:telugu	Telugu
:thaana	Thaana
:thai	Thai
:tibetan	Tibetan
:tifinagh	Tifinagh
:ugaritic	Ugaritic
:unknown	For not assigned, private-use, non-character, and surrogate code points
:vai	Vai
:yi	Yi

Raises

ArgumentError: If the receiver contains two characters belonging to different scripts
ArgumentError: If the receiver contains an incomplete UTF-8 sequence
ArgumentError: If the receiver contains an invalid UTF-8 sequence
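A similar lookup can be sketched with Ruby’s Regexp script properties. Only four scripts are covered, the names are hypothetical, and the sketch operates on a built-in String rather than a U::String.

```ruby
# Four of the scripts from the table above (sketch only).
SCRIPT_PATTERNS = {
  latin:    /\p{Latin}/,
  greek:    /\p{Greek}/,
  cyrillic: /\p{Cyrillic}/,
  han:      /\p{Han}/
}.freeze

# Returns the script shared by all characters, or raises ArgumentError
# if the characters belong to different scripts.
def shared_script(string)
  scripts = string.each_char.map do |char|
    found = SCRIPT_PATTERNS.find { |_name, pattern| char.match?(pattern) }
    found ? found.first : :not_covered_by_sketch
  end.uniq
  raise ArgumentError, "characters belong to different scripts" if scripts.size > 1
  scripts.first
end

p shared_script("abc")
p shared_script("\u03BB\u03B1")  # GREEK SMALL LETTER LAMDA, GREEK SMALL LETTER ALPHA
p shared_script("\u65E5\u672C")  # two Han ideographs
```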

word_break_Symbol#⚙

Returns the word break property value of the characters of the receiver.

The possible word break values are

:aletter
:cr
:extend
:extendnumlet
:format
:katakana
:lf
:midletter
:midnum
:midnumlet
:newline
:numeric
:other
:regional_indicator

Raises_{ArgumentError}: If the string consists of more than one break type
See Also: Unicode Standard Annex #29: Unicode Text Segmentation

bytesize_Integer#⚙

Returns the number of bytes required to represent the receiver.

length_Integer#⚙

Returns the number of characters in the receiver.

size_Integer#⚙

This is an alias for #length.

width_Integer#⚙

Returns the width of the receiver. The width is defined as the sum of the number of “cells” on a terminal or similar cell-based display that the characters in the string will require.

Characters that are #wide? have a width of 2. Characters that are #zero_width? have a width of 0. Other characters have a width of 1.

See Also: Unicode Standard Annex #11: East Asian Width
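The width rule (2 for wide, 0 for zero-width, 1 otherwise) can be sketched as follows. WIDE_SAMPLE is a deliberately tiny, illustrative subset of wide characters, not the real East_Asian_Width data, and the Hangul range is assumed to end just before U+1200. All names are hypothetical.

```ruby
# Illustrative subset only: kana, CJK Unified Ideographs, fullwidth forms.
WIDE_SAMPLE = /[\u3041-\u30FF\u4E00-\u9FFF\uFF01-\uFF60]/

# Zero-width as described for #zero_width?: Mn, Me, Cf, the Hangul range
# starting at U+1160 (assumed to end before U+1200), and U+200B.
ZERO_WIDTH = /[\p{Mn}\p{Me}\p{Cf}\u1160-\u11FF\u200B]/

def cell_width(char)
  return 1 if char == "\u00AD"           # SOFT HYPHEN is excluded from zero-width
  return 0 if char.match?(ZERO_WIDTH)
  return 2 if char.match?(WIDE_SAMPLE)
  1
end

# The width of a string is the sum of the widths of its characters.
def display_width(string)
  string.each_char.sum { |char| cell_width(char) }
end

p display_width("abc")
p display_width("\u65E5\u672C")  # two Han ideographs
p display_width("e\u0301")       # 'e' plus a combining accent
```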

each_byte{ |byte_Fixnum| … }_self#⚙

Enumerates the bytes in the receiver.

each_byte_Enumerator#⚙

Returns an Enumerator over the bytes in the receiver.

bytes_{Array<Fixnum>}#⚙

Returns the bytes of the receiver.

each_char{ |char_U::String| … }_self#⚙

Enumerates the characters in the receiver, each inheriting any taint and untrust.

each_char_Enumerator#⚙

Returns an Enumerator over the characters in the receiver.

chars_{Array<U::String>}#⚙

Returns the characters of the receiver, each inheriting any taint and untrust.

each_codepoint{ |codepoint_Integer| … }_self#⚙

Enumerates the code points of the receiver.

each_codepoint_Enumerator#⚙

Returns an Enumerator over the code points of the receiver.

codepoints_{Array<Integer>}#⚙

Returns the code points of the receiver.

each_grapheme_cluster{ |cluster_U::String| … }_self#⚙

Enumerates the grapheme clusters in the receiver, each inheriting any taint and untrust.

See Also: Unicode Standard Annex #29: Unicode Text Segmentation

each_grapheme_cluster_Enumerator#⚙

Returns an Enumerator over the grapheme clusters in the receiver.

See Also: Unicode Standard Annex #29: Unicode Text Segmentation
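The different views of the same content can be compared on a built-in String, which offers #each_grapheme_cluster since Ruby 2.5. It is assumed, not verified, that U::String segments this input the same way.

```ruby
# Built-in String is used as a stand-in for U::String (assumption).
text = "e\u0301x"  # 'e', COMBINING ACUTE ACCENT, 'x'

cluster_count = text.each_grapheme_cluster.to_a.length  # user-perceived characters
char_count    = text.each_char.to_a.length              # code points as characters
code_points   = text.codepoints
byte_count    = text.bytesize

p cluster_count, char_count, code_points, byte_count
```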

grapheme_clusters{ |cluster_U::String| … }_self#⚙

This is an alias for #each_grapheme_cluster.

grapheme_clusters_Enumerator#⚙

This is an alias for #each_grapheme_cluster.

each_line(separator_{U::String, #to_str} = `$/`){ |lp_{U::String, self}| … }_self#⚙

Enumerates the lines of the receiver, inheriting any taint and untrust.

If separator is nil, yields self. If separator is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

each_line(separator_{U::String, #to_str} = `$/`)_Enumerator#⚙

Returns an Enumerator over the lines of the receiver.

If separator is nil, self will be yielded. If separator is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

lines(separator_{U::String, #to_str} = `$/`)_{Array<U::String>}#⚙

Returns the lines of the receiver, inheriting any taint and untrust.

If separator is nil, the result contains only self. If separator is #empty?, separates each line (paragraph) by two or more U+000A LINE FEED characters.

each_word{ |word_U::String| … }_self#⚙

Enumerates the words in the receiver, each inheriting any taint and untrust.

See Also: Unicode Standard Annex #29: Unicode Text Segmentation

each_word_Enumerator#⚙

Returns an Enumerator over the words in the receiver.

See Also: Unicode Standard Annex #29: Unicode Text Segmentation

words{ |word_U::String| … }_self#⚙

This is an alias for #each_word.

words_Enumerator#⚙

This is an alias for #each_word.

`[]`(index_{#to_int})_U::String^?#⚙

Returns the substring [max(i, 0), min(#length, i + 1)], where i = index if index ≥ 0, i = #length - abs(index) otherwise, inheriting any taint and untrust, or nil if this substring is empty.

`[]`(index_{#to_int}, length_{#to_int})_U::String^?#⚙

Returns the substring [max(i, 0), min(#length, i + length)], where i = index if index ≥ 0, i = #length - abs(index) otherwise, inheriting any taint and untrust, or nil if length < 0.
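The index arithmetic of the two-argument form can be written out step by step. The helper slice_by_rule is hypothetical and works on a built-in String. Returning an empty string when the computed range is empty is an assumption of this sketch, as that case isn’t spelled out above.

```ruby
# Hypothetical helper following the rule above for #[](index, length).
def slice_by_rule(string, index, length)
  return nil if length < 0

  chars = string.chars
  i     = index >= 0 ? index : chars.length - index.abs
  from  = [i, 0].max
  to    = [chars.length, i + length].min

  # Assumption of this sketch: an empty range yields an empty string.
  from < to ? chars[from...to].join : ""
end

p slice_by_rule("hello", 1, 3)
p slice_by_rule("hello", -3, 2)
p slice_by_rule("hello", 3, 10)
p slice_by_rule("hello", 0, -1)
```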

`[]`(range_Range)_U::String^?#⚙

Returns the result of #[i, j - k], where i = range#begin if range#begin ≥ 0, i = #length - abs(range#begin) otherwise, j = range#end if range#end ≥ 0, j = #length - abs(range#end) otherwise, and k = 1 if range#exclude_end?, k = 0 otherwise, or nil if j - k < 0.

`[]`(regexp_Regexp, reference_{#to_int, #to_str, Symbol} = `0`)_U::String^?#⚙

Returns the submatch reference from the first match of regexp in the receiver, inheriting any taint and untrust from both the receiver and from regexp, or nil if there is no match or if the submatch isn’t part of the overall match.

Raises_IndexError: If reference doesn’t refer to a submatch

`[]`(string_{U::String, ::String})_U::String^?#⚙

Returns the substring string, inheriting any taint and untrust from string, if string is a substring of the receiver.

`[]`(object_Object)_nil#⚙

Returns nil for any object that doesn’t satisfy the other cases.

slice(index_{#to_int})_U::String^?#⚙

This is an alias for #[].

slice(index_{#to_int}, length_{#to_int})_U::String^?#⚙

This is an alias for #[].

slice(range_Range)_U::String^?#⚙

This is an alias for #[].

slice(regexp_Regexp, reference_{#to_int, #to_str, Symbol} = `0`)_U::String^?#⚙

This is an alias for #[].

slice(string_{U::String, ::String})_U::String^?#⚙

This is an alias for #[].

slice(object_Object)_nil#⚙

This is an alias for #[].

byteslice(index_{#to_int})_U::String^?#⚙

Returns the byte-index-based substring [max(i, 0), min(#bytesize, i + 1)], where i = index if index ≥ 0, i = #bytesize - abs(index) otherwise, inheriting any taint and untrust, or nil if this substring is empty.

byteslice(index_{#to_int}, length_{#to_int})_U::String^?#⚙

Returns the byte-index-based substring [max(i, 0), min(#bytesize, i + length)], where i = index if index ≥ 0, i = #bytesize - abs(index) otherwise, inheriting any taint and untrust, or nil if length < 0.

byteslice(range_Range)_U::String^?#⚙

Returns the result of #byteslice(i, j - k), where i = range#begin if range#begin ≥ 0, i = #bytesize - abs(range#begin) otherwise, j = range#end if range#end ≥ 0, j = #bytesize - abs(range#end) otherwise, and k = 1 if range#exclude_end?, k = 0 otherwise, or nil if j - k < 0.

byteslice(object_Object)_nil#⚙

Returns nil for any object that doesn’t satisfy the other cases.

chomp(separator_{U::String, #to_str, nil} = `$/`)_{U::String, self, nil}#⚙

Returns the receiver, minus any separator suffix, inheriting any taint and untrust, unless #length = 0, in which case nil is returned. If separator is nil or invalidly encoded, the receiver is returned.

If separator is $/ and $/ has its default value or if separator is U+000A LINE FEED, the longest suffix consisting of any of

U+000A LINE FEED
U+000D CARRIAGE RETURN
U+000D CARRIAGE RETURN, U+000A LINE FEED

will be removed. If no such suffix exists and the last character is a #newline?, it will be removed instead.

If separator is #empty?, remove the longest #newline? suffix.

See Also

#chop
#lstrip
#rstrip
#strip
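The default-separator case can be sketched with a Regexp anchored at the end of the string. The helper chomp_default is hypothetical, works on a built-in String, and covers only the three listed suffixes. The fallback to other #newline? characters and the nil result for an empty receiver are left out.

```ruby
# Hypothetical helper: removes one trailing CR LF, LF, or CR.
def chomp_default(string)
  string.sub(/(?:\r\n|\n|\r)\z/, "")
end

p chomp_default("abc\r\n")
p chomp_default("abc\n")
p chomp_default("abc\r")
p chomp_default("abc\n\n")  # only the last line terminator is removed
p chomp_default("abc")
```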

chop_U::String#⚙

Returns the receiver, minus its last character, inheriting any taint and untrust, unless the receiver is #empty? or if the last character is invalidly encoded, in which case the receiver is returned.

If the last character is U+000A LINE FEED and the second-to-last character is U+000D CARRIAGE RETURN, both characters are removed.

See Also

#chomp
#lstrip
#rstrip
#strip

chr_U::String#⚙

Returns the substring [0, min(#length, 1)], inheriting any taint and untrust.

getbyte(index_{#to_int})_Fixnum^?#⚙

Returns the byte at byte-index i, where i = index if index ≥ 0, i = #bytesize - abs(index) otherwise, or nil if i lies outside of [0, #bytesize].

lstrip_U::String#⚙

Returns the receiver with its maximum #space? prefix removed, inheriting any taint and untrust.

See Also

#rstrip
#strip

ord_Integer#⚙

Returns the code point of the first character of the receiver.

rstrip_U::String#⚙

Returns the receiver with its maximum #space? suffix removed, inheriting any taint and untrust from the receiver.

See Also

#lstrip
#strip

strip_U::String#⚙

Returns the receiver with its maximum #space? prefix and suffix removed, inheriting any taint and untrust.

See Also

#lstrip
#rstrip

downcase(locale_{#to_str} = `ENV['LC_CTYPE']`)_U::String#⚙

Returns the downcasing of the receiver according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

foldcase(locale_{#to_str} = `ENV['LC_CTYPE']`)_U::String#⚙

Returns the case-folding of the receiver according to the rules of the language of locale, which may be empty to specifically use the default rules, inheriting any taint and untrust.

titlecase(locale_{#to_str} = `ENV['LC_CTYPE']`)_U::String#⚙

Returns the title-casing of the receiver according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

upcase(locale_{#to_str} = `ENV['LC_CTYPE']`)_U::String#⚙

Returns the upcasing of the receiver according to the rules of the language of locale, which may be empty to specifically use the default, language-independent, rules, inheriting any taint and untrust.

mirror_U::String#⚙

Returns the mirroring of the receiver, inheriting any taint and untrust.

Mirroring is done by replacing characters in the string with their horizontal mirror image, if any, in text that is laid out from right to left. For example, ‘(’ becomes ‘)’ and ‘)’ becomes ‘(’.

See Also: Unicode Standard Annex #9: Unicode Bidirectional Algorithm

normalize(form_{#to_sym} = `:nfd`)_U::String#⚙

Returns the receiver normalized into form, inheriting any taint and untrust.

Normalization is the process of converting characters and sequences of characters in a string into a canonical form. This process includes dealing with whether characters are represented by a composed character or by a base character and combining marks, such as accents.

The possible normalization forms are

Form	Description
`:nfd`	Normalizes characters to their maximally decomposed form, ordering accents and so on according to their combining class
`:nfc`	Normalizes according to `:nfd`, then composes any decomposed characters
`:nfkd`	Normalizes according to `:nfd` and also normalizes “compatibility” characters, such as replacing U+00B3 SUPERSCRIPT THREE with U+0033 DIGIT THREE
`:nfkc`	Normalizes according to `:nfkd`, then composes any decomposed characters
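The four forms can be tried out with Ruby’s built-in String#unicode_normalize (Ruby 2.2 or later), which is used here only as a stand-in for #normalize (an assumption). Note that the built-in method defaults to :nfc, whereas the signature above defaults to :nfd.

```ruby
# Built-in String#unicode_normalize used as a stand-in for U::String#normalize (assumption).
composed   = "\u00E9"    # é as a single code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

p composed.unicode_normalize(:nfd) == decomposed   # => true
p decomposed.unicode_normalize(:nfc) == composed   # => true
p "\u00B3".unicode_normalize(:nfkd)                # => "3" (SUPERSCRIPT THREE)
p "\uFB01".unicode_normalize(:nfkc)                # => "fi" (LATIN SMALL LIGATURE FI)
p composed.unicode_normalized?(:nfc)               # => true
p composed.unicode_normalized?(:nfd)               # => false
```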

See Also: Unicode Standard Annex #15: Unicode Normalization Forms

reverse_U::String#⚙

Returns the reversal of the receiver, inheriting any taint and untrust from the receiver.

Note: This doesn’t take into account proper handling of combining marks, direction indicators, and similarly relevant characters, so this method is mostly useful when you know that the contents of the string are simple and the result isn’t intended for display.
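The caveat about combining marks can be seen with Ruby’s built-in String#reverse, which also reverses code point by code point and is used here as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "e\u0301x"             # "éx": e, U+0301 COMBINING ACUTE ACCENT, x
r = s.reverse
p r == "x\u0301e"          # => true: the accent now follows 'x' instead of 'e'
p r.codepoints             # => [120, 769, 101]
```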

center(width_{#to_int}, padding_{U::String, #to_str} = `' '`)_U::String#⚙

Returns the receiver padded as evenly as possible on both sides with padding to make it max(#length, width) wide, inheriting any taint and untrust from the receiver and also from padding if padding is used.

Raises

ArgumentError: If padding#width = 0
ArgumentError: If characters inside padding that should be used for round-off padding are too wide

See Also

#ljust
#rjust

ljust(width_{#to_int}, padding_{U::String, #to_str} = `' '`)_U::String#⚙

Returns the receiver padded on the right with padding to make it max(#length, width) wide, inheriting any taint and untrust from the receiver and also from padding if padding is used.

Raises

ArgumentError: If padding#width = 0
ArgumentError: If characters inside padding that should be used for round-off padding are too wide

See Also

#center
#rjust

rjust(width_{#to_int}, padding_{U::String, #to_str} = `' '`)_U::String#⚙

Returns the receiver padded on the left with padding to make it max(#length, width) wide, inheriting any taint and untrust from the receiver and also from padding if padding is used.

Raises

ArgumentError: If padding#width = 0
ArgumentError: If characters inside padding that should be used for round-off padding are too wide

See Also

#center
#ljust

count(set_{U::String, #to_str}, *sets_{Array<U::String, #to_str>})_Integer#⚙

Returns the number of characters in the receiver that are included in the intersection of set and any additional sets of characters.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.
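The set syntax (intersection, ‘^’ complement, and a-b ranges) can be sketched with Ruby’s built-in String#count, used here as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "hello world"
p s.count("lo")        # => 5 (three 'l' and two 'o')
p s.count("lo", "o")   # => 2 (intersection of the two sets is just 'o')
p s.count("^l")        # => 8 (everything except 'l')
p s.count("a-h")       # => 3 ('h', 'e', and 'd')
```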

delete(set_{U::String, #to_str}, *sets_{Array<U::String, #to_str>})_U::String#⚙

Returns the receiver, minus any characters that are included in the intersection of set and any additional sets of characters, inheriting any taint and untrust.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

squeeze(*sets_{Array<U::String, #to_str>})_U::String#⚙

Returns the receiver, replacing any substrings of #length > 1 consisting of the same character c with c, where c is a member of the intersection of the character sets in sets, inheriting any taint and untrust.

If sets is empty, then the set of all Unicode characters is used.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.
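For illustration, the same squeezing rules on Ruby’s built-in String, used as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "yellow moon".squeeze                   # => "yelow mon"
p "  now   is  the".squeeze(" ")          # => " now is the"
p "putters shoot balls".squeeze("m-z")    # => "puters shot balls"
```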

tr(from_{#to_str}, to_{#to_str})_U::String#⚙

Returns the receiver, translating characters in from to their equivalent character, by index, in to, inheriting any taint and untrust. If to#length < from#length, to[-1] will be used for any index i > to#length.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.
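A brief sketch of translation by index, including the case where to is shorter than from, using Ruby’s built-in String#tr as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "hello".tr("el", "ip")      # => "hippo"
p "hello".tr("a-y", "b-z")    # => "ifmmp" (ranges are expanded on both sides)
p "hello".tr("aeiou", "*")    # => "h*ll*" (the last character of to is reused)
p "hello".tr("^aeiou", "*")   # => "*e**o" (complement of the vowels)
```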

tr_s(from_{#to_str}, to_{#to_str})_U::String#⚙

Returns the receiver, translating characters in from to their equivalent character, by index, in to and then squeezing any substrings of #length > 1 consisting of the same character c with c, inheriting any taint and untrust. If to#length < from#length, to[-1] will be used for any index i > to#length.

The complement of all Unicode characters and a given set of characters may be specified by prefixing a non-empty set with ‘^’ (U+005E CIRCUMFLEX ACCENT).

Any sequence of characters a-b inside a set will expand to also include all characters whose code points lie between those of a and b.

partition(separator_{Regexp, #to_str})_{Array<U::String>}#⚙

Returns the receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i+n, -1), where i = j if j ≠ nil, i = #length otherwise, j = #index(separator), n = separator#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from separator and also from the receiver if separator is a Regexp.

See Also: #rpartition

rpartition(separator_{Regexp, #to_str})_{Array<U::String>}#⚙

Returns the receiver split into s₁ = #slice(0, i), s₂ = #slice(i, n), s₃ = #slice(i + n, -1), where i = j if j ≠ nil, i = 0 otherwise, j = #rindex(separator), n = separator#length, where s₁ and s₃ inherit any taint and untrust from the receiver and s₂ inherits any taint and untrust from separator and also from the receiver if separator is a Regexp.
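The three-way split performed by #partition and #rpartition, including the no-match cases (i = #length and i = 0, respectively), can be sketched with Ruby’s built-in String as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "hello".partition("l")     # => ["he", "l", "lo"]  (first match)
p "hello".partition("x")     # => ["hello", "", ""]  (no match: everything in s1)
p "hello".rpartition("l")    # => ["hel", "l", "o"]  (last match)
p "hello".rpartition("x")    # => ["", "", "hello"]  (no match: everything in s3)
```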

See Also: #partition

scan(pattern_Regexp)_{Array<U::String>⁺}#⚙

Returns all matches – or sub-matches, if they exist – of matches of pattern in the receiver, each inheriting any taint and untrust from both the receiver and from pattern.

Note: The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

scan(pattern_{#to_str})_{Array<U::String>}#⚙

Returns all matches of pattern in the receiver, each inheriting any taint and untrust from the receiver.

scan(pattern_Regexp){ |submatches_{Array<U::String>}| … }_self#⚙

Enumerates the sub-matches of matches of pattern in the receiver, each inheriting any taint and untrust from both the receiver and from pattern.

Note: The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

scan(pattern_{#to_str}){ |match_U::String| … }_self#⚙

Enumerates the matches of pattern in the receiver, each inheriting any taint and untrust from the receiver.
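The four forms of #scan can be illustrated with Ruby’s built-in String#scan, used here as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
s = "cruel world"
p s.scan(/\w+/)          # => ["cruel", "world"]
p s.scan(/(..)(..)/)     # => [["cr", "ue"], ["l ", "wo"]] (sub-matches per match)
p s.scan("l")            # => ["l", "l"]

# Block form: yields each match and returns the receiver.
words = []
result = s.scan(/\w+/) { |match| words << match.upcase }
p words                  # => ["CRUEL", "WORLD"]
p result.equal?(s)       # => true
```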

split(pattern_{Regexp, #to_str} = `$;`, limit_{#to_int} = `0`)_{Array<U::String>}#⚙

Returns the receiver split into limit substrings separated by pattern, each inheriting any taint and untrust.

If pattern = $; = nil or pattern = ' ', splits according to AWK rules, that is, any #space? prefix is skipped, then substrings are separated by non-empty #space? substrings.

If limit < 0, then no limit is imposed and trailing #empty? substrings aren’t removed.

If limit = 0, then no limit is imposed and trailing #empty? substrings are removed.

If limit = 1, then, if #length = 0, the result will be empty, otherwise it will consist of the receiver only.

If limit > 1, then the receiver is split into at most limit substrings.
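The effect of limit and of AWK-style splitting can be sketched with Ruby’s built-in String#split, used as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p " a  b c ".split(" ")    # => ["a", "b", "c"]        (AWK rules)
p "a,b,,".split(",")       # => ["a", "b"]             (limit = 0: trailing empties removed)
p "a,b,,".split(",", -1)   # => ["a", "b", "", ""]     (limit < 0: trailing empties kept)
p "a,b,,".split(",", 2)    # => ["a", "b,,"]           (at most 2 substrings)
p "a,b".split(",", 1)      # => ["a,b"]                (limit = 1: the receiver only)
p "".split(",", 1)         # => []                     (limit = 1, empty receiver)
```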

gsub(pattern_{Regexp, #to_str}, replacement_{#to_str})_U::String#⚙

Returns the receiver with all matches of pattern replaced by replacement, inheriting any taint and untrust from the receiver and from replacement.

The replacement is used as a specification for what to replace matches with:

Specification	Replacement
`\1`, `\2`, …, `\`n	Numbered sub-match n
`\k<`name`>`	Named sub-match name

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.
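As a sketch of the replacement specification, here are numbered and named sub-matches with Ruby’s built-in String#gsub, used as a stand-in (an assumption). Single-quoted strings keep the backslashes intact.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "hello".gsub(/(l+)/, '<\1>')
# => "he<ll>o"

p "2024-01".gsub(/(?<y>\d+)-(?<m>\d+)/, '\k<m>/\k<y>')
# => "01/2024"
```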

gsub(pattern_{Regexp, #to_str}, replacements_{#to_hash})_U::String#⚙

Returns the receiver with all matches of pattern replaced by replacements#[match], where match is the matched substring, inheriting any taint and untrust from the receiver and from the replacements#[match]es, as well as any taint on replacements.

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

Raises

RuntimeError: If any replacement is the result being constructed
Exception: Any error raised by replacements#default, if it gets called

gsub(pattern_{Regexp, #to_str}){ |match_U::String|_{#to_str} … }_U::String#⚙

Returns the receiver with all matches of pattern replaced by the results of the given block, inheriting any taint and untrust from the receiver and from the results of the given block.

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

gsub(pattern_{Regexp, #to_str})_Enumerator#⚙

Returns an Enumerator over the matches of pattern in the receiver.

The Regexp special variables $&, $', $`, $1, $2, …, $n will be updated accordingly.
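The hash, block, and Enumerator forms of #gsub can be sketched in the same way with Ruby’s built-in String, used as a stand-in (an assumption). The name replacements is only illustrative.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
replacements = { "cat" => "dog", "hat" => "cap" }   # hypothetical lookup table
p "cat hat".gsub(/[ch]at/, replacements)    # => "dog cap"

p "hello".gsub(/[el]/) { |m| m.upcase }     # => "hELLo"

p "hello".gsub(/l/).to_a                    # => ["l", "l"] (Enumerator over matches)
```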

sub(pattern_{Regexp, #to_str}, replacement_{#to_str})_U::String^?#⚙

Returns the receiver with the first match of pattern replaced by replacement, inheriting any taint and untrust from the receiver and from replacement, or nil if there’s no match.

The replacement is used as a specification for what to replace matches with:

Specification	Replacement
`\1`, `\2`, …, `\`n	Numbered sub-match n
`\k<`name`>`	Named sub-match name

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

sub(pattern_{Regexp, #to_str}, replacements_{#to_hash})_U::String^?#⚙

Returns the receiver with the first match of pattern replaced by replacements#[match], where match is the matched substring, inheriting any taint and untrust from the receiver, replacements, and replacements#[match], or nil if there’s no match.

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

Raises_{Exception}: Any error raised by replacements#default, if it gets called

sub(pattern_{Regexp, #to_str}){ |match_U::String|_{#to_str} … }_U::String^?#⚙

Returns the receiver with the first match of pattern replaced by the result of the given block, inheriting any taint and untrust from the receiver and from the result of the given block, or nil if there’s no match.

The Regexp special variables $&, $', $`, $1, $2, …, $n are updated accordingly.

`+`(other_{U::String, #to_str})_U::String#⚙

Returns the concatenation of other to the receiver, inheriting any taint on either.

Raises_{ArgumentError}: If #bytesize + other#bytesize > LONG_MAX

`*`(n_{#to_int})_U::String#⚙

Returns the concatenation of n copies of the receiver, inheriting any taint and untrust.

Raises

ArgumentError: If n < 0
ArgumentError: If n > 0 and n × #bytesize > LONG_MAX

`%`(value)_U::String#⚙

Returns a formatted string of the values in Array(value) by treating the receiver as a format specification of this formatted string.

A format specification is a string consisting of sequences of normal characters that are copied verbatim and field specifiers. A field specifier consists of a %, followed by any optional flags, an optional width, an optional precision, and a directive:

%[flags][width][.[precision]]directive

Note that this means that a lone % at the end of the string is simply copied verbatim, as it, by this definition, isn’t a field specifier.

The directive determines how this field should be formatted. The flags, width, and precision modify this interpretation.

The field often takes a value from value and formats it according to a given set of rules, which depend on the flags, width, and precision, but can also output other, hardwired, values.

The directives that don’t take a value are

Directive	Description
%	Outputs ‘%’.
\n	Outputs “%\n”.
\0	Outputs “%\0”.

None of these directives take any flags, width, or precision.

All of the following directives allow you to specify a width. The width only ever limits the minimum width of the field, that is, at least width cells will be filled by the field, but perhaps more will actually be required in the end.

c

Outputs

[left-padding]character[right-padding]

If a width w has been specified and the ‘-’ flag hasn’t been given, left-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

Character is the result of #to_str#chr on the argument, if it responds to #to_str, otherwise it’s the result of #to_int turned into a string containing the character at that code point. A precision isn’t allowed. The #width of the character is used in any width calculations.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

s

Outputs

[left-padding]string[right-padding]

Left-padding and right-padding are the same as for the ‘c’ directive described above.

String is a substring of the result of #to_s on the argument that is w cells wide, where w = precision, if a precision has been specified, w = #width otherwise.

p

Outputs

[left-padding]inspect[right-padding]

Left-padding and right-padding are the same as for the ‘c’ directive described above.

Inspect is a substring of the result of #inspect on the argument that is w cells wide, where w = precision, if a precision has been specified, w = #width otherwise.
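The ‘c’, ‘s’, and ‘p’ directives can be tried with Ruby’s built-in String#%, used here as a stand-in (an assumption). The built-in version pads by character count rather than by cells, so results may differ for wide characters.

```ruby
# Built-in String#% used as a stand-in for U::String#% (assumption).
p "%c" % 97              # => "a"      (code point to character)
p "%3c|" % 97            # => "  a|"   (width pads on the left)
p "%5s|" % "ab"          # => "   ab|"
p "%-5s|" % "ab"         # => "ab   |" ('-' flag pads on the right)
p "%.2s" % "abcdef"      # => "ab"     (precision truncates)
p "%p" % ["ab"]          # => "\"ab\"" (uses #inspect)
```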

d

i

u

Outputs

[left-padding][prefix/sign][zeroes]
   [precision-filler]digits[right-padding]

If a width w has been specified and neither the ‘-’ nor the ‘0’ flag has been given, left-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

Prefix/sign is “-” if the argument is negative, “+” if the ‘+’ flag was given, and “ ” if the ‘ ’ flag was given, otherwise it’s empty.

If a width w has been specified and the ‘0’ flag has been given and neither the ‘-’ flag has been given nor a precision has been specified, zeroes consists of enough zeroes to make the whole field at least w cells wide, otherwise it’s empty.

If a precision p has been specified, precision-filler consists of enough zeroes to make for p digits of output, otherwise it’s empty.

Digits consists of the digits in base 10 that represent the result of calling Integer with the argument as its argument.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

Flag	Description
(Space)	Add a “ ” prefix to non-negative numbers
`+`	Add a “+” sign to non-negative numbers; overrides the ‘ ’ flag
`0`	Use ‘0’ for any width padding; ignored when a precision has been specified
`-`	Left justify the output with ‘ ’ as padding; overrides the ‘`0`’ flag
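A short sketch of how the flags, width, and precision interact for the ‘d’ directive, using Ruby’s built-in String#% as a stand-in (an assumption):

```ruby
# Built-in String#% used as a stand-in for U::String#% (assumption).
p "%d" % 42        # => "42"
p "%d" % "12"      # => "12"     (argument is converted with Integer)
p "%5d" % 42       # => "   42"  (left-padding with spaces)
p "%-5d|" % 42     # => "42   |" (right-padding)
p "%05d" % 42      # => "00042"  (zeroes)
p "%05d" % -42     # => "-0042"  (sign comes before the zeroes)
p "%+d" % 42       # => "+42"
p "% d" % 42       # => " 42"
p "%.5d" % 42      # => "00042"  (precision-filler)
```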

o

Outputs

[left-padding][prefix/sign][zeroes/sevens]
   [precision-filler]octal-digits[right-padding]

Prefix/sign is “-” if the argument is negative and the ‘+’ or ‘ ’ flag was given, “..” if the argument is negative, “+” if the ‘+’ flag was given, and “ ” if the ‘ ’ flag was given, otherwise it’s empty.

If a width w has been specified and the ‘0’ flag has been given and neither the ‘-’ flag has been given nor a precision has been specified, zeroes/sevens consists of enough zeroes, if the argument is non-negative or if the ‘+’ or ‘ ’ flag has been specified, sevens otherwise, to make the whole field at least w cells wide, otherwise it’s empty.

If a precision p has been specified, precision-filler consists of enough zeroes, if the argument is non-negative or if the ‘+’ or ‘ ’ flag has been specified, sevens otherwise, to make for p digits of output, otherwise it’s empty.

Octal-digits consists of the digits in base 8 that represent the result of #to_int on the argument, using ‘0’ through ‘7’. A negative value will be output as a two’s complement value.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

Flag	Description
(Space)	Add a “ ” prefix to non-negative numbers and don’t output negative numbers as two’s complement values
`+`	Add a “+” sign to non-negative numbers and don’t output negative numbers as two’s complement values; overrides the ‘ ’ flag
`0`	Use ‘0’ for any width padding; ignored when a precision has been specified
`-`	Left justify the output with ‘ ’ as padding; overrides the ‘`0`’ flag
`#`	Increase precision to include as many digits as necessary to make the first digit ‘0’, but don’t include the ‘0’ itself

x

Outputs

[left-padding][sign][base-prefix][prefix][zeroes/fs]
   [precision-filler]hexadecimal-digits[right-padding]

Left-padding and right-padding are the same as for the ‘o’ directive described above. Zeroes/fs is the same as zeroes/sevens for the ‘o’ directive, except that it uses ‘f’ characters instead of sevens. The same goes for precision-filler.

Sign is “-” if the argument is negative and the ‘+’ or ‘ ’ flag was given, “+” if the argument is non-negative and the ‘+’ flag was given, and “ ” if the argument is non-negative and the ‘ ’ flag was given, otherwise it’s empty.

Base-prefix is “0x” if the ‘#’ flag was given and the result of #to_int on the argument is non-zero.

Prefix is “..” if the argument is negative and neither the ‘+’ nor the ‘ ’ flag was given.

Hexadecimal-digits consists of the digits in base 16 that represent the result of #to_int on the argument, using ‘0’ through ‘9’ and ‘a’ through ‘f’. A negative value will be output as a two’s complement value.

Flag	Description
(Space)	Same as for ‘o’
`+`	Same as for ‘o’
`0`	Same as for ‘o’
`-`	Same as for ‘o’
`#`	Prefix non-zero values with “0x”

X

Same as ‘x’, except that it uses uppercase letters instead.

b

Outputs

[left-padding][sign][base-prefix][prefix][zeroes/ones]
   [precision-filler]binary-digits[right-padding]

Left-padding and right-padding are the same as for the ‘o’ directive described above. Base-prefix and prefix are the same as for the ‘x’ directive, except that base-prefix outputs “0b”. Zeroes/ones is the same as zeroes/fs for the ‘x’ directive, except that it uses ones instead of ‘f’ characters. The same goes for precision-filler.

Binary-digits consists of the digits in base 2 that represent the result of #to_int on the argument, using ‘0’ and ‘1’. A negative value will be output as a two’s complement value.

Flag	Description
(Space)	Same as for ‘o’
`+`	Same as for ‘o’
`0`	Same as for ‘o’
`-`	Same as for ‘o’
`#`	Prefix non-zero values with “0b”

B

Same as ‘b’, except that it uses a “0B” prefix for the ‘#’ flag.
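The ‘o’, ‘x’, ‘X’, and ‘b’ directives, including the “..” two’s complement prefix for negative values, can be sketched with Ruby’s built-in String#%, used as a stand-in (an assumption):

```ruby
# Built-in String#% used as a stand-in for U::String#% (assumption).
p "%o" % 8         # => "10"
p "%#o" % 8        # => "010"
p "%x" % 255       # => "ff"
p "%X" % 255       # => "FF"
p "%#x" % 255      # => "0xff"
p "%b" % 5         # => "101"
p "%#b" % 5        # => "0b101"
p "%08b" % 5       # => "00000101"

# Negative values: two's complement unless '+' or ' ' is given.
p "%x" % -255      # => "..f01"
p "%+x" % -255     # => "-ff"
p "%b" % -5        # => "..1011"
```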

f

Outputs

[left-padding][prefix/sign][zeroes]
   integer-part[decimal-point][fractional-part][right-padding]

Prefix/sign is “-” if the argument is negative, “+” if the ‘+’ flag was given, and “ ” if the ‘ ’ flag was given, otherwise it’s empty.

If a width w has been specified and the ‘0’ flag has been given and the ‘-’ flag has not been given, zeroes consists of enough zeroes to make the whole field at least w cells wide, otherwise it’s empty.

Integer-part consists of the digits in base 10 that represent the integer part of the result of calling Float with the argument as its argument.

Decimal-point is “.” if the precision isn’t 0 or if the ‘#’ flag has been given.

Fractional-part consists of p digits in base 10 that represent the fractional part of the result of calling Float with the argument as its argument, where p = precision, if one has been specified, p = 6 otherwise.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w cells wide, otherwise it’s empty.

Flag	Description
(Space)	Add a “ ” prefix to non-negative numbers
`+`	Add a “+” sign to non-negative numbers; overrides the ‘ ’ flag
`0`	Use ‘0’ for any width padding; ignored when a precision has been specified
`-`	Left justify the output with ‘ ’ as padding; overrides the ‘`0`’ flag
`#`	Output a decimal point, even if no fractional part follows

e

Outputs

[left-padding][prefix/sign][zeroes]
   digit[decimal-point][fractional-part]exponent[right-padding]

If a width w has been specified and neither the ‘-’ nor the ‘0’ flag has been given, left-padding consists of enough spaces to make the whole field at least w + e cells wide, where e ≥ 4 is the width of the exponent, otherwise it’s empty.

Prefix/sign is “-” if the argument is negative, “+” if the ‘+’ flag was given, and “ ” if the ‘ ’ flag was given, otherwise it’s empty.

If a width w has been specified and the ‘0’ flag has been given and the ‘-’ flag has not been given, zeroes consists of enough zeroes to make the whole field w + e cells wide, where e ≥ 4 is the width of the exponent, otherwise it’s empty.

Digit consists of one digit in base 10 that represents the most significant digit of the result of calling Float with the argument as its argument.

Decimal-point is “.” if the precision isn’t 0 or if the ‘#’ flag has been given.

Fractional-part consists of p digits in base 10 that represent all but the most significant digit of the result of calling Float with the argument as its argument, where p = precision, if one has been specified, p = 6 otherwise.

Exponent consists of “e” followed by the exponent in base 10 required to turn the result of calling Float with the argument as its argument into a decimal fraction with one non-zero digit in the integer part. If the exponent is 0, “+00” will be output.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w + e cells wide, where e ≥ 4 is the width of the exponent, otherwise it’s empty.

Flag	Description
(Space)	Add a “ ” prefix to non-negative numbers
`+`	Add a “+” sign to non-negative numbers; overrides the ‘ ’ flag
`0`	Use ‘0’ for any width padding; ignored when a precision has been specified
`-`	Left justify the output with ‘ ’ as padding; overrides the ‘`0`’ flag
`#`	Output a decimal point, even if no fractional part follows

E

Same as ‘e’, except that it uses an uppercase ‘E’ for the exponent separator.

g

Same as ‘e’ if the exponent is less than -4 or if the exponent is greater than or equal to the precision, otherwise ‘f’ is used. The precision defaults to 6 and a precision of 0 is treated as a precision of 1. Trailing zeros are removed from the fractional part of the result.

G

Same as ‘g’, except that it uses an uppercase ‘E’ for the exponent separator.
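The exponent forms, and the switch that ‘g’ makes between ‘e’ and ‘f’, can be sketched with Ruby’s built-in String#%, used as a stand-in (an assumption):

```ruby
# Built-in String#% used as a stand-in for U::String#% (assumption).
p "%e" % 12345.678      # => "1.234568e+04"
p "%.2e" % 0.00012      # => "1.20e-04"
p "%E" % 1.0            # => "1.000000E+00"

p "%g" % 0.0001         # => "0.0001"      (exponent -4: 'f' style)
p "%g" % 0.00001        # => "1e-05"       (exponent -5: 'e' style)
p "%g" % 100000.0       # => "100000"      (exponent 5 < precision 6)
p "%g" % 1234567.0      # => "1.23457e+06" (exponent 6 >= precision 6)
p "%G" % 0.00001        # => "1E-05"
```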

a

Outputs

[left-padding][prefix/sign][zeroes]
   digit[hexadecimal-point][fractional-part]exponent[right-padding]

If a width w has been specified and neither the ‘-’ nor the ‘0’ flag has been given, left-padding consists of enough spaces to make the whole field at least w + e cells wide, where e ≥ 3 is the width of the exponent, otherwise it’s empty.

Prefix/sign is “-” if the argument is negative, “+” if the ‘+’ flag was given, and “ ” if the ‘ ’ flag was given, otherwise it’s empty.

If a width w has been specified and the ‘0’ flag has been given and the ‘-’ flag has not been given, zeroes consists of enough zeroes to make the whole field w + e cells wide, where e ≥ 3 is the width of the exponent, otherwise it’s empty.

Digit consists of one digit in base 16 that represents the most significant digit of the result of calling Float with the argument as its argument, using ‘0’ through ‘9’ and ‘a’ through ‘f’.

Hexadecimal-point is “.” if the precision isn’t 0 or if the ‘#’ flag has been given.

Fractional-part consists of p digits in base 16 that represent all but the most significant digit of the result of calling Float with the argument as its argument, where p = precision, if one has been specified, p = q, where q is the number of digits required to represent the number exactly, otherwise. Digits are output using ‘0’ through ‘9’ and ‘a’ through ‘f’.

Exponent consists of “p” followed by the exponent of 2 in base 10 required to turn the result of calling Float with the argument as its argument into a hexadecimal fraction with one non-zero digit in the integer part. If the exponent is 0, “+0” will be output.

If a width w has been specified and the ‘-’ flag has been given, right-padding consists of enough spaces to make the whole field at least w + e cells wide, where e ≥ 3 is the width of the exponent, otherwise it’s empty.

Flag	Description
(Space)	Add a “ ” prefix to non-negative numbers
`+`	Add a “+” sign to non-negative numbers; overrides the ‘ ’ flag
`0`	Use ‘0’ for any width padding; ignored when a precision has been specified
`-`	Left justify the output with ‘ ’ as padding; overrides the ‘`0`’ flag
`#`	Output a decimal point, even if no fractional part follows

A

Same as ‘a’, except that it uses uppercase letters instead.

A warning is issued if the ‘0’ flag is given when the ‘-’ flag has also been given to the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘0’ flag is given when a precision has been specified for the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘ ’ flag is given when the ‘+’ flag has also been given to the ‘d’, ‘i’, ‘u’, ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives.

A warning is issued if the ‘0’ flag is given when the ‘o’, ‘x’, ‘X’, ‘b’, or ‘B’ directives have been given a negative argument.

A warning is issued if the ‘#’ flag is given when the ‘o’ directive has been given a negative argument.

Any taint on the receiver and any taint on arguments to any ‘s’ and ‘p’ directives is inherited by the result.

Raises

ArgumentError: If the receiver isn’t a valid format specification
ArgumentError: If any flags are given to the ‘%’, ‘\n’, or ‘\0’ directives
ArgumentError: If an argument is given to the ‘%’, ‘\n’, or ‘\0’ directives
ArgumentError: If a width is specified for the ‘%’, ‘\n’, or ‘\0’ directives
ArgumentError: If a precision is specified for the ‘%’, ‘\n’, ‘\0’, or ‘c’ directives
ArgumentError: If any of the flags ‘ ’, ‘+’, ‘0’, or ‘#’ are given to the ‘c’, ‘s’, or ‘p’ directives
ArgumentError: If the ‘#’ flag is given to the ‘d’, ‘i’, or ‘u’ directives
ArgumentError: If the argument to the ‘c’ directive doesn’t respond to #to_str or #to_int

format(value)_U::String#⚙

This is an alias for #%.

dump_U::String#⚙

Returns the receiver in a reader-friendly format, inheriting any taint and untrust.

The reader-friendly format looks like “"…".u”. Inside the “…”, any #print? characters in the ASCII range are output as-is, the following special characters are escaped according to the following table:

Character	Dumped Sequence
U+0022 QUOTATION MARK	`\"`
U+005C REVERSE SOLIDUS	`\\`
U+000A LINE FEED (LF)	`\n`
U+000D CARRIAGE RETURN (CR)	`\r`
U+0009 CHARACTER TABULATION	`\t`
U+000C FORM FEED (FF)	`\f`
U+000B LINE TABULATION	`\v`
U+0008 BACKSPACE	`\b`
U+0007 BELL	`\a`
U+001B ESCAPE	`\e`

the following special sequences are also escaped:

Character	Dumped Sequence
`#$`	`\#$`
`#@`	`\#@`
`#{`	`\#{`

any valid UTF-8 byte sequences are output as “\u{n}”, where n is the lowercase hexadecimal representation of the code point encoded by the UTF-8 sequence, and any other byte is output as “\xn”, where n is the two-digit uppercase hexadecimal representation of the byte’s value.

inspect_String#⚙

Returns the receiver in a reader-friendly inspectable format, inheriting any taint and untrust, encoded using UTF-8.

The reader-friendly inspectable format looks like “"…".u”. Inside the “…”, any #print? characters are output as-is, the following special characters are escaped according to the following table:

Character	Dumped Sequence
U+0022 QUOTATION MARK	`\"`
U+005C REVERSE SOLIDUS	`\\`
U+000A LINE FEED (LF)	`\n`
U+000D CARRIAGE RETURN (CR)	`\r`
U+0009 CHARACTER TABULATION	`\t`
U+000C FORM FEED (FF)	`\f`
U+000B LINE TABULATION	`\v`
U+0008 BACKSPACE	`\b`
U+0007 BELL	`\a`
U+001B ESCAPE	`\e`

the following special sequences are also escaped:

Character	Dumped Sequence
`#$`	`\#$`
`#@`	`\#@`
`#{`	`\#{`

Valid UTF-8 byte sequences representing code points < 0x10000 are output as \un, where n is the four-digit uppercase hexadecimal representation of the code point.

Valid UTF-8 byte sequences representing code points ≥ 0x10000 are output as \u{n}, where n is the uppercase hexadecimal representation of the code point.

Any other byte is output as \xn, where n is the two-digit uppercase hexadecimal representation of the byte’s value.

hash_Fixnum#⚙

Returns the hash value of the receiver’s content.

hex_Integer#⚙

Returns the result of #to_i(16).

oct_Integer#⚙

Returns the result of #to_i(8), but with the added provision that any leading base specification in the receiver will override the suggested octal (8) base, that is, '0b11'.u#oct = 3, not 9.
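The difference between #hex and #oct with respect to base prefixes can be sketched with Ruby’s built-in String, used as a stand-in (an assumption):

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "ff".hex       # => 255
p "0x1A".hex     # => 26
p "-0x1A".hex    # => -26
p "zz".hex       # => 0   (no valid digits)

p "11".oct       # => 9
p "0b11".oct     # => 3   (the "0b" prefix overrides base 8)
p "0x1f".oct     # => 31  (the "0x" prefix overrides base 8)
```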

to_i(base_{#to_int} = `10`)_Integer#⚙

Returns the Integer value that results from treating the receiver as a string of digits in base.

The conversion algorithm is

Skip any leading #space?s
Check for an optional sign, ‘+’ or ‘-’
If base is 2, skip an optional “0b” or “0B” prefix
If base is 8, skip an optional “0o” or “0O” prefix
If base is 10, skip an optional “0d” or “0D” prefix
If base is 16, skip an optional “0x” or “0X” prefix
Skip any ‘0’s
Read as long a sequence of digits in base as possible, separated by optional U+005F LOW LINE characters, using letters in the following ranges of characters for digits, or the character’s digit value, if any
- U+0041 LATIN CAPITAL LETTER A through U+005A LATIN CAPITAL LETTER Z
- U+0061 LATIN SMALL LETTER A through U+007A LATIN SMALL LETTER Z
- U+FF21 FULLWIDTH LATIN CAPITAL LETTER A through U+FF3A FULLWIDTH LATIN CAPITAL LETTER Z
- U+FF41 FULLWIDTH LATIN SMALL LETTER A through U+FF5A FULLWIDTH LATIN SMALL LETTER Z
Note that only one separator is allowed in a row.
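The conversion steps can be followed with Ruby’s built-in String#to_i, used here as a stand-in (an assumption). The built-in method defaults to base 10 and doesn’t accept the fullwidth letters listed above.

```ruby
# Built-in String used as a stand-in for U::String (assumption).
p "  -1A".to_i(16)    # => -26  (leading space skipped, sign applied)
p "0x1A".to_i(16)     # => 26   (optional "0x" prefix skipped)
p "0b101".to_i(2)     # => 5    (optional "0b" prefix skipped)
p "1_000".to_i        # => 1000 (single separators are allowed)
p "1__000".to_i       # => 1    (two separators in a row end the number)
p "12abc".to_i        # => 12   (reading stops at the first non-digit)
p "z".to_i(36)        # => 35   (letters serve as digits above 9)

begin
  "10".to_i(1)        # base must satisfy 2 <= base <= 36
rescue ArgumentError => e
  p e.class           # => ArgumentError
end
```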

Raises_{ArgumentError}: Unless 2 ≤ base ≤ 36

to_str_String#⚙

Returns the String representation of the receiver, inheriting any taint and untrust, encoded as UTF-8.

to_s_String#⚙

This is an alias for #to_str.

b_String#⚙

Returns the String representation of the receiver, inheriting any taint and untrust, encoded as ASCII-8BIT.
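The difference between the UTF-8 and ASCII-8BIT representations can be seen with Ruby’s built-in String#b, which also returns the same bytes tagged as ASCII-8BIT. The snippet uses a plain String as a stand-in for the result of U::String#to_str; that U::String#b behaves like String#b here is an assumption.

```ruby
# Plain String stands in for the value returned by U::String#to_str
# (an assumption for illustration).
utf8 = "caf\u00E9"
binary = utf8.b                 # same bytes, tagged as ASCII-8BIT

p utf8.encoding                 # the UTF-8 encoding
p binary.encoding               # the ASCII-8BIT (binary) encoding
p utf8.bytes == binary.bytes    # => true, only the encoding tag differs
p utf8.length                   # => 4, counted in characters
p binary.length                 # => 5, counted in bytes (é takes two)
```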

to_sym_Symbol#⚙

Returns the Symbol representation of the receiver.

Raises

EncodingError: If the receiver contains an invalid UTF-8 sequence
RuntimeError: If there’s no more room for a new Symbol in Ruby’s Symbol table
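Ruby’s built-in String#to_sym raises EncodingError for invalid UTF-8 in the same way, so the failure mode can be tried with plain Strings. The helper `safe_symbol` is a hypothetical name used only for this illustration.

```ruby
# Plain String stands in for U::String (an assumption for illustration).
# safe_symbol is a hypothetical helper that returns nil instead of raising.
def safe_symbol(string)
  string.to_sym
rescue EncodingError
  nil # invalid UTF-8 cannot become a Symbol
end

invalid = "\xFF".dup.force_encoding(Encoding::UTF_8)

p safe_symbol('name')    # => :name
p safe_symbol(invalid)   # => nil
```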

intern_Symbol#⚙

This is an alias for #to_sym.

String_Class

initialize(string_String^? = `nil`)#⚙

u_self#⚙

valid_encoding?_Boolean#⚙

alnum?_Boolean#⚙

alpha?_Boolean#⚙

ascii_only?_Boolean#⚙

assigned?_Boolean#⚙

case_ignorable?_Boolean#⚙

cased?_Boolean#⚙

cntrl?_Boolean#⚙

defined?_Boolean#⚙

digit?_Boolean#⚙

folded?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

graph?_Boolean#⚙

lower?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

newline?_Boolean#⚙

print?_Boolean#⚙

punct?_Boolean#⚙

soft_dotted?_Boolean#⚙

space?_Boolean#⚙

title?_Boolean#⚙

upper?(locale_{#to_str} = `ENV['LC_CTYPE']`)_Boolean#⚙

valid?_Boolean#⚙

wide?_Boolean#⚙

wide_cjk?_Boolean#⚙

xdigit?_Boolean#⚙

zero_width?_Boolean#⚙

normalized?(mode_{#to_sym} = `:default`)_Boolean#⚙

`==`(other_{U::String, #to_str})_Boolean#⚙

`===`(other_{U::String, #to_str})_Boolean#⚙

`=~`(other_{Regexp, #=~})_Numeric^?#⚙

match(pattern_{Regexp, #to_str}, index_{#to_int} = `0`)_MatchData^?#⚙

match(pattern_{Regexp, #to_str}, index_{#to_int} = `0`){ |matchdata_MatchData| … }_Object^?#⚙

empty?_Boolean#⚙

end_with?(*suffixes_Array)_Boolean#⚙

eql?(other_U::String)_Boolean#⚙

include?(substring_{#to_str})_Boolean#⚙

index(pattern_{Regexp, #to_str}, offset_{#to_int} = `0`)_Integer^?#⚙

rindex(pattern_{Regexp, #to_str}, offset_{#to_int} = `-1`)_Integer^?#⚙

start_with?(*prefixes_Array)_Boolean#⚙

`<=>`(other_{U::String, #to_str}, locale_{#to_str} = `ENV['LC_COLLATE']`)_Fixnum#⚙

casecmp(other_{U::String, #to_str}, locale_{#to_str} = `ENV['LC_COLLATE']`)_Fixnum#⚙

collation_key(locale)_U::String#⚙

canonical_combining_class_Fixnum#⚙

general_category_Symbol#⚙

grapheme_break_Symbol#⚙

line_break_Symbol#⚙

script_Symbol#⚙

word_break_Symbol#⚙

bytesize_Integer#⚙

length_Integer#⚙

size_Integer#⚙

width_Integer#⚙

each_byte{ |byte_Fixnum| … }_self#⚙

each_byte_Enumerator#⚙

bytes_{Array<Fixnum>}#⚙

each_char{ |char_U::String| … }_self#⚙

each_char_Enumerator#⚙

chars_{Array<U::String>}#⚙

each_codepoint{ |codepoint_Integer| … }_self#⚙

each_codepoint_Enumerator#⚙

codepoints_{Array<Integer>}#⚙

each_grapheme_cluster{ |cluster_U::String| … }_self#⚙

each_grapheme_cluster_Enumerator#⚙

grapheme_clusters{ |cluster_U::String| … }_self#⚙

grapheme_clusters_Enumerator#⚙

each_line(separator_{U::String, #to_str} = `$/`){ |lp_{U::String, self}| … }_self#⚙

each_line(separator_{U::String, #to_str} = `$/`)_Enumerator#⚙

lines(separator_{U::String, #to_str} = `$/`)_{Array<U::String>}#⚙

each_word{ |word_U::String| … }_self#⚙

each_word_Enumerator#⚙

words{ |word_U::String| … }_self#⚙

words_Enumerator#⚙

`[]`(index_{#to_int})_U::String^?#⚙

`[]`(index_{#to_int}, length_{#to_int})_U::String^?#⚙

`[]`(range_Range)_U::String^?#⚙

`[]`(regexp_Regexp, reference_{#to_int, #to_str, Symbol} = `0`)_U::String^?#⚙

`[]`(string_{U::String, ::String})_U::String^?#⚙

`[]`(object_Object)_nil#⚙