Unicode

Type: Text encoding standard
Developer: Unicode Consortium
First Release: 1991
Open Format? Yes
Free Format? Yes

Unicode is a standard character set: an assignment of numeric values to characters. A huge number of characters from various writing systems (modern or ancient), as well as special symbols of many types, are each given a number. On Blu-ray, Unicode is used for text-based subtitles and text fonts. It is also the character set that Java uses for source code and strings.

Unicode is an international standard and is the dominant text encoding format. It was first published in 1991. Subsequent revisions have continually expanded its character repertoire. Unicode was developed in reaction to the unwieldy multiplicity of character sets that had arisen to include various subsets of the many characters left out of the English-centric ASCII set.

The standard way to denote a Unicode code point is to prefix it with "U+" and write the number in hexadecimal, with a minimum of four hex digits. For example, code point 42 is written as U+002A, and code point 1,114,109 is U+10FFFD. Code points are the numbers assigned by the Unicode Consortium to every character in every writing system. A code point is therefore written as U+ followed by four to six hexadecimal digits (0–9 and A–F). Another example: the interrobang "‽" is U+203D.
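The notation rule can be sketched in a few lines of Java. This is only an illustration: the class and method names (CodePointNotation, format, parse) are made up for this article, and the code targets desktop Java SE rather than the older profile that BD-J players run.

```java
import java.util.Locale;

final class CodePointNotation {
    // Formats a code point as "U+" followed by at least four uppercase hex digits.
    static String format(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        StringBuilder hex = new StringBuilder(
                Integer.toHexString(codePoint).toUpperCase(Locale.ROOT));
        while (hex.length() < 4) {
            hex.insert(0, '0'); // pad short values such as 2A to 002A
        }
        return "U+" + hex;
    }

    // Parses a string such as "U+203D" back into its numeric value.
    static int parse(String notation) {
        if (!notation.startsWith("U+")) {
            throw new IllegalArgumentException("expected a U+ prefix: " + notation);
        }
        return Integer.parseInt(notation.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(format(42));       // U+002A
        System.out.println(format(1114109));  // U+10FFFD
        System.out.println(parse("U+203D"));  // 8253
    }
}
```

Values above U+FFFF simply produce five or six digits, because the padding loop only enforces the four-digit minimum.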

Each code point is also assigned a human-readable name, which may be written after the "U+" notation. For example, you might see "U+002A ASTERISK" or "U+03A9 GREEK CAPITAL LETTER OMEGA".
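On desktop Java (version 7 or later), the standard library can look up these names. The sketch below assumes a modern JDK, whose character data is far newer than Unicode 2.0, so it will also return names for characters a Blu-ray player does not know. The class name CodePointNames is hypothetical.

```java
final class CodePointNames {
    // Builds a label such as "U+002A ASTERISK" from a code point.
    static String describe(int codePoint) {
        // Character.getName returns null when the JDK has no assigned character here.
        String name = Character.getName(codePoint);
        String label = String.format("U+%04X", codePoint);
        return name == null ? label : label + " " + name;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x002A)); // U+002A ASTERISK
        System.out.println(describe(0x03A9)); // U+03A9 GREEK CAPITAL LETTER OMEGA
    }
}
```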

In the Blu-ray technical specification, all text encoding (both for coding and for displaying text) uses Unicode 2.0 in the UTF-8 and UTF-16BE encoding forms; Unicode 2.0 is aligned with ISO/IEC 10646-1:1993. Released in July 1996, Unicode 2.0 was a significant update to the standard. It contains 38,885 assigned characters (excluding private-use characters, control characters, noncharacters, and surrogate code points), including Basic Latin, Cyrillic, Greek, and CJK characters such as kanji. The characters are organized into blocks by script, symbol type, or usage, and they reflect the state of text encoding standardization at that time. This version of Unicode may be outdated by today's standards, but it remains the relevant one for BD development.
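To make the two encoding forms concrete, the following sketch encodes the same text as UTF-8 and as UTF-16BE and prints the resulting bytes. It uses the desktop Java SE class StandardCharsets; the helper names (BluRayTextEncoding, toUtf8, toUtf16Be, hex) are illustrative and are not taken from the Blu-ray specification.

```java
import java.nio.charset.StandardCharsets;

final class BluRayTextEncoding {
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-16BE stores each 16-bit unit high byte first and writes no byte order mark.
    static byte[] toUtf16Be(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    // Renders bytes as space-separated uppercase hex pairs for inspection.
    static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String interrobang = "\u203D";
        System.out.println(hex(toUtf8(interrobang)));    // E2 80 BD
        System.out.println(hex(toUtf16Be(interrobang))); // 20 3D
    }
}
```

A BMP character takes one to three bytes in UTF-8 but always two bytes in UTF-16BE, which is why the second line is simply the code point's own hex digits.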

Unicode 2.0 was likely chosen because it was a well-established standard by the early 2000s. During Blu-ray’s development (starting around 2000, with specs finalized by 2004–2006), the BDA likely prioritized a mature standard with broad compatibility over a newer, less-tested version like Unicode 4.1. Newer Unicode versions often introduce additional characters and complexity, which could require more extensive validation and risk introducing bugs or incompatibilities in a consumer product aiming for a global launch. Although “outdated” by 2005, Unicode 2.0 was a proven choice that met Blu-ray’s needs without overcomplicating the specification.

In BD applications, Unicode can be used with bitmap fonts (PNG) or vector-based fonts (OpenType). Most BD-J titles use bitmap fonts, together with Java classes such as StringBuffer, BufferedReader, and FileInputStream, to load and display the bitmap text and its code points. Vector-based fonts are rarely used; when a BD-J title does use OpenType, the fonts and text are stored in the Blu-ray player's 4 MB text cache and rendered by the player's font rendering engine. A BD-J application should include an OpenType font file that is Unicode 2.0 compatible; if it does not, the player will use its own default font. However, the majority of players may not include fonts of their own, so it is best to include font files.
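As a rough illustration of the bitmap-font approach, the sketch below maps a code point to the rectangle of its glyph inside a PNG sprite sheet. The atlas layout (a contiguous run of code points, fixed-size cells, filled row by row) is an assumption made for this example, as are all names; real BD-J titles may organize their bitmap fonts differently.

```java
final class BitmapFontAtlas {
    private final int firstCodePoint; // code point drawn in the top-left cell
    private final int glyphCount;     // number of consecutive code points in the sheet
    private final int columns;        // cells per row
    private final int cellWidth;
    private final int cellHeight;

    BitmapFontAtlas(int firstCodePoint, int glyphCount, int columns,
                    int cellWidth, int cellHeight) {
        this.firstCodePoint = firstCodePoint;
        this.glyphCount = glyphCount;
        this.columns = columns;
        this.cellWidth = cellWidth;
        this.cellHeight = cellHeight;
    }

    // Returns {x, y, width, height} of the glyph cell, or null if the sheet has no glyph.
    int[] cellFor(int codePoint) {
        int index = codePoint - firstCodePoint;
        if (index < 0 || index >= glyphCount) {
            return null;
        }
        int x = (index % columns) * cellWidth;
        int y = (index / columns) * cellHeight;
        return new int[] { x, y, cellWidth, cellHeight };
    }

    public static void main(String[] args) {
        // Hypothetical sheet: printable ASCII (U+0020 to U+007E), 16 cells per row, 8x16 pixels.
        BitmapFontAtlas ascii = new BitmapFontAtlas(0x20, 95, 16, 8, 16);
        int[] cell = ascii.cellFor('A');
        System.out.println(cell[0] + "," + cell[1]); // 8,32
    }
}
```

A null result is the cue to fall back to a placeholder glyph, since the sheet has nothing to draw for that code point.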

Unicode 2.0 defines a total of 38,885 assigned characters across its code points from U+0000 to U+FFFF (the Basic Multilingual Plane, BMP). The exact count comes from tallying each named entry across the 55 blocks.

List of Unicode Blocks in Unicode 2.0
 
Range | Block Name | Assigned Characters | Notes
U+0000–U+007F | Basic Latin | 128 | ASCII characters (letters, digits, punctuation, controls).
U+0080–U+00FF | Latin-1 Supplement | 128 | Additional Latin characters, symbols, and controls (e.g., £, ©).
U+0100–U+017F | Latin Extended-A | 128 | Extended Latin for European languages (e.g., Œ, Š).
U+0180–U+024F | Latin Extended-B | 113 | More Latin letters for African, Native American languages (e.g., Ɓ, ƒ).
U+0250–U+02AF | IPA Extensions | 89 | Phonetic symbols for International Phonetic Alphabet (e.g., ɐ, ʃ).
U+02B0–U+02FF | Spacing Modifier Letters | 80 | Modifiers for phonetics/typography (e.g., ʰ, ː).
U+0300–U+036F | Combining Diacritical Marks | 112 | Marks combining with base characters (e.g., ◌̀, ◌̈).
U+0370–U+03FF | Greek | 135 | Greek letters and symbols (e.g., α, Ω).
U+0400–U+04FF | Cyrillic | 256 | Cyrillic script for Slavic languages (e.g., А, Я).
U+0530–U+058F | Armenian | 85 | Armenian script (e.g., Ա, Ֆ).
U+0590–U+05FF | Hebrew | 87 | Hebrew script (e.g., א, ת).
U+0600–U+06FF | Arabic | 237 | Arabic script and symbols (e.g., ا, ى).
U+0900–U+097F | Devanagari | 114 | Script for Hindi, Sanskrit (e.g., अ, ह).
U+0980–U+09FF | Bengali | 92 | Bengali script (e.g., অ, হ).
U+0A00–U+0A7F | Gurmukhi | 79 | Script for Punjabi (e.g., ਅ, ਹ).
U+0A80–U+0AFF | Gujarati | 83 | Gujarati script (e.g., અ, હ).
U+0B00–U+0B7F | Oriya | 81 | Oriya script (e.g., ଅ, ହ).
U+0B80–U+0BFF | Tamil | 72 | Tamil script (e.g., அ, ஹ).
U+0C00–U+0C7F | Telugu | 88 | Telugu script (e.g., అ, హ).
U+0C80–U+0CFF | Kannada | 86 | Kannada script (e.g., ಅ, ಹ).
U+0D00–U+0D7F | Malayalam | 89 | Malayalam script (e.g., അ, ഹ).
U+0E00–U+0E7F | Thai | 87 | Thai script (e.g., ก, ๏).
U+0E80–U+0EFF | Lao | 65 | Lao script (e.g., ກ, ຳ).
U+0F00–U+0FFF | Tibetan | 168 | Tibetan script (e.g., ཀ, ྼ).
U+10A0–U+10FF | Georgian | 83 | Georgian script (e.g., Ⴀ, ჶ).
U+1100–U+11FF | Hangul Jamo | 240 | Korean Hangul components (e.g., ᄀ, ᇿ).
U+1E00–U+1EFF | Latin Extended Additional | 185 | More Latin extensions (e.g., Ḁ, ỿ).
U+1F00–U+1FFF | Greek Extended | 233 | Precomposed Greek with diacritics (e.g., ἀ, ῼ).
U+2000–U+206F | General Punctuation | 71 | Punctuation marks (e.g., —, ‘).
U+2070–U+209F | Superscripts and Subscripts | 34 | Superscript/subscript digits and letters (e.g., ⁰, ₓ).
U+20A0–U+20CF | Currency Symbols | 12 | Currency signs (e.g., ₧).
U+20D0–U+20FF | Combining Diacritical Marks for Symbols | 33 | Combining marks for symbols (e.g., ◌⃐, ◌⃡).
U+2100–U+214F | Letterlike Symbols | 55 | Symbols resembling letters (e.g., ℂ, ℏ).
U+2150–U+218F | Number Forms | 50 | Fractions, Roman numerals (e.g., ½, Ⅻ).
U+2190–U+21FF | Arrows | 91 | Arrow symbols (e.g., ←, ➡).
U+2200–U+22FF | Mathematical Operators | 256 | Math symbols (e.g., ∀, √).
U+2300–U+23FF | Miscellaneous Technical | 126 | Technical symbols (e.g., ⌈, ⏰).
U+2400–U+243F | Control Pictures | 39 | Graphical representations of control codes (e.g., ␀, ␣).
U+2440–U+245F | Optical Character Recognition | 11 | OCR-specific symbols (e.g., ⑀, ⑊).
U+2460–U+24FF | Enclosed Alphanumerics | 160 | Circled numbers/letters (e.g., ①, ⓿).
U+2500–U+257F | Box Drawing | 128 | Line-drawing characters (e.g., ─, ┼).
U+2580–U+259F | Block Elements | 32 | Block graphic characters (e.g., ▀, █).
U+25A0–U+25FF | Geometric Shapes | 96 | Shapes (e.g., ■, ◯).
U+2600–U+26FF | Miscellaneous Symbols | 171 | Various symbols (e.g., ☀).
U+2700–U+27BF | Dingbats | 174 | Decorative symbols (e.g., ✁, ❏).
U+3000–U+303F | CJK Symbols and Punctuation | 63 | CJK-specific punctuation (e.g., 、, 〿).
U+3040–U+309F | Hiragana | 93 | Japanese Hiragana (e.g., ぁ, ん).
U+30A0–U+30FF | Katakana | 96 | Japanese Katakana (e.g., ァ, ヿ).
U+3100–U+312F | Bopomofo | 27 | Chinese phonetic script (e.g., ㄅ, ㄩ).
U+3130–U+318F | Hangul Compatibility Jamo | 94 | Legacy Korean Jamo (e.g., ㄱ, ㅿ).
U+3200–U+32FF | Enclosed CJK Letters and Months | 191 | Enclosed CJK characters (e.g., ㈀, ㋿).
U+3300–U+33FF | CJK Compatibility | 256 | Compatibility CJK variants (e.g., ㌀, ㏿).
U+4E00–U+9FFF | CJK Unified Ideographs | 20,902 | Core Chinese/Japanese/Korean characters (e.g., 一, 龥).
U+AC00–U+D7A3 | Hangul Syllables | 11,172 | Precomposed Korean syllables (e.g., 가, 힣).
U+E000–U+F8FF | Private Use Area | 0 (reserved) | No predefined characters; for custom use.
U+F900–U+FAFF | CJK Compatibility Ideographs | 302 | Compatibility variants of CJK ideographs (e.g., 豈).
U+FB00–U+FB4F | Alphabetic Presentation Forms | 58 | Precomposed ligatures (e.g., ﬀ, ﬅ).
U+FB50–U+FDFF | Arabic Presentation Forms-A | 611 | Arabic contextual forms (e.g., ﭐ, ﷿).
U+FE20–U+FE2F | Combining Half Marks | 16 | Half-width combining marks (e.g., ◌︠, ◌︯).
U+FE30–U+FE4F | CJK Compatibility Forms | 32 | Vertical CJK punctuation variants (e.g., ︰, ︴).
U+FE50–U+FE6F | Small Form Variants | 26 | Small CJK punctuation (e.g., ﹐, ﹯).
U+FE70–U+FEFF | Arabic Presentation Forms-B | 141 | More Arabic forms, includes U+FEFF (BOM) (e.g., ﹰ, zero-width no-break space).
U+FF00–U+FFEF | Halfwidth and Fullwidth Forms | 225 | Fullwidth ASCII, halfwidth Katakana/Hangul (e.g., ！, ｦ).
U+FFF0–U+FFFF | Specials | 6 | Special-purpose characters (e.g., �).
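To find out which block a given character falls in, desktop Java offers Character.UnicodeBlock.of. The sketch below is only a convenience for exploring the table: the JDK's block data follows a much newer Unicode version than 2.0, so it also knows blocks that did not exist in 1996. The class name BlockLookup is hypothetical.

```java
final class BlockLookup {
    // Returns the JDK's block for a code point, or null if it belongs to no block.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x263A, 0x4E00, 0xAC00 };
        for (int codePoint : samples) {
            System.out.println(String.format("U+%04X", codePoint) + " -> " + blockOf(codePoint));
        }
    }
}
```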


Full lists of supported Characters for Unicode 2.0


Missing Characters

Since Unicode 2.0 was released in 1996, it is missing several key symbols that emerged later, most notably the Euro sign (€), introduced in 1999 and added in Unicode 2.1 (U+20AC). Other absences include modern currency symbols like the Indian Rupee (₹, U+20B9, Unicode 6.0), extensive emoji sets (e.g. Unicode 6.0+), and newer scripts like Cherokee (added in Unicode 3.0). These gaps reflect Unicode 2.0’s 1996 cutoff and its limit of 38,885 characters in the BMP. The Private Use Area (PUA, U+E000–U+F8FF), with 6,400 unassigned code points, offers a workaround: developers can assign custom glyphs (like the Euro sign or proprietary icons) to PUA slots and pair them with a custom font.
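The PUA workaround could look roughly like the sketch below, which reserves PUA slots for missing characters and rewrites text before it is drawn with a custom font. The slot assignments are arbitrary, the class PrivateUseRemapper is invented for this example, and the matching font (with the Euro glyph placed at the chosen slot) is assumed to exist.

```java
import java.util.HashMap;
import java.util.Map;

final class PrivateUseRemapper {
    private final Map<Integer, Integer> toPrivateUse = new HashMap<>();
    private int nextSlot = 0xE000; // first code point of the Private Use Area

    // Reserves a PUA slot for a character the target repertoire lacks.
    int assign(int codePoint) {
        Integer existing = toPrivateUse.get(codePoint);
        if (existing != null) {
            return existing;
        }
        if (nextSlot > 0xF8FF) {
            throw new IllegalStateException("Private Use Area exhausted");
        }
        toPrivateUse.put(codePoint, nextSlot);
        return nextSlot++;
    }

    // Replaces every assigned character with its PUA slot; other text passes through.
    String remap(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            Integer slot = toPrivateUse.get(codePoint);
            out.appendCodePoint(slot != null ? slot : codePoint);
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PrivateUseRemapper remapper = new PrivateUseRemapper();
        remapper.assign(0x20AC); // Euro sign goes to U+E000 in this sketch
        String remapped = remapper.remap("5 \u20AC");
        System.out.println(Integer.toHexString(remapped.codePointAt(2))); // e000
    }
}
```

The custom font then needs a Euro glyph at U+E000, and the same mapping has to be applied to every string that reaches the renderer.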

If a character from a newer Unicode version is used, it will appear as a replacement character "�", an empty box "□", or nothing at all.
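One way to catch such characters before they reach the player is to scan the text against a list of permitted ranges. The sketch below uses only three block ranges copied from the table in this article as a sample. It is not a complete Unicode 2.0 repertoire, and block membership is only an approximation, because blocks also contain unassigned code points. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class RepertoireCheck {
    // Sample inclusive ranges: Basic Latin, Latin-1 Supplement, Miscellaneous Symbols.
    static final int[][] SAMPLE_RANGES = {
        { 0x0000, 0x007F },
        { 0x0080, 0x00FF },
        { 0x2600, 0x26FF },
    };

    // Returns the code points in the text that fall outside every given range.
    static List<Integer> findUncovered(String text, int[][] ranges) {
        List<Integer> uncovered = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!inAnyRange(codePoint, ranges)) {
                uncovered.add(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return uncovered;
    }

    private static boolean inAnyRange(int codePoint, int[][] ranges) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "A" and the smiley U+263A are covered; the Euro sign U+20AC is not.
        for (int codePoint : findUncovered("A\u20AC\u263A", SAMPLE_RANGES)) {
            System.out.println(String.format("U+%04X", codePoint)); // U+20AC
        }
    }
}
```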


Emoji Support

Because Blu-ray uses Unicode 2.0, it has only limited emoji-like support: basic symbols such as ☺ (U+263A) or ♥ (U+2665), found in the Arrows, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Geometric Shapes, and CJK Symbols and Punctuation blocks. The player does not display rasterized bitmap images or layered graphics by default; the developer has to do that manually in the BD-J application (using small PNG graphics). By default, the emoji-like characters are rendered as scalable vector-based symbols from a font (e.g., an OpenType font such as Noto Emoji).

List of Emojis and Unique Miscellaneous Symbols
Row | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
2 | ‼ 203C | ↔ 2194 | ↕ 2195 | ↖ 2196 | ↗ 2197 | ↘ 2198 | ↙ 2199 | ↩ 21A9 | ↪ 21AA | ⌚ 231A
3 | ⌛ 231B | ⌨ 2328 | 2397 | 2398 | 2399 | 239A | Ⓜ 24C2 | ▶ 25B6 | ◀ 25C0 | ◻ 25FB
4 | ◼ 25FC | ◽ 25FD | ◾ 25FE | ☀ 2600 | ☁ 2601 | ☂ 2602 | ☃ 2603 | ☄ 2604 | 2605 | 2606
5 | ☎ 260E | 260F | 2610 | 2612 | 261A | 261B | 261C | ☝ 261D | 261E | 261F
6 | ☠ 2620 | 2621 | ☢ 2622 | ☣ 2623 | ☦ 2626 | ☪ 262A | ☮ 262E | ☯ 262F | ☸ 2638 | ☹ 2639
7 | ☺ 263A | 263B | 263C | 263D | 263E | ♀ 2640 | ♂ 2642 | ♈ 2648 | ♉ 2649 | ♊ 264A
8 | ♋ 264B | ♌ 264C | ♍ 264D | ♎ 264E | ♏ 264F | ♐ 2650 | ♑ 2651 | ♒ 2652 | ♓ 2653 | 2654
9 | 2655 | 2656 | 2657 | 2658 | 2659 | 265A | 265B | 265C | 265D | 265E
A | 265F | ♠ 2660 | 2661 | 2662 | ♣ 2663 | 2664 | ♥ 2665 | ♦ 2666 | 2667 | ♨ 2668
B | 2669 | 266A | 266B | 266C | 266D | 266E | 266F | 2701 | ✂ 2702 | 2703
C | 2704 | 2706 | 2707 | ✈ 2708 | ✉ 2709 | ✌ 270C | ✍ 270D | ✏ 270F | ✒ 2712 | ✔ 2714
D | ✖ 2716 | 271A | ✝ 271D | 271E | 2720 | ✡ 2721 | 2724 | 2727 | 2729 | 272A
E | ✳ 2733 | 2740 | ❄ 2744 | ❇ 2747 | 2756 | ❣ 2763 | ❤ 2764 | 2765 | 2766 | 2767
F | ➡ 27A1 | ➿ 27BF | 3004 | 3020 | 〰 3030 | 3036 | 327F | ㊗ 3297 | ㊙ 3299 |

This is not a full list, but these are the top suggestions for text in BD applications such as subtitles or video games.
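Several entries in the list above are given only as a hexadecimal code point. A short helper can turn such a value into the actual character for use in a string. The sketch below targets desktop Java SE (Character.toChars dates from Java 5 and may not exist in the older BD-J profile), and the class name TableCodePoints is hypothetical.

```java
final class TableCodePoints {
    // Converts a hex code point such as "263A" into a one-character string.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String smiley = fromHex("263A");
        String heart = fromHex("2665");
        // Prints the two symbols if the console font and encoding support them.
        System.out.println(smiley + " " + heart);
    }
}
```

In Java source the same characters can also be written directly as escapes, for example "\u263A".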





Code charts and references


 

Author(s) : Æ Firestone

on Sunday, August 25, 2024