Blu-ray Wiki: Unicode

**Unicode**

Type	Text encoding standard
Developer	Unicode Consortium
First Release	1991
Open Format?	Yes
Free Format?	Yes

Unicode is a standard character set: an assignment of numeric values to characters. A huge number of characters from various writing systems (modern or ancient), as well as special symbols of many types, are each given a number. On Blu-ray, Unicode is used for text-based subtitles and text fonts. It is also the character set used by the Java language, in which BD-J applications are written.

Unicode is an international standard and is the dominant text encoding format. It was first published in 1991. Subsequent revisions have continually expanded its character repertoire. Unicode was developed in reaction to the unwieldy multiplicity of character sets that had arisen to include various subsets of the many characters left out of the English-centric ASCII set.

The standard way to denote a Unicode code point is to prefix it with "U+" and write the number in hexadecimal, with a minimum of four hex digits (the highest code points need up to six). Code points are the numbers assigned by the Unicode Consortium to every character in every writing system. For example, code point 42 is written as U+002A, code point 1,114,109 is U+10FFFD, and the interrobang "‽" is U+203D.
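The notation described above can be produced in a few lines of Java. This is only a minimal sketch: the class name `CodePointNotation` and the method `format` are made up for illustration and are not part of any Blu-ray API.

```java
class CodePointNotation {
    /** Formats a code point as "U+" plus at least four uppercase hex digits. */
    static String format(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        // %04X pads to four digits but does not truncate longer values.
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(format(42));                        // U+002A
        System.out.println(format(1114109));                   // U+10FFFD
        System.out.println(format("\u203D".codePointAt(0)));   // U+203D (interrobang)
    }
}
```

The interrobang is written as the escape `\u203D` so that the example does not depend on the encoding of the source file.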

Each code point is also assigned a human-readable name, which may be written after the "U+" notation. For example, you might see "U+002A ASTERISK" or "U+03A9 GREEK CAPITAL LETTER OMEGA".
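On a desktop JDK (Java 7 or later), the name of a code point can be looked up with `Character.getName`. The sketch below is an illustration only: the class `CodePointName` is hypothetical, the names come from the Unicode version bundled with the JDK rather than from Unicode 2.0, and this method is probably not available in the older Java profile that BD-J players run.

```java
class CodePointName {
    /** Returns e.g. "U+002A ASTERISK"; falls back to the bare notation if no name is known. */
    static String describe(int codePoint) {
        String notation = String.format("U+%04X", codePoint);
        String name = Character.getName(codePoint); // null for unassigned code points
        return name == null ? notation : notation + " " + name;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x002A)); // U+002A ASTERISK
        System.out.println(describe(0x03A9)); // U+03A9 GREEK CAPITAL LETTER OMEGA
    }
}
```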

In the Blu-ray technical specification, all text encoding (both for coding and displaying text) uses Unicode 2.0, encoded as UTF-8 or UTF-16BE. Unicode 2.0, released in July 1996, is aligned with ISO/IEC 10646-1:1993 and was a significant update to the Unicode standard. It contains 38,885 assigned characters (excluding private-use characters, control characters, noncharacters, and surrogate code points), including Basic Latin, Cyrillic, Greek, and CJK ideographs (such as Kanji). These characters are organized into blocks by script, symbol type, or usage, and the blocks reflect the state of text encoding standardization at that time. This version of Unicode is outdated by today's standards, but it remains relevant for BD development.
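To make the two encodings concrete, the following sketch prints the bytes that UTF-8 and UTF-16BE produce for the same text. It assumes a desktop JDK (Java 7 or later, for `StandardCharsets`); the class `EncodingDemo` and its helper methods are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Renders bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String utf16beHex(String s) {
        // UTF-16BE writes the high byte first and emits no byte order mark.
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("A"));           // 41
        System.out.println(utf16beHex("A"));        // 00 41
        System.out.println(utf8Hex("\u203D"));      // E2 80 BD
        System.out.println(utf16beHex("\u203D"));   // 20 3D
    }
}
```

ASCII text takes one byte per character in UTF-8 and two in UTF-16BE, while most other BMP characters, like U+203D here, take three bytes in UTF-8 and two in UTF-16BE.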

Unicode 2.0 was probably chosen because it was a well-established standard by the early 2000s. During Blu-ray’s development (starting around 2000, with specs finalized by 2004–2006), the BDA likely prioritized a mature standard with broad compatibility over a newer, less-tested version like Unicode 4.1. Newer Unicode versions often introduce additional characters and complexity, which could require more extensive validation and risk introducing bugs or incompatibilities in a consumer product aiming for a global launch. While “outdated” by 2005, Unicode 2.0 was a proven choice that met Blu-ray’s needs without overcomplicating the specification.

In BD applications, Unicode can be used with bitmap fonts (PNG) or vector-based fonts (OpenType). Most BD-J titles use bitmap fonts, together with Java classes like StringBuffer, BufferedReader, and FileInputStream, to read the text and its code points and display them as bitmap glyphs. Vector-based OpenType fonts are rarely used in BD-J, but when they are, the fonts and text are stored in the Blu-ray's 4 MB text cache and drawn by the BD player's font rendering engine. A BD-J application should include an OpenType font file that is Unicode 2.0 compatible; if it does not, the player will use its own default font. However, many players may not ship with fonts of their own, so it is best to include the font files.
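The bitmap-font approach can be sketched with the Java classes named above. This is an illustration under assumptions, not an excerpt from a real BD-J title: the class `SubtitleTextLoader`, both of its methods, and the glyph sheet layout (glyphs stored left to right, top to bottom, in a fixed number of columns, starting at a given first character) are all hypothetical. A `ByteArrayInputStream` stands in for the `FileInputStream` a disc application would open.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class SubtitleTextLoader {
    /** Reads UTF-8 text from a stream and returns it with lines joined by '\n'. */
    static String readUtf8(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuffer text = new StringBuffer();
        try {
            boolean first = true;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!first) {
                    text.append('\n');
                }
                text.append(line);
                first = false;
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    /** Returns {column, row} of a character's cell in the assumed glyph sheet. */
    static int[] glyphCell(char c, char firstChar, int columns) {
        int index = c - firstChar;
        if (index < 0) {
            throw new IllegalArgumentException("Character is not in the glyph sheet");
        }
        return new int[] { index % columns, index / columns };
    }

    public static void main(String[] args) throws IOException {
        // "Hi" followed by U+203D, encoded as UTF-8.
        byte[] sample = { 'H', 'i', (byte) 0xE2, (byte) 0x80, (byte) 0xBD };
        String text = readUtf8(new ByteArrayInputStream(sample));
        System.out.println(text.length()); // 3 characters decoded from 5 bytes

        int[] cell = glyphCell('A', ' ', 16); // sheet assumed to start at U+0020
        System.out.println(cell[0] + "," + cell[1]); // 1,2
    }
}
```

An application would multiply the cell coordinates by the glyph width and height to find the rectangle of the PNG to draw.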

Unicode 2.0 defines a total of 38,885 assigned characters across its code points from U+0000 to U+FFFF (the Basic Multilingual Plane, BMP). The exact count comes from tallying each named entry across the 55 blocks.

List of Unicode Blocks in Unicode 2.0

Range	Block Name	Assigned Characters	Notes
U+0000–U+007F	Basic Latin	128	ASCII characters (letters, digits, punctuation, controls).
U+0080–U+00FF	Latin-1 Supplement	128	Additional Latin characters, symbols, and controls (e.g., £, ©).
U+0100–U+017F	Latin Extended-A	128	Extended Latin for European languages (e.g., Œ, Š).
U+0180–U+024F	Latin Extended-B	113	More Latin letters for African, Native American languages (e.g., Ɓ, ƒ).
U+0250–U+02AF	IPA Extensions	89	Phonetic symbols for International Phonetic Alphabet (e.g., ɐ, ʃ).
U+02B0–U+02FF	Spacing Modifier Letters	80	Modifiers for phonetics/typography (e.g., ʰ, ː).
U+0300–U+036F	Combining Diacritical Marks	112	Marks combining with base characters (e.g., ◌̀, ◌̈).
U+0370–U+03FF	Greek	135	Greek letters and symbols (e.g., α, Ω).
U+0400–U+04FF	Cyrillic	256	Cyrillic script for Slavic languages (e.g., А, Я).
U+0530–U+058F	Armenian	85	Armenian script (e.g., Ա, Ֆ).
U+0590–U+05FF	Hebrew	87	Hebrew script (e.g., א, ת).
U+0600–U+06FF	Arabic	237	Arabic script and symbols (e.g., ا, ى).
U+0900–U+097F	Devanagari	114	Script for Hindi, Sanskrit (e.g., अ, ह).
U+0980–U+09FF	Bengali	92	Bengali script (e.g., অ, হ).
U+0A00–U+0A7F	Gurmukhi	79	Script for Punjabi (e.g., ਅ, ਹ).
U+0A80–U+0AFF	Gujarati	83	Gujarati script (e.g., અ, હ).
U+0B00–U+0B7F	Oriya	81	Oriya script (e.g., ଅ, ହ).
U+0B80–U+0BFF	Tamil	72	Tamil script (e.g., அ, ஹ).
U+0C00–U+0C7F	Telugu	88	Telugu script (e.g., అ, హ).
U+0C80–U+0CFF	Kannada	86	Kannada script (e.g., ಅ, ಹ).
U+0D00–U+0D7F	Malayalam	89	Malayalam script (e.g., അ, ഹ).
U+0E00–U+0E7F	Thai	87	Thai script (e.g., ก, ๏).
U+0E80–U+0EFF	Lao	65	Lao script (e.g., ກ, ຳ).
U+0F00–U+0FFF	Tibetan	168	Tibetan script (e.g., ཀ, ྼ).
U+10A0–U+10FF	Georgian	83	Georgian script (e.g., Ⴀ, ჶ).
U+1100–U+11FF	Hangul Jamo	240	Korean Hangul components (e.g., ᄀ, ᇿ).
U+1E00–U+1EFF	Latin Extended Additional	185	More Latin extensions (e.g., Ḁ, ỿ).
U+1F00–U+1FFF	Greek Extended	233	Precomposed Greek with diacritics (e.g., ἀ, ῼ).
U+2000–U+206F	General Punctuation	71	Punctuation marks (e.g., —, ‘).
U+2070–U+209F	Superscripts and Subscripts	34	Superscript/subscript digits and letters (e.g., ⁰, ₓ).
U+20A0–U+20CF	Currency Symbols	12	Currency signs (e.g., ₧).
U+20D0–U+20FF	Combining Diacritical Marks for Symbols	33	Combining marks for symbols (e.g., ◌⃐, ◌⃡).
U+2100–U+214F	Letterlike Symbols	55	Symbols resembling letters (e.g., ℂ, ℏ).
U+2150–U+218F	Number Forms	50	Fractions, Roman numerals (e.g., ½, Ⅻ).
U+2190–U+21FF	Arrows	91	Arrow symbols (e.g., ←).
U+2200–U+22FF	Mathematical Operators	256	Math symbols (e.g., ∀, √).
U+2300–U+23FF	Miscellaneous Technical	126	Technical symbols (e.g., ⌈).
U+2400–U+243F	Control Pictures	39	Graphical representations of control codes (e.g., ␀, ␣).
U+2440–U+245F	Optical Character Recognition	11	OCR-specific symbols (e.g., ⑀, ⑊).
U+2460–U+24FF	Enclosed Alphanumerics	160	Circled numbers/letters (e.g., ①, ⓿).
U+2500–U+257F	Box Drawing	128	Line-drawing characters (e.g., ─, ┼).
U+2580–U+259F	Block Elements	32	Block graphic characters (e.g., ▀, █).
U+25A0–U+25FF	Geometric Shapes	96	Shapes (e.g., ■, ◯).
U+2600–U+26FF	Miscellaneous Symbols	171	Various symbols (e.g., ★, ♪).
U+2700–U+27BF	Dingbats	174	Decorative symbols (e.g., ✁, ❏).
U+3000–U+303F	CJK Symbols and Punctuation	63	CJK-specific punctuation (e.g., 、, 〿).
U+3040–U+309F	Hiragana	93	Japanese Hiragana (e.g., ぁ, ん).
U+30A0–U+30FF	Katakana	96	Japanese Katakana (e.g., ァ, ヿ).
U+3100–U+312F	Bopomofo	27	Chinese phonetic script (e.g., ㄅ, ㄩ).
U+3130–U+318F	Hangul Compatibility Jamo	94	Legacy Korean Jamo (e.g., ㄱ, ㅿ).
U+3200–U+32FF	Enclosed CJK Letters and Months	191	Enclosed CJK characters (e.g., ㈀, ㋿).
U+3300–U+33FF	CJK Compatibility	256	Compatibility CJK variants (e.g., ㌀, ㏿).
U+4E00–U+9FFF	CJK Unified Ideographs	20,902	Core Chinese/Japanese/Korean characters (e.g., 一, 龥).
U+AC00–U+D7A3	Hangul Syllables	11,172	Precomposed Korean syllables (e.g., 가, 힣).
U+E000–U+F8FF	Private Use Area	0 (reserved)	No predefined characters; for custom use.
U+F900–U+FAFF	CJK Compatibility Ideographs	302	Compatibility variants of CJK ideographs (e.g., 豈, ﾾ).
U+FB00–U+FB4F	Alphabetic Presentation Forms	58	Precomposed ligatures (e.g., ﬀ, ﬅ).
U+FB50–U+FDFF	Arabic Presentation Forms-A	611	Arabic contextual forms (e.g., ﭐ, ﷿).
U+FE20–U+FE2F	Combining Half Marks	16	Half-width combining marks (e.g., ◌︠, ◌︯).
U+FE30–U+FE4F	CJK Compatibility Forms	32	Vertical CJK punctuation variants (e.g., ︰, ︴).
U+FE50–U+FE6F	Small Form Variants	26	Small CJK punctuation (e.g., ﹐, ﹯).
U+FE70–U+FEFF	Arabic Presentation Forms-B	141	More Arabic forms; includes U+FEFF, the zero-width no-break space used as the byte order mark (BOM) (e.g., ﹰ).
U+FF00–U+FFEF	Halfwidth and Fullwidth Forms	225	Fullwidth ASCII, halfwidth Katakana/Hangul (e.g., ！, ｦ).
U+FFF0–U+FFFF	Specials	6	Special-purpose characters (e.g., ￹, �).

Full lists of supported Characters for Unicode 2.0

Missing Characters

Since Unicode 2.0 was released in 1996, it's missing several key symbols that emerged later, most notably the Euro sign (€), introduced in 1999 and added in Unicode 2.1 (U+20AC). Other absences include modern currency symbols like the Indian Rupee (₹, U+20B9, Unicode 6.0), extensive emoji sets (e.g. Unicode 6.0+), and newer scripts like Cherokee (added in Unicode 3.0). These gaps reflect Unicode 2.0’s pre-1996 scope, limited to 38,885 characters in the BMP. The Private Use Area (PUA, U+E000–U+F8FF), with 6,400 unassigned code points, offers a workaround: developers can assign custom glyphs (such as the Euro sign or proprietary icons) to PUA slots and pair them with a custom font.
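The PUA workaround can be sketched as a small remapping step that runs before text is drawn. Everything here is an assumption made for illustration: the class `PuaRemapper`, its methods, and the policy of handing out slots in order from U+E000. The custom font would need a matching glyph in each assigned slot.

```java
import java.util.HashMap;
import java.util.Map;

class PuaRemapper {
    static final char PUA_START = '\uE000';
    static final char PUA_END = '\uF8FF';

    private final Map<Character, Character> slots = new HashMap<Character, Character>();
    private char next = PUA_START;

    /** Reserves a PUA slot for a character missing from Unicode 2.0 and returns it. */
    char assign(char missing) {
        Character existing = slots.get(missing);
        if (existing != null) {
            return existing;
        }
        if (next > PUA_END) {
            throw new IllegalStateException("Private Use Area exhausted");
        }
        char slot = next++;
        slots.put(missing, slot);
        return slot;
    }

    /** Replaces every character that has a slot with its PUA stand-in. */
    String remap(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character slot = slots.get(c);
            out.append(slot != null ? slot.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PuaRemapper remapper = new PuaRemapper();
        remapper.assign('\u20AC'); // Euro sign -> U+E000
        remapper.assign('\u20B9'); // Indian Rupee sign -> U+E001
        String out = remapper.remap("5 \u20AC");
        System.out.println(String.format("U+%04X", (int) out.charAt(2))); // U+E000
        System.out.println(PUA_END - PUA_START + 1); // 6400 slots in total
    }
}
```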

If a character from a newer Unicode version is used, it will appear as a replacement character "�", an empty box "□", or nothing at all.
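One way to get predictable output is to substitute U+FFFD in the application before the text reaches the player. The sketch below is a partial check under stated assumptions: the class `Unicode20Filter` is hypothetical, and its deny-list holds only the two BMP characters this article names as later additions (U+20AC and U+20B9). It also rejects every code point above U+FFFF, since all Unicode 2.0 characters are in the BMP. A complete check would need the full Unicode 2.0 character list.

```java
class Unicode20Filter {
    static final char REPLACEMENT = '\uFFFD';

    /** Deliberately incomplete deny-list of BMP characters added after Unicode 2.0. */
    static boolean isKnownMissing(int codePoint) {
        return codePoint == 0x20AC   // Euro sign, Unicode 2.1
            || codePoint == 0x20B9;  // Indian Rupee sign, Unicode 6.0
    }

    /** Replaces supplementary and known-missing code points with U+FFFD. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // skip both halves of a surrogate pair
            if (Character.isSupplementaryCodePoint(cp) || isKnownMissing(cp)) {
                out.append(REPLACEMENT);
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is written as the surrogate pair \uD83D\uDE00.
        String cleaned = sanitize("A\uD83D\uDE00B");
        System.out.println(cleaned.length());                    // 3
        System.out.println(cleaned.charAt(1) == REPLACEMENT);    // true
    }
}
```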

Emoji Support

Since Blu-ray uses Unicode 2.0, it has limited emoji-like support, offering only basic symbols such as ☺ (U+263A) or ♥ (U+2665) from the Arrows, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Geometric Shapes, and CJK Symbols and Punctuation blocks. The player does not display rasterized bitmap images or layered graphics by default; the developer has to implement that manually in the BD-J application (using small PNG graphics). By default, the emoji-like characters are drawn as scalable vector glyphs from a font (e.g., an OpenType font such as Noto Emoji).

List of Emojis and Unique Miscellaneous Symbols

	0	1	2	3	4	5	6	7	8	9
2	‼ 203C	↔ 2194	↕ 2195	↖ 2196	↗ 2197	↘ 2198	↙ 2199	↩ 21A9	↪ 21AA	⌚ 231A
3	⌛ 231B	⌨ 2328	⎗ 2397	⎘ 2398	⎙ 2399	⎚ 239A	Ⓜ 24C2	▶ 25B6	◀ 25C0	◻ 25FB
4	◼ 25FC	◽ 25FD	◾ 25FE	☀ 2600	☁ 2601	☂ 2602	☃ 2603	☄ 2604	★ 2605	☆ 2606
5	☎ 260E	☏ 260F	☐ 2610	☒ 2612	☚ 261A	☛ 261B	☜ 261C	☝ 261D	☞ 261E	☟ 261F
6	☠ 2620	☡ 2621	☢ 2622	☣ 2623	☦ 2626	☪ 262A	☮ 262E	☯ 262F	☸ 2638	☹ 2639
7	☺ 263A	☻ 263B	☼ 263C	☽ 263D	☾ 263E	♀ 2640	♂ 2642	♈ 2648	♉ 2649	♊ 264A
8	♋ 264B	♌ 264C	♍ 264D	♎ 264E	♏ 264F	♐ 2650	♑ 2651	♒ 2652	♓ 2653	♔ 2654
9	♕ 2655	♖ 2656	♗ 2657	♘ 2658	♙ 2659	♚ 265A	♛ 265B	♜ 265C	♝ 265D	♞ 265E
A	♟ 265F	♠ 2660	♡ 2661	♢ 2662	♣ 2663	♤ 2664	♥ 2665	♦ 2666	♧ 2667	♨ 2668
B	♩ 2669	♪ 266A	♫ 266B	♬ 266C	♭ 266D	♮ 266E	♯ 266F	✁ 2701	✂ 2702	✃ 2703
C	✄ 2704	✆ 2706	✇ 2707	✈ 2708	✉ 2709	✌ 270C	✍ 270D	✏ 270F	✒ 2712	✔ 2714
D	✖ 2716	✚ 271A	✝ 271D	✞ 271E	✠ 2720	✡ 2721	✤ 2724	✧ 2727	✩ 2729	✪ 272A
E	✳ 2733	❀ 2740	❄ 2744	❇ 2747	❖ 2756	❣ 2763	❤ 2764	❥ 2765	❦ 2766	❧ 2767
F	➡ 27A1	➿ 27BF	〄 3004	〠 3020	〰 3030	〶 3036	㉿ 327F	㊗ 3297	㊙ 3299

This is not a full list but these are top suggestions for text in BD applications such as subtitles or video games.
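The hexadecimal values in the list can be turned into Java text in two ways: as a `String` built at run time, or as a `\uXXXX` escape pasted into source code. The sketch below shows both. The class `SymbolFromHex` and its methods are hypothetical names, and the escape helper only handles BMP values, which covers every entry in the list.

```java
class SymbolFromHex {
    /** Builds a one-character string from a hex code point such as "263A". */
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    /** Builds the Java source escape for a BMP code point, e.g. \u263A. */
    static String javaEscape(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        if (codePoint > 0xFFFF) {
            throw new IllegalArgumentException("Not a BMP code point: " + hex);
        }
        return "\\u" + String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        String heart = fromHex("2665");
        System.out.println(heart.length());       // 1
        System.out.println(javaEscape("2665"));   // \u2665
    }
}
```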

Code charts and references

Official Unicode 2.0 Documentation - Highly Recommended
Unicode official site -- has lots of standards documents and code charts
Unicode.org - Unicode Official Homepage
Codepoints.net - Unicode Database (Best one)
Unicodepedia.com - Unicode Database
Unicode page at Archiveteam.org
Wikipedia Page
Wikipedia list of Unicode Characters

Author(s) : Æ Firestone

on Sunday, August 25, 2024 | Code, Misc.
