First, check whether I can actually render Korean characters. I know that Unicode supports nearly all of the glyphs (or, more correctly it seems, “ideographs”) needed for Chinese, Japanese, and Korean (typically abbreviated as “CJK”), but I also know that MS Windows fonts are infamous for missing some/most/all of the Unicode characters. There are tens of thousands of them, so this is not all that surprising, except that Microsoft has been shipping OSes localized to those countries for many years now, so I’d hoped they would include at least ONE “CJK + Latin” font on all machines by now, no? No; they don’t.
Quick test program using Java (NB: Java is one of the easier languages for this because it was invented late enough that Unicode had settled down, and so it has extremely good Unicode support built in to the core libraries, unlike C++ et al. By default, Java uses Unicode almost everywhere, so I don’t need to debug Unicode support, yay!):
NB: …but then when I tried it I quickly discovered a problem. I gave up on the obvious route fast, so I might have missed a workaround, but… I could have done this using Java’s built-in GUI toolkit (Swing), which supports using any font as a text label. However, it seems there’s a design flaw in Swing that’s been around since 1997 (yes, folks, 11 years and counting): the only component that can have the font changed – JLabel – which is a “general-purpose” label for any GUI component (good idea!) … is incompatible with the only component capable of being a button – JButton – which is a “general-purpose” button. Crap. If I’m not missing something here, with one small piece of poor API design, Sun have made their *almost* international-friendly GUI system almost completely useless for multi-language GUIs. Just to be clear: it’s not specifically internationalization they’ve broken here: it’s a general issue – any form of customized rendering cannot use the JButton / JLabel combo. Why? Why, why, why would you do such a stupid thing, after going to the effort of making these generic widgets? Oh, well. Custom rendering it is…
/**
 * Simple test program that renders Unicode characters from a particular font of
 * your choice, a couple of thousand at a time, automatically wrapping them based on
 * render width on screen.
 *
 * You need to manually change the font name and the start/end co-ordinates of where
 * in the unicode constellation to render.
 */
import java.awt.*;
import javax.swing.*;

public class PaintSomeUnicodeChars extends JFrame
{
    void setup()
    {
        getContentPane().add( new charPane() );
    }

    public static void main( String[] args )
    {
        PaintSomeUnicodeChars window = new PaintSomeUnicodeChars();
        window.setDefaultCloseOperation( JFrame.EXIT_ON_CLOSE );
        window.setup();
        window.pack();
        window.setSize( 600, 500 );
        window.setVisible( true );
    }
}
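class charPane extends JPanel
{
    @Override
    protected void paintComponent( Graphics g )
    {
        super.paintComponent( g ); // let Swing clear the background first

        int si = 44032; // first code point to render (U+AC00)
        int ei = 46000; // one past the last code point to render

        g.setColor( Color.red );
        g.setFont( new Font( "Verdana", Font.PLAIN, 40 ) );
        g.drawString( "Chars from " + si + " to " + ei, 30, 30 );

        g.setColor( Color.black );
        g.setFont( new Font( "Lucida Sans Unicode", Font.PLAIN, 20 ) );
        FontMetrics metrics = g.getFontMetrics();

        int accumulator = 0; // x position of the next glyph
        int height = 20; // baseline y of the current row
        int xwidth = getWidth();
        for( int i = si; i < ei; i++ )
        {
            // Go via the int (code point) methods rather than a plain char
            char[] chars = Character.toChars( i );
            String drawString = String.copyValueOf( chars );
            int advance = metrics.charsWidth( chars, 0, chars.length );
            if( accumulator + advance > xwidth )
            {
                // Row is full: wrap to the start of the next row
                accumulator = 0;
                height += metrics.getHeight();
            }
            g.drawString( drawString, accumulator, height );
            accumulator += advance;
        }
    }
}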
Why choose Lucida Sans Unicode?
Apart from the fact that it’s present on all modern Windows machines automatically, I cheated a bit and used Character Map (Start > Programs > Accessories > System Tools > Character Map) to quickly inspect each of the fonts already installed and find out which ones had lots of unicode in them. Character Map has a neat feature where it completely ignores missing characters, automatically skipping over them, so you can very quickly see if a font has lots of unicode, some, or very little.
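The same kind of inspection can be roughly approximated from Java itself. The following is only a sketch: the class and method names (FontCoverage, countDisplayable) are made up for illustration, and it assumes Font.canDisplay(int) is a good enough proxy for “this font has a glyph here”. Note that asking for a font name that isn’t installed silently gives you a fallback font, which is why the sketch prints the family name it actually got.

```java
import java.awt.Font;
import java.util.function.IntPredicate;

class FontCoverage
{
    /** Counts the code points in [start, end) that the predicate accepts. */
    static int countDisplayable( IntPredicate canDisplay, int start, int end )
    {
        int count = 0;
        for( int cp = start; cp < end; cp++ )
        {
            if( canDisplay.test( cp ) )
                count++;
        }
        return count;
    }

    public static void main( String[] args )
    {
        // Font name comes from the command line; the default is just the one used in this post
        String name = args.length > 0 ? args[0] : "Lucida Sans Unicode";
        Font font = new Font( name, Font.PLAIN, 20 );

        int start = 44032; // same range as the test program
        int end = 46000;

        // Font.canDisplay(int) reports whether the font has a glyph for a code point
        int found = countDisplayable( cp -> font.canDisplay( cp ), start, end );

        // getFamily() shows which font was really loaded (it may be a fallback)
        System.out.println( font.getFamily() + ": " + found + " of " + ( end - start ) + " code points" );
    }
}
```

Keeping the counting separate from the Font lookup means the loop can be checked without any fonts installed at all.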
Stepping through in the Java app, I fairly quickly found lots of Unicode characters. Success: I can correctly render arbitrary Unicode characters in Java.
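NB: this is important because it took a lot more lines of source than I expected. Java’s built-in “char” datatype is only 16 bits wide, so it cannot hold every Unicode character: anything above U+FFFF does not fit in a single char. That means that, to be safe across the whole of Unicode, you can’t rely on the methods that take a char as argument (this is documented in the Java API docs for Character) and have to use the variants that take an int instead. (The Hangul ranges I’m using here all sit below U+FFFF, so a char would technically have coped, but the int methods work everywhere.) I’d never really used the methods that take ints as argument before…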
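To make the char-versus-int distinction concrete, here is a small sketch. The class name CodePointDemo and the helper fromCodePoint are invented for this example; the calls themselves (Character.toChars, Character.charCount, String.codePointAt, String.codePointCount) are standard library methods.

```java
class CodePointDemo
{
    /** Converts a single Unicode code point to a String, the same way the test program does. */
    static String fromCodePoint( int codePoint )
    {
        return new String( Character.toChars( codePoint ) );
    }

    public static void main( String[] args )
    {
        int hangul = 0xAC00;  // a Hangul syllable, fits in one char
        int beyond = 0x20000; // a code point above U+FFFF, needs two chars

        System.out.println( Character.charCount( hangul ) ); // 1
        System.out.println( Character.charCount( beyond ) ); // 2

        String s = fromCodePoint( beyond );
        System.out.println( s.length() );                        // 2 (counts chars)
        System.out.println( s.codePointCount( 0, s.length() ) ); // 1 (counts code points)
        System.out.println( s.codePointAt( 0 ) == beyond );      // true
    }
}
```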
Fonts…
Not really happy with the font, though, and it’s clearly missing a whole bunch of characters. Some googling for free CJK fonts suggested that Arial Unicode MS would work; it comes free with Microsoft Office (which I have on one of my machines).
NB: the way font licensing works, it’s probably illegal to copy a font you have on one machine to another machine you own, unless you can obtain the font from the original source (e.g. in this case by buying a second copy of MS Office).
Arial Unicode MS has lots of characters, but … they’re wrong. At least, one third of the ranges of Korean characters are completely useless, as far as I can tell, because whoever made this font (whoever Microsoft licensed it from?) didn’t read the spec carefully and stuck the wrong characters in from U+1100-U+11FF. More on this later…
Giving up on that font, I went looking for others. The first four or five I tried that didn’t look ugly were all from download sites in Japan or Korea and kept crashing on the download. I found a couple of places hosting UnBatang (“UnBatangOdal.ttf”) and claiming it was free, but the first I found where the download didn’t crash was on http://www.i18nl10n.com/fonts
UnBatang (the name you have to use in Windows / Java from inside an application in order to load it) seems to work very well. It has nicely painted ideographs for the main range of Hangul characters/syllables, and it has *correct* glyphs for the other two ranges (although they’re a bit ugly).
Unicode Hangul
Specifications for official standards tend to be big. Really big.
Specifications for anything to do with internationalization tend to be huge.
Add in localization and data-transfer between different cultures, and you can expect something gargantuan.
So, I was really really happy to find a downloadable copy of only the CJK Chapter of the Unicode 5.0 specification (it’s still a big document!). This contains a full explanation of what Hangul is in Unicode, where to find it (yay!), and why there are not one but three separate sets of Hangul glyphs.
Here’s the preface to the Chapter 12 PDF I downloaded, in case you want to find the spec / electronic version yourself:
Electronic Edition
This file is part of the electronic edition of The Unicode Standard, Version 5.0, provided for online
access, content searching, and accessibility. It may not be printed. Bookmarks linking to specific
chapters or sections of the whole Unicode Standard are available at
http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html
Purchasing the Book
For convenient access to the full text of the standard as a useful reference book, we recommend purchasing
the printed version. The book is available from the Unicode Consortium, the publisher, and
booksellers. Purchase of the standard in book format contributes to the ongoing work of the Unicode
Consortium. Details about the book publication and ordering information may be found at
http://www.unicode.org/book/aboutbook.html
Unicode 5, Hangul, and Arial Unicode MS
So, armed with the official spec, I read up on Hangul. What did I find?
The Unicode Standard contains both the complete set of precomposed modern Hangul syllable
blocks and the set of conjoining Hangul jamo. This set of conjoining Hangul jamo can
be used to encode all modern and ancient syllable blocks.
(these are the glyphs at U+1100–U+11FF)
“conjoining Hangul jamo” means “these glyphs have been positioned inside their spaces in the font so that if you need to make one ideograph out of, say, four Korean letters, you just pick the top-left version of the first letter, the top-right version of the second, etc, and OVERLAY all the glyphs, and what comes out will automatically be correctly spaced out etc”.
What did the author of Arial Unicode MS do?
Made all those glyphs take up the full available space, and centre them horizontally and vertically.
Why? Seriously, why? Because any application that tries to render those characters is going to render them on top of each other (according to the specification, this is the ONLY point of having those characters), and you won’t be able to read at all what the letters say.
If you want to render just individual letters from the Korean alphabet, there’s a different range of Unicode where you can find them all centred etc (which Arial Unicode MS also has … so it seems to be just copy/pasting internally).
I guess the font author just didn’t read the spec. Or I’m completely misunderstanding the spec. But the fact that UnBatang spaces the conjoining jamos out in such a way that this works as I expected it to suggests to me that it’s the Arial Unicode that’s broken…
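For reference, the two sets are linked by simple arithmetic: the Unicode Standard defines each precomposed modern syllable as a fixed function of its conjoining jamo. The sketch below uses the constants from the standard’s Hangul composition algorithm (19 leading consonants, 21 vowels, 28 trailing slots including “none”). The class and method names are made up for illustration, and using 0 to mean “no trailing consonant” is a convention of this sketch, not of the standard.

```java
class HangulCompose
{
    static final int S_BASE = 0xAC00; // first precomposed syllable
    static final int L_BASE = 0x1100; // first leading consonant (choseong)
    static final int V_BASE = 0x1161; // first vowel (jungseong)
    static final int T_BASE = 0x11A7; // one before the first trailing consonant (jongseong)
    static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

    /**
     * Combines a leading consonant, a vowel and an optional trailing consonant
     * into one precomposed syllable. Pass 0 as trail for "no trailing consonant".
     */
    static int compose( int lead, int vowel, int trail )
    {
        int l = lead - L_BASE;
        int v = vowel - V_BASE;
        int t = ( trail == 0 ) ? 0 : trail - T_BASE;
        boolean badTrail = trail != 0 && ( t < 1 || t >= T_COUNT );
        if( l < 0 || l >= L_COUNT || v < 0 || v >= V_COUNT || badTrail )
            throw new IllegalArgumentException( "not a modern conjoining jamo" );
        return S_BASE + ( l * V_COUNT + v ) * T_COUNT + t;
    }

    public static void main( String[] args )
    {
        // silent leading consonant + "a", no trailing consonant
        System.out.println( String.format( "U+%04X", compose( 0x110B, 0x1161, 0 ) ) );      // U+C544
        // "d" + "a" + trailing "m"
        System.out.println( String.format( "U+%04X", compose( 0x1103, 0x1161, 0x11B7 ) ) ); // U+B2F4
    }
}
```

This only covers the modern jamo; the ancient ones have no precomposed form, which is why the conjoining set is still needed.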
The joy of Conjoining
I couldn’t get hold of the Unicode spec section on how to conjoin, because of some mimetype problems between their server and my PDA’s web browser. I figured I could probably work it out by trial and error fairly quickly.
I can’t remember how to spell my name in Korean yet – I know the letters, I’m just a bit flaky on what the placement of them is. So, a quick experiment, to see how easy it is to position characters using the Conjoining Jamo from UnBatang:
... constants used later - the Unicode values for the letters of my name: ...
int a = 0x1161; // jungseong vowel
int d = 0x1103; // choseong consonant
int d2 = 0x11ae; // jongseong consonant
int m = 0x1106; // choseong consonant
int m2 = 0x11b7; // jongseong consonant
int blank = 0x110b; // silent char you place at front if first letter is a vowel
... the body of the paint method: ...
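g.setColor( Color.black );
g.setFont( new Font( "UnBatang", Font.PLAIN, 30 ) );
int w = 0;    // running width inside the current block
int lead = 0; // x position where the current block starts

// Row 1 (y = 40), first block: silent consonant + a + trailing d
w += renderIdeograph( g, blank, lead, 40 );
w += renderIdeograph( g, a, lead+w, 40 );
w += renderIdeograph( g, d2, lead+w, 40 );

// Row 1, second block: silent consonant + a + trailing m
lead+=30; w = 0;
w += renderIdeograph( g, blank, lead+w, 40 );
w += renderIdeograph( g, a, lead+w, 40 );
w += renderIdeograph( g, m2, lead+w, 40 );

// Row 2 (y = 80), first block: silent consonant + a
lead=0; w = 0;
w += renderIdeograph( g, blank, lead+w, 80 );
w += renderIdeograph( g, a, lead+w, 80 );

// Row 2, second block: leading d + a + trailing m
lead+=30; w = 0;
w += renderIdeograph( g, d, lead+w, 80 );
w += renderIdeograph( g, a, lead+w, 80 );
w += renderIdeograph( g, m2, lead+w, 80 );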
... and then this method to do the paint of each part of an ideograph: ...
/** Doesn't really render an ideograph, renders a single glyph from the Font */
public int renderIdeograph( Graphics g, int i, int x, int y )
{
    char[] chars = Character.toChars( i );
    String drawString = String.copyValueOf( chars );
    int advance = g.getFontMetrics().charsWidth( chars, 0, chars.length );
    g.drawString( drawString, x, y );
    return advance;
}
Which renders exactly like this:
Which makes me want to point out the nice thing about properly specified fonts: in the source code, I simply did “the natural thing”, as if I were outputting characters in an arbitrary conjoined language:
- Render the first part of the first ideograph at (x,y)
- Ask the font to tell you how many pixels wide (w) it just rendered that part
- Render the next part of the first ideograph at (x+w,y)
- …repeat until first ideograph is complete, increasing w more and more each time…
- Choose a value for how much you want the ideographs separated from start to start (ideograph_width)
- Render the first part of the second ideograph at (x + ideograph_width, y)
- …repeat as for first ideograph
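The steps above can be sketched as a small position calculator, separate from any actual drawing. This is an illustration only: BlockLayout and layout are hypothetical names, and the advance function is passed in so that a real program could back it with FontMetrics while the demo uses made-up numbers that were not measured from any real font.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

class BlockLayout
{
    /**
     * Returns the x position for every part of every block.
     * blocks[b][k] is the code point of part k of block b;
     * advance maps a code point to the width the font reports for it.
     */
    static int[][] layout( int[][] blocks, IntUnaryOperator advance, int x, int blockWidth )
    {
        int[][] xs = new int[blocks.length][];
        for( int b = 0; b < blocks.length; b++ )
        {
            xs[b] = new int[blocks[b].length];
            int w = 0; // running width inside this block, reset for each block
            for( int k = 0; k < blocks[b].length; k++ )
            {
                xs[b][k] = x + b * blockWidth + w;
                w += advance.applyAsInt( blocks[b][k] );
            }
        }
        return xs;
    }

    public static void main( String[] args )
    {
        int[][] blocks = {
            { 0x110B, 0x1161 },        // silent consonant + a
            { 0x1103, 0x1161, 0x11B7 } // d + a + trailing m
        };
        // Made-up advances for the demo: leading consonants 30 wide, everything else 0
        IntUnaryOperator fakeAdvance = cp -> ( cp < 0x1161 ) ? 30 : 0;
        System.out.println( Arrays.deepToString( layout( blocks, fakeAdvance, 0, 30 ) ) );
    }
}
```

In a paint method the advance could be supplied as something like cp -> metrics.charsWidth( Character.toChars( cp ), 0, Character.charCount( cp ) ), where metrics is the Graphics object’s FontMetrics.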
And it worked. First time. I didn’t expect it to – I expected to have to do something strange like manually “reset” the (w) value each time I went from the first line of the ideograph to the second (Korean orders letters top-left, top-right, bottom-left, bottom-right, (repeat) … as opposed to Latin which is just left, right, more right, even more right, etc).
For this to have worked, it means the font is deliberately rendering the lower letters (e.g. d2 and m2 in my constants) a long way to the left of the origin that you tell it to render them at. This would be very, very confusing if you just tried to render these letters individually (well, duh).
Of course, if you try to run that code using Microsoft’s Arial Unicode MS font, you get a complete mess instead, because that font is FUBAR, as mentioned before. You get this:
…which is completely incomprehensible.
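One possible way around a font like that, sketched here without having been tried against Arial Unicode MS, is to avoid the conjoining range altogether and hand the font precomposed syllables instead. The standard library’s java.text.Normalizer can do the conversion via NFC normalization; the class name JamoToSyllables and its compose helper are invented for this example, and it assumes the font’s precomposed syllable range is usable.

```java
import java.text.Normalizer;

class JamoToSyllables
{
    /** Composes conjoining jamo into precomposed syllables where Unicode defines one. */
    static String compose( String jamo )
    {
        return Normalizer.normalize( jamo, Normalizer.Form.NFC );
    }

    public static void main( String[] args )
    {
        // The jamo from the second row of the experiment: silent + a, then d + a + m
        int[] codePoints = { 0x110B, 0x1161, 0x1103, 0x1161, 0x11B7 };
        String jamo = new String( codePoints, 0, codePoints.length );

        String syllables = compose( jamo );
        System.out.println( jamo.length() + " chars in, " + syllables.length() + " chars out" );
    }
}
```

The resulting string can then be drawn with a single drawString call, leaving the positioning of the letters inside each block to the font’s precomposed glyphs.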