By Cal Henderson, October 21st 2009.
If you haven't been locked in a small box for the past year, then you'll be familiar with the screenshot on the left - the Emoji keyboard on the iPhone. If you haven't unlocked yours yet, just go into the app store and search for 'emoji' to find an app.
Emoji is the Japanese term for small emoticon pictures. The word 'e moji' literally means 'picture letter'. They've seen widespread adoption in the Japanese mobile market, with multiple providers adding the ability to send these pictograms as part of text messages, emails and on web pages.
This is achieved via the Unicode Private Use character space, a large set of code points where vendors can define their own meanings. This is all well and good, but of course nobody agrees on what code points should mean which pictures.
The players in the space for the most part were NTT DoCoMo, KDDI/Au and Softbank/Vodafone. More recently, Apple have added emoji support to the iPhone and Google to the Android mobile operating system. While Apple chose to use the Softbank/Vodafone mappings, all of the others are distinct, creating four different standards.
Exchanging little pictures in SMS is pretty fun for while, but why should that fun be limited to SMS? If we're building web apps which receive data from mobile devices or display data on them, we should be able to leverage these little critters. We can even create them ourselves as iconography without the cost of creating and serving image files.
But this isn't trivial - different devices use different code points (an iPhone can't display a smiley face entered on an Android phone) and people will be using your web app with a regular desktop browser too. Luckily, there's a reasonable solution to all these problems.
There's a Unicode proposal from folks at Google and Apple to standardize emoji down to a single set of code points. While these code points haven't yet been formalized, this proposal contains a way to map each of the four different schemes. By using this unified mapping internally within our application, we can convert from any input scheme to any other output scheme.
At input time, we need to know where the data has come from, since some of the different sets overlap - both Softbank's 'frog face' and KDDI's 'small white square' are represented by U+E531. If we convert from the native encoding to the unified encoding for storage, we don't need to store the format. At output time, we can detect which platform we're delivering content to and convert from the unified set to the platform specific set.
Not all of the platform's emoji sets are equal though, so some symbols appear in only one set, or appear multiple times. For instance, KDDI has an emoji called 'large white square' with a code point of U+E548. DoCoMo doesn't have any equivalent at all, while Softbank has U+E21B meaning 'white rounded square' which can be used in its place. The proposal includes one-way mappings for some missing characters and replacement text for others - while all emoji are not supported by all platforms, there is a recommended replacement for them.
This is all well and good, but what should we do if someone enters emoji on their device which we want to display to non-mobile users? The unified set are Unicode code points which are defined by some fonts, but the pictures will be monotone and are missing from most common web fonts.
With a smart bit of output-processing, we can replace characters from the unified set with HTML to show inline images. At display time, our code can search for instances of the UTF-8 byte sequence
0xE2 0x98 0xBA (known to friends as U+263A, the smiley face) and replace them with
<img src="/emoji/smiley_face.gif" class="emoji" />, or whatever HTML we wish to use.
The unified set is pretty big, covering 772 characters. Performing a search and replace at display time is going to take a couple of milliseconds, so for heavily trafficked applications, we might need to start worrying about performance. Luckily there are a few tricks.
Assuming most of your content doesn't contain emoji, you can add an extra field to your database rows called
contains_emoji. By setting this at insert time, you can only spend time transcoding the emoji at output time if there are actually any in there.
If your application has lots and lots of emoji content, then storing multiple versions of your content with different emoji formats in might make sense. If you know you're primarily serving to DoCoMo users, then store both a unified and a DoCoMo copy. When you need to serve to DoCoMo, use that version, else convert from the unified version.
I've put together a simple PHP library to abstract all of the messiness away. Just include it in your application and start detecting emoji right away. Instructions are included in the readme file.
While I'm a bit of a Unicode nerd, I'm no expert. The above article may contain mistakes or fail to mention very important things. If you spot any glaring mistakes or omissions, drop me an email and teach me: cal [at] iamcal.com
Copyright © 2009 Cal Henderson.
The text of this article is all rights reserved. No part of these publications shall be reproduced, stored in a retrieval system, or transmitted by any means - electronic, mechanical, photocopying, recording or otherwise - without written permission from the publisher, except for the inclusion of brief quotations in a review or academic work.
All source code in this article is licensed under the GPL v3.