Have you ever wondered why certain characters don’t seem to show up right on your phone or in your web browser? When what should ostensibly be letters, numbers or even, yes, an emoji shows up as a dreaded square? While sending your messages in English or French may never be an issue, for some users their phone is a major obstacle to sending and receiving messages properly in their language of choice.
For most of human history, tofu, or the blank squares standing in lieu of properly rendered letters, simply did not exist. While the origins of human writing remain opaque, what is clear is that once we did have writing systems, being able to read and write sufficed for transmitting linguistic messages graphically. You simply took out your writing utensil and applied it to the canvas of your choice. In today’s digital age, however, literacy does not necessarily translate into the ability to send or receive words in one’s own tongue.
This stems from the fact that digital messages are ultimately mediated by digital translations of what most humans decode simply by looking at a screen. When you send a text message, it is actually an electrical signal composed of 1s and 0s that must be interpreted by your friend’s phone. Computer programmers refer to this sequence of information as a string. For your friend’s computing device (that is, their new iPhone) to render your message into a language in which they are literate, it must have a means of decoding strings. To do this, computing devices rely on an encoding, or a set of rules for parsing strings into a set of characters. While the word “character” is popularly used to refer to letters and numbers, for a computer programmer it is a technical term that at its most basic designates a particular numeric value. For this reason, characters are also sometimes referred to as code points. A hypothetical illustration may be useful:
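As one such illustration (sketched here in Python, since its standard library exposes these concepts directly), the round trip from text to bits and back might look like this:

```python
# A text message travels as bytes: here we take the string "Hi",
# inspect its code points and raw 1s and 0s, then decode it back.
message = "Hi"

# Each character maps to a numeric code point.
code_points = [ord(c) for c in message]
print(code_points)  # [72, 105]

# Encoding turns those code points into a concrete byte sequence.
data = message.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in data)
print(bits)  # 01001000 01101001

# The receiving device applies the same rules in reverse.
print(data.decode("utf-8"))  # Hi
```

If sender and receiver agree on the encoding, the message survives the trip; if they applied different rules to the same bits, the result would be garbage or tofu.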
Does this explain why tofu appears on your cellphone screen? Do you simply lack the proper encoding to interpret the original written message? It could, but in practice today that is rarely the issue, thanks to Unicode. In California in 1991, a group of computer programmers and technologists gathered to devise a single encoding system that they hoped would be adopted universally across devices. The issue came to the fore because as computer usage spread beyond the Western circles where it originally flourished, it became apparent that the original character encodings were not sufficient for the world’s languages. The dominant predecessor to Unicode, ASCII, was constrained by the number of digits that it had available for characters. As with phone and social security numbers, encodings can only go as far as the maximum code permits. ASCII parsed strings as sequences of 7 binary digits, which yields only 128 possible code points. That total was wildly inadequate for providing digital placeholders for the letters and symbols of all of the world’s languages. As such, the Unicode Consortium proposed a new encoding system, Unicode, that took ASCII’s characters as its starting point but allowed for over a million additional code points. Furthermore, as a non-profit dedicated to internationalizing the web, the group also laid out an official, defined process for submitting and judging new character proposals.
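To make the size constraint concrete, here is a small Python sketch of ASCII’s 7-bit ceiling versus Unicode’s much larger code space:

```python
# ASCII's 7 bits allow only 2**7 = 128 code points.
print(2 ** 7)  # 128

# A character outside that range, like "é" (U+00E9), cannot be
# encoded in ASCII at all...
try:
    "é".encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot represent é")

# ...but UTF-8, a Unicode encoding, handles it with two bytes.
print("é".encode("utf-8"))  # b'\xc3\xa9'

# Unicode's code space runs from U+0000 through U+10FFFF.
print(0x10FFFF + 1)  # 1114112 possible code points
```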
Why then do characters still sometimes appear as tofu and not as their appropriate graphic representation if we are (almost) all using Unicode encoding? This stems from the intermediate layer of fonts. From a computer programming perspective, a font is a set of mediating instructions between a code point or character and an actual graphic representation on your device’s screen. If this sounds a lot like something from linguistics, then you are not too far off, because computer programmers, typographers and orthographic developers have also seen the parallels and named their concepts in the same vein. The actual surface realization of a letter (as defined by a font) is a glyph, whereas the underlying category represented by a code point or character is referred to as a grapheme. Thus an encoding system such as Unicode is essentially a list of code points (e.g., 10010) assigned to the letters, symbols and numbers (that is, the graphemes) of the world’s languages. Applied to a concrete example, this would mean that a single handwritten cursive-e is a particular glyph (written in, let’s say, “hand-written cursive font” — no, not available online) stemming from the grapheme <e>.
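The grapheme/glyph split is visible in code: a program only ever handles the abstract character and its code point, while the glyph is supplied later by whatever font your device uses. A small sketch with Python’s `unicodedata` module:

```python
import unicodedata

ch = "e"
# The code point identifies the grapheme, independent of any font.
print(hex(ord(ch)))          # 0x65
print(unicodedata.name(ch))  # LATIN SMALL LETTER E

# A cursive "e" in a handwriting font and an "e" in a serif font
# are different glyphs, but to the computer both are this single
# code point; chr() recovers the character from the number.
print(chr(0x65))             # e
```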
As we all know, though, and as Unicode would likely emphasize, this grapheme <e> is not inherently part of any language; it is rather part of a writing system, or a collection of graphemes that can be put to use in developing an orthography, a set of conventions for graphically representing a particular language. Script in this usage is a near synonym for writing system, though in other contexts one could argue that a script is a strictly defined set of graphemes whereas a writing system refers to a more abstract overarching category that encompasses multiple scripts that are clearly related. For instance, in Arabic one speaks of naskh (Mashriqi), nastaliq and Maghribi “scripts” which, while all distinct, are all acknowledged as variants of the more general Arabic script or writing system. Typographers and programmers continue to debate whether the various scripts of Arabic merit their own code points or whether the distinct traditionally hand-written styles could simply be covered by different fonts.
Regardless, the point is that as a human you don’t download fonts — if you’ve received my handwritten note in cursive and you have been introduced to this style of writing (or “font,” to put it in computer terms), then you will read and interpret it (hopefully correctly). Presuming that someone’s handwriting or chosen style is not too deviant, you can likely read messages written in a style that you’ve never even encountered before. A computer, however, would hiccup in this kind of situation. While the underlying strings of code would be interpretable to the computer (that is, it could parse the strings for the code points meant to represent the graphemes of its encoding), without a font that provides a glyph for a character, it cannot cue up a visual representation for the recipient to read.
This nuance about fonts explains why some letters appear as tofu when typing less dominant writing systems. For instance, N’ko, a non-Latin-, non-Arabic-based script for writing Manding, was proposed to the Unicode Consortium in 2004 (see this document for the official proposal) and subsequently encoded, and therefore, in theory, any computer with proper Unicode support, which is nearly all modern machines and mobile phones, is technically capable of decoding N’ko characters. This unfortunately has not translated into actual rendering that would allow N’ko to be digitally read and written by its users. This is because font designers frequently do not design their fonts to cover the entirety of the roughly 120,000 Unicode characters. As such, N’ko students and businesses have designed a number of workarounds over the years to facilitate its appearance on devices (see my previous post on writing in N’ko on Android).
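You can see the decoding side working even when the rendering side fails: Python, for example, has known about N’ko characters since the script entered Unicode, regardless of whether any installed font can draw them. A quick sketch:

```python
import unicodedata

# U+07CA is the first letter of the N'ko block (U+07C0 through U+07FF).
nko_a = "\u07ca"
print(unicodedata.name(nko_a))  # NKO LETTER A
print(hex(ord(nko_a)))          # 0x7ca

# Encoding and decoding round-trip without complaint; if this
# character shows as tofu on your screen, the gap is in the font,
# not the encoding.
assert nko_a.encode("utf-8").decode("utf-8") == nko_a
```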
But if N’ko didn’t appear within apps (except Mozilla’s browser, for which there was an N’ko extension) on my previous Android phone, why did it suddenly start appearing after I purchased a new Nexus 5X?
This can presumably be chalked up to Google’s Noto font initiative. Essentially, Google has responded to the issue of Unicode-recognized graphemes (such as N’ko’s) appearing as tofu by funding the development of a series of fonts that are meant to cover ALL of the encoding’s characters. Presumably the use of Noto (yes, that comes from “No to(fu)”) fonts on my new Android device is to thank for my new ability to receive and read texts, emails and Skype messages in N’ko.
Noto fonts, however, leave one issue unresolved. While less commonly used scripts may now properly appear on people’s devices, the fonts do not mean that one can craft messages. Keyboards are an entirely separate issue. Companies or individuals also need to devise input systems for their devices that allow people to easily pen texts in their chosen Unicode-recognized graphemes, rendered properly thanks to Noto fonts. This isn’t so easy as far as I can tell. First, creating a keyboard requires coding knowledge. I, for instance, have failed to create an N’ko keyboard for Android in the couple of hours that I’ve dedicated to the subject (though there are two available through the Android market: the first one is missing two letters and the other is a bit small, which explains my interest in doing it at all). Second, developers also need knowledge of how the keyboard should be designed and where to place characters, diacritics, etc. This typically requires the input of an actual user of a particular orthography, because input systems are keyed not to scripts but to languages (you, for example, can switch your iPhone between English and Spanish systems without ever leaving the Latin script). Now that encoding and font concerns are gone for many languages’ writing systems, perhaps the next easily overcome technical hurdle is keyboard design. If there isn’t a big enough market for tech companies to provide the keyboards themselves, and if no governments or associations are able to overcome the issue, maybe the best solution would be to develop a simple tool that allows users to design their own keyboards using Unicode encoding and Noto fonts (a script/font activist who formerly worked with N’ko students and teachers set up a page that explains how to do just that, but unfortunately for desktops only).
I welcome all comments on the errors that I’ve made re: UTF-8 versus UTF-16 and other tech things that I don’t understand or get right. I also welcome comments about everything else I’ve gotten wrong. Or maybe you can help me and others make keyboards?