Understanding Character Encoding: Why Your Emojis Sometimes Show Up as Boxes

Have you ever wondered how computers understand letters and symbols when all they can process are numbers? In this blog, we'll explore character encoding, a critical concept that allows computers to interpret human language. We'll discuss why encoding is necessary, the journey from ASCII to Unicode, and why UTF-8 is now dominant.
Computers Only Understand Numbers
At their core, computers are glorified calculators; everything they “see” is just streams of 0s and 1s. So when you hit the “A” key, that press gets translated into the number 65. Hit “a,” and the computer logs 97. Even the space between your words is a number (32). These mappings from characters to numbers are the character encoding we are talking about here. They let software agree on exactly which bit pattern means “A” versus “a” versus “!” and so on.
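If you're curious, you can see this mapping for yourself. Here is a tiny Python sketch using the built-in ord() and chr() functions:

```python
# The character-to-number mapping described above, using Python's built-ins.
print(ord("A"))   # 65
print(ord("a"))   # 97
print(ord(" "))   # 32

# And the reverse direction: from number back to character.
print(chr(65))    # 'A'
```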
Early pioneers each had their own favorite mappings. One company’s “A” wasn’t always another’s “A,” so exchanging text between systems was like speaking two dialects and hoping they understood each other. Standardization (first in the form of ASCII) brought everyone onto the same page. Well, at least for a set of 128 characters.
But technology marches on, and today’s world needs to support hundreds of thousands of symbols: Arabic letters and tashkeel (diacritics), accented letters (é, ü), Chinese characters (你), and even emojis. That’s where Unicode comes in: a massive catalog assigning every character its own unique code point. However, there are different ways to turn those code points into bytes, as we’ll discuss later, and some approaches are smarter than others.
When Encoding Goes Wrong
Imagine you send a friend a cheerful “🎉” emoji, but when they open it, all they see is a little empty box “�”. That’s your phone speaking one type of encoding, but their device only “speaks” an older encoding that doesn’t know the code point for 🎉. Since there’s no match, the computer defaults to a placeholder (a sad empty box, sometimes called “tofu”). We don’t see these little boxes as much nowadays, thanks to Unicode, as we’ll discuss further down.
Encoding mismatches can lead to funny glitches, like long strings of �� popping up in place of special punctuation, or entire paragraphs of Japanese text turning into nonsense. That’s why modern systems have largely settled on UTF-8, a clever way to pack every Unicode character into a sequence of bytes while keeping older ASCII code points intact. No more mystery boxes, just smooth multilingual, emoji-filled conversations.
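To make the mismatch concrete, here is a small Python sketch. It encodes the 🎉 emoji as UTF-8 and then decodes those bytes as if they were Windows-1252 (a common legacy encoding), producing the classic “ðŸŽ‰” gibberish; it also shows the � placeholder appearing when a byte simply can’t be decoded:

```python
# The 🎉 emoji is four bytes in UTF-8.
party = "🎉".encode("utf-8")
print(party)                    # b'\xf0\x9f\x8e\x89'

# Decoding those bytes with the wrong encoding produces mojibake.
print(party.decode("cp1252"))   # 'ðŸŽ‰'

# A byte that is never valid UTF-8 becomes the � replacement character.
print(b"\xff".decode("utf-8", errors="replace"))  # '�'
```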
ASCII: Where It All Started
In the 1960s, the American Standard Code for Information Interchange (ASCII) was developed. ASCII is a 7-bit encoding scheme providing 128 possible character representations. This covered:
- English alphabet (A–Z and a–z)
- Numbers (0–9)
- Common punctuation and control characters (new line, backspace, etc.)
ASCII standardized characters universally, meaning a file created in Rihal’s office in Oman could be understood exactly the same way on any computer worldwide.
However, ASCII only catered to English. Therefore, the word “Rihal” can be encoded in ASCII but not “رحال”, and that was ASCII’s biggest limitation: it only supports a very limited character set.
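Here is a quick Python sketch of that limitation: the Latin name encodes fine, while the Arabic one raises an error:

```python
# "Rihal" fits inside ASCII's 128 characters...
print("Rihal".encode("ascii"))      # b'Rihal'

# ...but the Arabic "رحال" does not, so the ASCII codec refuses it.
try:
    "رحال".encode("ascii")
except UnicodeEncodeError as err:
    print(err)                      # 'ascii' codec can't encode characters...
```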
Expanding Beyond ASCII: The Chaos of Multiple Encodings
To support global languages, additional encoding standards emerged. ISO 8859-1 (Latin-1), for instance, extended ASCII to include European characters like "é" and "ç". Similarly, ISO 8859-6 introduced Arabic characters.
Yet, the proliferation of standards led to confusion. Imagine a report from Rihal containing Arabic text, created using ISO 8859-6, but opened by software expecting ISO 8859-1; the text would appear corrupted.
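Python ships codecs for both standards, so we can sketch exactly that scenario: Arabic text saved as ISO 8859-6 but read back as ISO 8859-1 comes out as accented Latin gibberish:

```python
# Arabic text stored with the Arabic legacy encoding...
arabic_bytes = "رحال".encode("iso-8859-6")

# ...but opened by software expecting Latin-1: mojibake.
print(arabic_bytes.decode("iso-8859-1"))   # accented Latin gibberish, not Arabic

# Reading it with the matching encoding restores the original text.
print(arabic_bytes.decode("iso-8859-6"))   # 'رحال'
```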
This confusion highlighted the urgent need for a unified encoding system.
Unicode: One Encoding for All
To address encoding fragmentation, Unicode was introduced in the late 1980s as a universal encoding system. Unicode assigns a unique numeric code (code point) to every character across languages, from the Latin "A" (U+0041) to Arabic letters like "ش" (U+0634), and even emojis like "😀" (U+1F600).
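Those code points are easy to inspect in Python, where ord() returns the Unicode code point of a character:

```python
# Printing the Unicode code point of each character mentioned above.
for ch in ["A", "ش", "😀"]:
    print(ch, f"U+{ord(ch):04X}")
# A U+0041
# ش U+0634
# 😀 U+1F600
```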
Unicode streamlined global digital communication. Today, whether you're reading an official document from Oman’s government or an email from a fellow Rihal employee, Unicode ensures all devices interpret the characters the same way.
UTF-8: Encoding Unicode Efficiently
While Unicode assigns the numbers, UTF-8 (Unicode Transformation Format, 8-bit) efficiently translates these numbers into actual bytes. UTF-8 is a variable-length encoding:
- 1 byte: For basic ASCII characters (0–127), like "A".
- 2 bytes: For characters like "é" or Arabic letters such as "م".
- 3 bytes: For more complex scripts, such as Chinese or Japanese.
- 4 bytes: For rare symbols and emojis.
UTF-8 cleverly distinguishes multi-byte characters through the leading bits of each byte: a byte starting with 0 is a plain ASCII character, a byte starting with 11 begins a multi-byte sequence, and bytes starting with 10 continue it. This is what ensures backward compatibility with ASCII.
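A short Python sketch makes the four tiers visible; encoding one character from each tier shows both the byte count and the byte patterns just described:

```python
# One character from each UTF-8 length tier: ASCII, Arabic, Chinese, emoji.
for ch in ["A", "م", "你", "🎉"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# A 1 b'A'
# م 2 b'\xd9\x85'
# 你 3 b'\xe4\xbd\xa0'
# 🎉 4 b'\xf0\x9f\x8e\x89'
```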
Additionally, UTF-8’s variable-length nature makes it highly space-efficient. For example, the word “Oman” (U+004F, U+006D, U+0061, U+006E) requires only 4 bytes in UTF-8 (1 byte per character), whereas in UTF-32 it uses 16 bytes (4 bytes each). In contrast, the Arabic name “عمان” (U+0639, U+0645, U+0627, U+0646) is 8 bytes in UTF-8 (2 bytes per character) but still costs 16 bytes in UTF-32. This efficiency is especially valuable for multilingual web pages that mix English and Arabic content.
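You can verify those byte counts yourself; the sketch below uses the "utf-32-le" codec because plain "utf-32" in Python prepends a 4-byte byte-order mark:

```python
# Comparing UTF-8 and UTF-32 sizes for the two words discussed above.
for word in ["Oman", "عمان"]:
    print(word, len(word.encode("utf-8")), len(word.encode("utf-32-le")))
# Oman 4 16
# عمان 8 16
```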
Why UTF-8 Dominates
UTF-8 quickly became dominant due to several advantages:
- Backward Compatibility: any text already encoded in ASCII is also valid UTF-8, so it keeps working unchanged (see the sketch after this list).
- Space Efficiency: Frequently used characters consume fewer bytes, as shown above in the example of encoding “Oman” and “عمان”.
- Global Standard: Imagine writing software that only works with one language or region-specific encoding; maintaining and scaling it would be a nightmare. UTF-8 solves this by being a universal format supported by nearly all modern systems, allowing developers at Rihal and beyond to build applications that seamlessly handle multilingual content.
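Here is a minimal sketch of that backward compatibility: bytes written as plain ASCII decode identically under the UTF-8 codec:

```python
# Pure ASCII bytes are already valid UTF-8, byte for byte.
ascii_bytes = "Hello, Rihal!".encode("ascii")
print(ascii_bytes.decode("utf-8"))                                  # 'Hello, Rihal!'
print(ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8"))   # True
```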
Legacy Encodings Still in Use
Although UTF-8 is now the global standard, older encodings such as ISO 8859-6 (for Arabic) or Windows-1256 haven’t disappeared entirely. They still linger in legacy systems, archived documents, and older databases, especially those built before Unicode’s widespread adoption. Engineers may run into these formats when migrating old systems, scraping historic data, or integrating with outdated software. Understanding how to detect and convert these encodings to UTF-8 remains a valuable skill in modern software development.
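Conversion itself is usually the easy part once you know (or have guessed, for example with a detection library such as chardet) the source encoding. Here is a minimal Python sketch; the Arabic sample text is a made-up stand-in for real legacy data:

```python
# Pretend these bytes came out of an old Windows-1256 (Arabic) database.
legacy_bytes = "تقرير رحال".encode("windows-1256")

# Decode with the legacy codec, then re-encode as UTF-8 for modern systems.
text = legacy_bytes.decode("windows-1256")
utf8_bytes = text.encode("utf-8")

print(text)
print(len(legacy_bytes), "bytes in Windows-1256 vs", len(utf8_bytes), "bytes in UTF-8")
```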
Conclusion
Character encoding is the invisible framework that allows computers to understand and display human language correctly. What began with ASCII has grown into the powerful and universal Unicode system, with UTF-8 leading the way as the most widely used encoding today. It enables seamless communication across languages, platforms, and devices, whether you're building a local Arabic website or a global application.
So, the next time you come across a file filled with weird symbols or empty squares, don’t panic. It’s likely just a mismatch in encoding, and now you understand exactly what's going on behind the scenes.