UTF-8: Why Your Default Choice Might Not Always Be Right
Tom Wetjens
UTF-8 has become the de facto standard for text encoding, used by roughly 97% of websites. Most developers reflexively choose it without understanding what they're actually getting, or what they might be missing. This talk dives deep into how UTF-8 actually works under the hood, from its brilliant variable-length design to its backward compatibility with ASCII, and explores the engineering trade-offs that make it both a triumph and occasionally the wrong choice.
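As a taste of that variable-length design, here is a minimal TypeScript sketch, assuming only a runtime with the standard TextEncoder API (Node.js or a modern browser):

```typescript
// UTF-8's variable-length design: ASCII characters keep their single-byte
// encoding, while other code points take 2-4 bytes.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

for (const ch of ["A", "é", "€", "😀"]) {
  const bytes = encoder.encode(ch);
  const hex = [...bytes].map((b) => b.toString(16).padStart(2, "0")).join(" ");
  console.log(`${ch} -> ${bytes.length} byte(s): ${hex}`);
}
// A  -> 1 byte(s): 41          (identical to ASCII)
// é  -> 2 byte(s): c3 a9
// €  -> 3 byte(s): e2 82 ac
// 😀 -> 4 byte(s): f0 9f 98 80
```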
We’ll examine UTF-8’s elegant bit patterns, understand why it won the encoding wars against UTF-16 and UTF-32, and discover the surprising scenarios where other encodings might serve you better: embedded systems where UTF-32’s fixed width allows constant-time code-point indexing, Asian-language applications where UTF-16 can be more space-efficient, or legacy systems where Latin-1 is still king.
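A rough sketch of two of those points, again assuming the standard TextEncoder API; the UTF-16 sizes here are simply two bytes per UTF-16 code unit:

```typescript
const utf8 = new TextEncoder();

// Compare encoded sizes: UTF-8 wins for ASCII-heavy text, UTF-16 for CJK text.
function sizes(s: string) {
  return { utf8: utf8.encode(s).length, utf16: s.length * 2 };
}

console.log(sizes("hello world"));    // { utf8: 11, utf16: 22 }
console.log(sizes("こんにちは世界")); // { utf8: 21, utf16: 14 }

// The UTF-8 lead byte announces the sequence length:
// 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4,
// 10xxxxxx -> continuation byte.
const lead = utf8.encode("こ")[0];
console.log(lead.toString(2).padStart(8, "0")); // "11100011" -> a 3-byte sequence
```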
Through real-world examples—from PostgreSQL’s encoding decisions to JavaScript’s UTF-16 strings causing emoji bugs—you’ll gain the deep understanding needed to make informed encoding choices rather than following convention blindly.
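The emoji bug class is easy to reproduce; a small TypeScript sketch of what UTF-16 string semantics look like in practice:

```typescript
// JavaScript/TypeScript strings are sequences of UTF-16 code units, so emoji
// outside the Basic Multilingual Plane occupy a surrogate pair of two units.
const face = "😀";

console.log(face.length);                       // 2 -> code units, not "characters"
console.log([...face].length);                  // 1 -> iteration is code-point aware
console.log(face.slice(0, 1));                  // "\ud83d" -> a lone surrogate, not valid text
console.log(face.codePointAt(0)?.toString(16)); // "1f600"
```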