Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

## Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

import unidecode for x in ["Poincaré", "Gödel"]: print(x, unidecode.unidecode(x))

This produces

Poincaré Poincare Gödel Godel

Installing the `unidecode`

module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages,

But if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

>>> from unidecode import unidecode >>> unidecode("北亰") 'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that `unidecode`

transliterates both to “Bei Jing.”)

## Projections

I titled this post “Projecting Unicode to ASCII” because this code is a **projection** in the mathematical sense. A projection is a function *P* such that for all inputs *x*,

*P*( *P*(*x*) ) = *P*(*x*).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say *P* is **idempotent**. [1]

The `unidecode`

function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface to design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

## Related posts

- Contractions and fixed points
- Area of a triangle and its projections
- Converting between Unicode and LaTeX

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as *linear* idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

This post reminds me of the (what is probably old enough to be called a) Spotify case study, where accounts were hijacked because of an underlying function was guaranteed to be idempotent (and therefore assumed to be), but only for a certain domain of inputs. When this condition failed, equality checks such as P(P(x)) = P(x) failed and this was taken advantage of to reset passwords.

https://labs.spotify.com/2013/06/18/creative-usernames/

TYPO: It’s “北京”, not “北亰”.

北京 is right, Chinese here, live in Beijing.

Also see wikipedia: https://en.wikipedia.org/wiki/Beijing ,notice the top right on this page.

Thanks. What does the typo mean? Is it reasonable to also map it to “Bei Jing”?

They are the same character in some sense. 京 is the newer simplified form and is the correct one to use in the name of the city, Beijing.

https://en.m.wikipedia.org/wiki/%E4%BA%AC

It would be similar to spelling Austin as Auſtin. Same sound and meaning, but one is no longer used.

The correct projection for ‘Gödel’ is ‘Goedel’.

I do understand that this here is about finding the “base rune” that is available as ASCII char from a given unicode rune, but that doesn’t solve the real problem; it merely solves a purely technical problem.

Btw, unicode imo isn’t a solution but a problem. I think that a 16-bit based system that would provide at least the commonly used letters or symbols (Chinese, for instance, has plenty of symbols that are not used in everyday life) plus common special symbols (e.g. basic math, currencies) would have been much much better.

For all ancient language runes or highly specilialized scientific runes, musical ones, etc. one could have created additional planes. Having a single code in the “base plane” as well as all additional planes would allow for switching (…). Such, one could, if needed, (e.g. in scientific papers) simply switch from the base plane to another back (and back).

The module documentation addresses the transliteration of German umlauts in the FAQ section: