
Saturday, April 25, 2009

BiDi – What is it, why we need it

BiDi stands for Bi-Directional text. Yes, specifically text. In some places (such as the Word text editor) this is referred to as Complex Scripts.

So what exactly is it? It is the case where several languages with different writing directions are mixed in a single sentence. More broadly, it also covers any case where a Right-To-Left language (such as Hebrew, Arabic or Persian) needs to be rendered to the screen.

Why is this a problem? Because there is a dilemma about how to store textual data that contains both writing directions. On the one hand, it can be stored visually, just as it appears on the screen. That makes it much easier to render, spares the rendering phase the burden of reordering, and improves its performance. On the other hand, it makes the writing phase much harder, and requires us to write some of the text reversed. This is exactly what people were doing before the age of Logical Text and the BiDi algorithm.

The BiDi algorithm brought a new age of Logical Text, and by that we mean the way the text is stored in memory. Instead of storing it the way it is rendered, we store it the way it is being read and written. When we read text that contains several text directions (Hebrew and English mixed, for example), we change our reading direction every time the language is changed. When we write text that contains several directions, we still write the first letters first, and last letters last, even when the language changes. We don’t reverse the order of characters for that task.
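To make "logical order" concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not the language of NBiDi). It stores a mixed English and Hebrew string in the order it is read and typed, and lists the Unicode bidi class of each character with the standard `unicodedata` module. The sample string is an arbitrary assumption.

```python
import unicodedata

# Logical order: characters are stored in the order they are read and typed.
# Here: "abc", a space, then the Hebrew letters alef, bet, gimel.
text = "abc \u05d0\u05d1\u05d2"

# unicodedata.bidirectional() returns the bidi class of a character:
# "L" = left-to-right letter, "R" = right-to-left letter, "WS" = whitespace.
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]

for index, (ch, cls) in enumerate(zip(text, bidi_classes)):
    # Print code points instead of glyphs, so the terminal cannot reorder them.
    print(index, f"U+{ord(ch):04X}", cls)
```

Note that alef sits at index 4, right after the space, even though on screen it would be drawn at the right-hand end of the Hebrew word.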

Storing the text logically makes much more sense, and makes it easier to develop software that handles text. The software treats text the same way we think about it. However, in order to draw the characters on the screen in their correct positions, we first need to reorder them.

Rendering text is not an easy task. Rendering each line requires all the letters (and other characters) to already be in their correct positions, since for efficiency we usually want the renderer to make a single pass. Before that, if we want to break our lines to fit a given width, we need to calculate the widths of the characters and choose where to break the line logically, as doing it visually might yield an unreadable result.
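A rough sketch of breaking lines in logical order could look like the following. The function name `break_lines_logically`, the `char_width` parameter, and the assumption that every character is one unit wide are all hypothetical; a real renderer would measure glyph widths with its font engine and use proper line-breaking rules rather than splitting on spaces.

```python
def break_lines_logically(text, max_width, char_width=lambda ch: 1):
    """Greedy word wrap on the LOGICAL string (before any visual reordering).

    char_width is a stand-in for real font metrics; by default every
    character is assumed to be one unit wide.
    """
    lines, current, width = [], [], 0
    for word in text.split(" "):
        word_width = sum(char_width(ch) for ch in word)
        # Account for the separating space if the line already has a word.
        extra = word_width + (char_width(" ") if current else 0)
        if current and width + extra > max_width:
            lines.append(" ".join(current))
            current, width = [word], word_width
        else:
            current.append(word)
            width += extra
    if current:
        lines.append(" ".join(current))
    return lines


# Mixed English/Hebrew text in logical order, wrapped at 6 units.
logical = "abc \u05d0\u05d1 \u05d2\u05d3 def"
lines = break_lines_logically(logical, max_width=6)
```

Each resulting line is a contiguous slice of the logical text, so the reading order is preserved across the line break; only afterwards is each line reordered for display.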

The reordering is done at the line level, after the text has been broken into lines. The BiDi algorithm takes each line and returns a new line with all the characters reordered, ready to be rendered.
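The shape of this step (one logical line in, one visually ordered line out) can be sketched with a deliberately simplified toy. This is not the Unicode BiDi algorithm: it assumes a left-to-right base direction, ignores control characters, numbers, mirroring and combining marks, and the name `toy_reorder_line` is made up for this example.

```python
import unicodedata


def toy_reorder_line(line):
    """Toy visual reordering of one line with an LTR base direction.

    NOT the real BiDi algorithm: only strong characters and a crude
    rule for neutrals are handled.
    """
    # Step 1: strong direction per character, or None for everything else.
    strong = []
    for ch in line:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            strong.append("R")
        elif cls == "L":
            strong.append("L")
        else:
            strong.append(None)

    # Step 2: a neutral becomes RTL only if the nearest strong character
    # on BOTH sides is RTL; otherwise it follows the LTR base direction.
    resolved = []
    for i, direction in enumerate(strong):
        if direction is None:
            prev = next((d for d in reversed(strong[:i]) if d), "L")
            nxt = next((d for d in strong[i + 1:] if d), "L")
            direction = "R" if prev == nxt == "R" else "L"
        resolved.append(direction)

    # Step 3: reverse every maximal RTL run, keep LTR runs as they are.
    out, i = [], 0
    while i < len(line):
        j = i
        while j < len(line) and resolved[j] == resolved[i]:
            j += 1
        run = line[i:j]
        out.append(run[::-1] if resolved[i] == "R" else run)
        i = j
    return "".join(out)


logical_line = "abc \u05d0\u05d1 \u05d2\u05d3 def"
visual_line = toy_reorder_line(logical_line)
```

In the result, the two Hebrew words and the space between them are reversed as a single run, while the English text on both sides stays in place.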

The process of reordering can be controlled by inserting special Control Characters into the text. The most common ones explicitly define the direction of the text, even if the characters themselves imply otherwise. You can try it yourself: launch Notepad, write some text, then right-click and, in the context menu, go to “Insert Unicode control character” and select one of the available control characters. Experiment with it a little and see how it affects the text from that point on.
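For reference, the directional control characters are ordinary Unicode code points that live inside the string. The sketch below lists the classic ones and wraps a piece of text in an override; the `CONTROLS` table and the helper `force_rtl` are names invented for this example, not part of any library.

```python
import unicodedata

# Classic directional control characters (invisible, zero-width).
CONTROLS = {
    "LRM": "\u200e",  # left-to-right mark
    "RLM": "\u200f",  # right-to-left mark
    "LRE": "\u202a",  # left-to-right embedding
    "RLE": "\u202b",  # right-to-left embedding
    "PDF": "\u202c",  # pop directional formatting (ends LRE/RLE/LRO/RLO)
    "LRO": "\u202d",  # left-to-right override
    "RLO": "\u202e",  # right-to-left override
}


def force_rtl(text):
    """Wrap text so a BiDi-aware renderer treats every character as RTL."""
    return CONTROLS["RLO"] + text + CONTROLS["PDF"]


for short_name, ch in CONTROLS.items():
    print(short_name, f"U+{ord(ch):04X}", unicodedata.name(ch))

forced = force_rtl("abc")
```

The string `forced` still holds `a`, `b`, `c` in logical order; it is only when a BiDi-aware renderer draws it that the override makes the letters appear reversed.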

The BiDi algorithm is a one-way algorithm. There is no reverse algorithm, and running the same algorithm on visually ordered text is not guaranteed to give the correct logically ordered text, since the visual order could be the result of control characters that were lost in the conversion. Thinking deeply about it, you don’t really need to convert back. If you think you do, think again: you’ve got something wrong in your overall design (this is always true, except for the case where you have INPUTS from a legacy device that can’t be altered, and those inputs are ordered VISUALLY. I will try to deal with this case in another post, if requested).
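The loss of information can be shown with a toy renderer that understands only one control pair, the right-to-left override (U+202E) closed by pop directional formatting (U+202C), in otherwise left-to-right text. The function `toy_render` is a made-up illustration and handles nothing else from the real algorithm.

```python
RLO = "\u202e"  # right-to-left override
PDF = "\u202c"  # pop directional formatting


def toy_render(logical):
    """Toy renderer for LTR-only text: reverse RLO...PDF spans, drop the controls."""
    out = []
    override = None  # characters collected inside an open RLO span
    for ch in logical:
        if ch == RLO:
            if override is None:  # nested overrides are ignored in this toy
                override = []
        elif ch == PDF:
            if override is not None:
                out.extend(reversed(override))
                override = None
        elif override is not None:
            override.append(ch)
        else:
            out.append(ch)
    if override is not None:  # an unterminated override runs to end of line
        out.extend(reversed(override))
    return "".join(out)


plain = "cba"                    # typed as c, b, a
overridden = RLO + "abc" + PDF   # typed as a, b, c inside an override

visual_plain = toy_render(plain)
visual_overridden = toy_render(overridden)
```

Both inputs come out as the same visual string, and the control characters are gone from the output, so nothing in the visual text tells you which logical string produced it.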

For further reading, the official Unicode website contains all the Unicode information you need, including the BiDi algorithm.

You can find my implementation of that algorithm, called “NBiDi”, written in C#, targeting .NET and currently being used to implement Right-To-Left Silverlight controls.


1 comment:

  1. Hebrew encoding has always been a challenge over the years, especially here in the States, where you had people writing an email in Visual Hebrew and receiving an email in Logical Hebrew, different software programs reading the mail, and font issues. What a mess. This should have been resolved years ago, but it still lingers around slightly.