A simple Arabic renderer

Richard Suchenwirth 2000-04-27 -- Since 8.1, Tcl offers us Unicodes (cf. Unicode and UTF-8) and thus the potential of working with most writing systems of the world (see The Lish family for my set of transliterators). Sometimes, some additional work is required, though. In the case of Arabic, you can easily express a text as a sequence of character from page 06, but this looks horrible to an Arab reader - running from left to right, all isolated letter forms, etc. The process of "bidi rendering" of Arabic thus needs additional steps:

revert writing direction to right-to-left
don't do this on numbers, though
depending on context (preceding/succeeding letter), for each "abstract" letter select one of two or four shape variants (glyphs) which are happily also in the Unicode, page FE)
substitute punctuations (parentheses go the other way round!)
optionally substitute "Euro-arabic" digits with "Indo-arabics"
line-breaks need much more treatment, otherwise Arabic text seems to run bottom-up!
optionally (if the font allows), use multi-letter ligatures

Here's my simple renderer that does these at least partially (not the line-break stuff, max. one number per call):

proc buckwalter2uc {s} {
# The Buckwalter transliteration is used by Xerox, ArabTex, and documented at
# http://www.cis.upenn.edu/~cis639/arabic/info/buckwalter-about.html

   array set tbl {
      ' \u0621 | \u0622 > \u0623 & \u0624  < \u0625 \} \u0626
      A \u0627 b \u0628 p \u0629 t \u062A  v \u062B j \u062C H \u062D x \u062E
      d \u062F * \u0630 r \u0631 z \u0632  s \u0633 $ \u0634 S \u0635 D \u0636
      T \u0637 Z \u0638 E \u0639 g \u063A  _ \u0640 f \u0641 q \u0642 k \u0643
      l \u0644 m \u0645 n \u0646 h \u0647  w \u0648 Y \u0649 y \u064A
      F \u064B N \u064C K \u064D a \u064e  u \u064F i \u0650 ~ \u0651 o \u0652
      ` \u0670 \{ \u0671
      c \u0634 V \u0630 
   } ;# Abdullah Al-Zaid: c, V
   set res ""
   foreach i [split $s ""] {
        if [info exists tbl($i)] {append res $tbl($i)} else {append res $i}
   }
   set res
}
proc string:revert s {
        set res ""
        foreach i [split $s ""] {
            if {$i==")"} {set i (} elseif {$i=="("} {set i )} 
            set res $i$res
        }
        set res
}

This proc does the real job, especially the glyph selection. This is implemented as as series of regsubs. An abstract character offers its connectivities by commas to left and right. Sequences of two commas after substitution denote the fact that the two surrounding characters are connected. Glyphs are substituted in the order ,,x,, ,,x x,, x. Finally, the commas are removed.

 proc uc:arabchar2glyph {s {arnum ""}} {
    set s2 [list]
    foreach i [eval list [string:revert $s]] {
        if [regexp {[0-9\u0660-\u0669][- .,0-9\u0660-\u0669]*[0-9\u0660-\u0669]} $i] {
            lappend s2 [string:revert $i] 
        } else {
            lappend s2 $i
        }
    } ;# modified: handle more than one number per string correctly
    if [string length $arnum] {
        foreach {i j} {
            0 \u0660 1 \u0661 2 \u0662 3 \u0663 4 \u0664
            5 \u0665 6 \u0666 7 \u0667 8 \u0668 9 \u0669
        } {
            regsub -all "\[$i\]" $s2 $j s2
        } ;# optional: indo-arabic digits
    }        
    foreach {i j} { , \u066C ? \u061F} {
        regsub -all "\[$i\]" $s2 $j s2
    } ;# special characters
    foreach i {
        \u0622 \u0623 \u0624 \u0625 \u0627 \u0629 \u062F \u0630 \u0631 \u0632 
        \u0648 \u0649
    } {        regsub -all $i $s2 $i, s2
    } ;# joining right only
    foreach i {
        \u0626 \u0628 \u062A \u062B \u062C \u062D \u062E \u0633 \u0634 \u0635 
        \u0636 \u0637 \u0638 \u0639 \u063A \u0640 \u0641 \u0642 \u0643 \u0644 
        \u0645 \u0646 \u0647 \u064A 
        \u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 \u0670 \u0671
    } {
            regsub -all $i $s2 ,$i, s2
    } ;# joining both sides
    foreach {i j} {
        \u0622,,\u0644,, \uFEF6,,  \u0622,,\u0644 \uFEF5
        \u0622,, \uFE82 \u0622 \uFE81
        \u0623,,\u0644,, \uFEF8,,  \u0623,,\u0644 \uFEF7
        \u0623,, \uFE84 \u0623 \uFE83
        \u0624,, \uFE86 \u0624 \uFE85
        \u0625,,\u0644,, \uFEFA,,  \u0625,,\u0644 \uFEF9
        \u0625,, \uFE88 \u0625 \uFE87
        ,,\u0626,, ,,\uFE8C,, \u0626,, \uFE8A,, ,,\u0626 ,,\uFE8B \u0626 \uFE89 
        \u0627,,\u0644,, \uFEFC,,  \u0627,,\u0644 \uFEFB
        \u0627,, \uFE8E,, \u0627 \uFE8D
        ,,\u0628,, ,,\uFE92,, \u0628,, \uFE90,, ,,\u0628 ,,\uFE91 \u0628 \uFE8F 
        \u0629,, \uFE94,, \u0629 \uFE93
        ,,\u062A,, ,,\uFE98,, \u062A,, \uFE96,, ,,\u062A ,,\uFE97 \u062A \uFE95 
        ,,\u062B,, ,,\uFE9C,, \u062B,, \uFE9A,, ,,\u062B ,,\uFE9B \u062B \uFE99 
        ,,\u062C,, ,,\uFEA0,, \u062C,, \uFE9E,, ,,\u062C ,,\uFE9F \u062C \uFE9D 
        ,,\u062D,, ,,\uFEA4,, \u062D,, \uFEA2,, ,,\u062D ,,\uFEA3 \u062D \uFEA1 
        ,,\u062E,, ,,\uFEA8,, \u062E,, \uFEA6,, ,,\u062E ,,\uFEA7 \u062E \uFEA5
        \u062F,, \uFEAA,, \u062F \uFEA9
        \u0630,, \uFEAC,, \u0630 \uFEAB
        \u0631,, \uFEAE,, \u0631 \uFEAD
        \u0632,, \uFEB0,, \u0632 \uFEAF
        ,,\u0633,, ,,\uFEB4,, \u0633,, \uFEB2,, ,,\u0633 ,,\uFEB3 \u0633 \uFEB1 
        ,,\u0634,, ,,\uFEB8,, \u0634,, \uFEB6,, ,,\u0634 ,,\uFEB7 \u0634 \uFEB5
        ,,\u0635,, ,,\uFEBC,, \u0635,, \uFEBA,, ,,\u0635 ,,\uFEBB \u0635 \uFEB9 
        ,,\u0636,, ,,\uFEC0,, \u0636,, \uFEBE,, ,,\u0636 ,,\uFEBF \u0636 \uFEBD 
        ,,\u0637,, ,,\uFEC4,, \u0637,, \uFEC2,, ,,\u0637 ,,\uFEC3 \u0637 \uFEC1 
        ,,\u0638,, ,,\uFEC8,, \u0638,, \uFEC6,, ,,\u0638 ,,\uFEC7 \u0638 \uFEC5
        ,,\u0639,, ,,\uFECC,, \u0639,, \uFECA,, ,,\u0639 ,,\uFECB \u0639 \uFEC9 
        ,,\u063A,, ,,\uFED0,, \u063A,, \uFECE,, ,,\u063A ,,\uFECF \u063A \uFECD 
        ,,\u0641,, ,,\uFED4,, \u0641,, \uFED2,, ,,\u0641 ,,\uFED3 \u0641 \uFED1 
        ,,\u0642,, ,,\uFED8,, \u0642,, \uFED6,, ,,\u0642 ,,\uFED7 \u0642 \uFED5
        ,,\u0643,, ,,\uFEDC,, \u0643,, \uFEDA,, ,,\u0643 ,,\uFEDB \u0643 \uFED9 
        ,,\u0644,, ,,\uFEE0,, \u0644,, \uFEDE,, ,,\u0644 ,,\uFEDF \u0644 \uFEDD 
        ,,\u0645,, ,,\uFEE4,, \u0645,, \uFEE2,, ,,\u0645 ,,\uFEE3 \u0645 \uFEE1 
        ,,\u0646,, ,,\uFEE8,, \u0646,, \uFEE6,, ,,\u0646 ,,\uFEE7 \u0646 \uFEE5
        ,,\u0647,, ,,\uFEEC,, \u0647,, \uFEEA,, ,,\u0647 ,,\uFEEB \u0647 \uFEE9 
        \u0648,, \uFEEE,, \u0648 \uFEED
        \u0649,, \uFEF0,, \u0649 \uFEEF
        ,,\u064A,, ,,\uFEF4,, \u064A,, \uFEF2,, ,,\u064A ,,\uFEF3 \u064A \uFEF1 
         
      } {
      if [regsub -all $i $s2 $j s2] {
          #text:add $s2\n
      }
   }
   regsub -all , $s2 "" res
   set res
}
proc ar:ligatures s {
    # input: a rendered Arab Unicode string (context forms)
    # applies those optional ligatures contained in Bitstream Cyberbit
    foreach {from to} {
        \uFEA4\uFEE4\uFEDF \uFD88
        \uFEEA\uFEE0\uFEDF \uFDF2
        \uFEE2\uFE91 \uFC08        \uFEE2\uFE97 \uFC0E
        \uFEF2\uFED3 \uFC32        \uFE9E\uFEDF \uFC3F
        \uFEA2\uFEDF \uFC40        \uFEA6\uFEDF \uFC41
        \uFEE2\uFEDF \uFC42        \uFEF0\uFEDF \uFC43
        \uFEF2\uFEDF \uFC44        \uFEE2\uFEE7 \uFC4E
        \uFEAE\uFE92 \uFC6A        \uFEE6\uFE92 \uFC6D
        \uFEF2\uFE92 \uFC6F        \uFEAE\uFE98 \uFC70
        \uFEE6\uFE98 \uFC73        \uFEF2\uFE98 \uFC75
        \uFEF2\uFEE8 \uFC8F        \uFEAE\uFEF4 \uFC91
        \uFEE6\uFEF4 \uFC94        \uFEA0\uFE91 \uFC9C
        \uFEA4\uFE91 \uFC9D        \uFEA8\uFE91 \uFC9E
        \uFEE4\uFE91 \uFC9F        \uFEA0\uFE97 \uFCA1
        \uFEA4\uFE97 \uFCA2        \uFEA8\uFE97 \uFCA3
        \uFEE4\uFE97 \uFCA4        \uFEE4\uFE9B \uFCA6
        \uFEE4\uFE9F \uFCA8        \uFEE4\uFEA3 \uFCAA
        \uFEE4\uFEA7 \uFCAC        \uFEE4\uFEB3 \uFCB0
        \uFEA0\uFEDF \uFCC9        \uFEA4\uFEDF \uFCCA
        \uFEA8\uFEDF \uFCCB        \uFEE4\uFEDF \uFCCC
        \uFEEC\uFEDF \uFCCD        \uFEA0\uFEE3 \uFCCE
        \uFEA4\uFEE3 \uFCCF        \uFEA8\uFEE3 \uFCD0
        \uFEE4\uFEE3 \uFCD1        \uFEA0\uFEE7 \uFCD2
        \uFEA4\uFEE7 \uFCD3        \uFEA8\uFEE7 \uFCD4
        \uFEE4\uFEE7 \uFCD5        \uFEA0\uFEF3 \uFCDA
        \uFEA4\uFEF3 \uFCDB        \uFEA8\uFEF3 \uFCDC
        \uFEE4\uFEF3 \uFCDD        \uFEE4\uFEB7 \uFD30
    } {
        regsub -all $from $s $to s
    }
    set s
}

And finally, a nice wrapper that supplies a default test text, and handles the -digits switch for Indo-arabic numbers, and -lig for ligatures:

proc arblish args {
     set convertdigits ""
     set ligatures 0    
     if {$args==""} {set args "bsm Allh AlrHmn AlrHym"}
     if [regsub -- -dig(its)? $args "" args] {set convertdigits -digits}
     if [regsub -- -lig $args "" args] {incr ligatures}
     set res [uc:arabchar2glyph [buckwalter2uc $args] $convertdigits]
     if $ligatures {set res [ar:ligatures $res]}
     set res
}

... and an even briefer wrapper around that wrapper:

proc ar args {eval arblish $args}

Now you're set for an Arblish example. Read that file into a string, subst it, and voila!

54293 % arblish "bsm Allh AlrHmn AlrHym"
&#65162;&#65252;&#65268;&#65187;&#65198;&#65247;&#65165; &#65254;&#65252;&#65187;&#65198;&#65247;&#65165; &#65258;&#65248;&#65247;&#65165; &#65250;&#65204;&#65170;

This is the Tcl Wiki, but I have to point out that Roman Czyborra has a nice renderer in Perl at http://czyborra.com/arabjoin/arabjoin

VK 23-mar-2005 I very doubt this conforms to Unicode Standard Annex #9 - The Bidirectional Algorithm at [1], I very doubt it does things right. - RS: It's an approximation. The multiple nesting discussed in the Bidi algorithm isn't provided. But concrete bug reports are always welcome :)

See also:

Arabic Character Renderer For Readability In TCL/Tk

Category Application

Category Characters

Category Human Language