KBK 2007-06-29 The problem with the remove diacritic page is that the testing for "valid" UTF-8 is intentionally overzealous. When I reviewed the damaged pages, a great many of them contained the dreaded "double encoding" - ISO8859-1 expanded to UTF-8, with the result interpreted as ISO8859-1 and expanded to UTF-8 a second time. The result of this "double encoding" is that a character such as é (\u00e9) would be expanded into the two-byte UTF-8 sequence C3 A9, and that sequence would be interpreted as the spurious combination \u00c3\u00a9. The page in question was, as far as I can tell, the only case of either of the characters \u00c2 (upper-case Latin letter A with circumflex) and \u00c3 (upper-case Latin letter A with tilde) appearing on the Wiki other than as the result of this process; these two characters are extremely uncommon even in natural languages that use them. (French, for instance, often omits accents from capital letters other than É.) So it seemed wise to reject these two characters, rather than having, say, broken browsers silently convert ü to the presumptively valid pair of characters \u00c3\u00bc (upper-case Latin letter A with tilde followed by the vulgar fraction ¼).

Given the large number of browsers out there that appear to get it wrong, I really don't know what else to do. I'm open to suggestions.

LV Perhaps in the case where there is a possibility of a character being correct, the user should be prompted with an "are you certain" type prompt.

Lars H: Try adding a hidden field (like the O field used for page versions to detect edit conflicts) to the edit page form, which contains some non-ASCII characters (e.g. those occurring in the page already). If the browser gets it wrong for the text to edit, there's a fair chance it gets all form fields wrong in the same way. Since the server can know what went out in this extra field, it can verify that it gets the same thing back.

Hmm...
Looking at the code for this edit, there is a hidden item named _charset_ which doesn't appear to have any value:
    <input type='hidden' name='_charset_'>

Is this an incomplete implementation of the idea I propose?

Lars H: My edit #124 was bad -- attempting repair. Oddly, this browser (Safari) didn't have the encoding problem with the old Wiki.

Lars H: Edit trying to diagnose encoding problem. Will surely disturb the contents further.
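The double encoding KBK describes, his rejection of \u00c2 and \u00c3, and the hidden-field check Lars H suggests can be sketched in a few lines of Tcl. This is only an illustration under stated assumptions: the proc names (`misreadAsLatin1`, `undoMisread`, `looksDoubleEncoded`, `canaryIntact`) are hypothetical and are not taken from the Wiki's actual code.

```tcl
# Hypothetical helper: simulate one round of damage by encoding the text
# as UTF-8 and then misreading those bytes as ISO8859-1 characters.
proc misreadAsLatin1 {text} {
    set bytes [encoding convertto utf-8 $text]
    return [encoding convertfrom iso8859-1 $bytes]
}

# Hypothetical helper: reverse one round of damage. Only meaningful when
# the input really is the result of misreadAsLatin1.
proc undoMisread {text} {
    set bytes [encoding convertto iso8859-1 $text]
    return [encoding convertfrom utf-8 $bytes]
}

# Heuristic from the discussion: any \u00c2 or \u00c3 is treated as suspect.
proc looksDoubleEncoded {text} {
    return [regexp "\[\u00c2\u00c3\]" $text]
}

# Hypothetical server-side check for the hidden-field idea: the value the
# browser posts back must equal the value that was sent out.
proc canaryIntact {sent received} {
    return [string equal $sent $received]
}

set original "\u00fc"                      ;# u with diaeresis
set damaged  [misreadAsLatin1 $original]   ;# \u00c3\u00bc

puts [looksDoubleEncoded $original]        ;# 0
puts [looksDoubleEncoded $damaged]         ;# 1
puts [canaryIntact $original $damaged]     ;# 0
puts [string equal $original [undoMisread $damaged]]   ;# 1
```

A canary comparison of this kind would not depend on \u00c2 and \u00c3 being rare in legitimate text, which is the weak point of the character-rejection heuristic.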
jdc 29-nov-2007 : I used the following script on the wiki database to detect invalid UTF-8 sequences:

    lappend auto_path /home/decoster/tcl/Wub/Utilities
    package require Mk4tcl
    package require utf8

    # Open the wiki database and check every page.
    mk::file open db wikit.tkd
    mk::loop i db.pages {
        lassign [mk::get $i name page] name page
        # Scan the page data for the first bad UTF-8 position.
        set data [encoding convertto identity $page]
        set point [utf8::findbad $data]
        if { $point >= 0 && $point < [string length $page] - 1 } {
            # Emit wiki markup: a page link, then the text leading up to the bad position.
            puts "\[$name\] at position $point:"
            puts "======"
            puts [encoding convertfrom identity [string range $data [expr {$point-50}] [expr {$point}]]]
            puts "======"
        }
    }
    mk::file close db
    exit

This reported the following pages:
    bad utf8: db.pages!2957 / 9075
    bad utf8: db.pages!2987 / 2143
    bad utf8: db.pages!4588 / 5130
    bad utf8: db.pages!8410 / 292
    bad utf8: db.pages!8442 / 5608
    bad utf8: db.pages!8788 / 886
    bad utf8: db.pages!9112 / 4925
    bad utf8: db.pages!9281 / 554
    bad utf8: db.pages!12169 / 4736
    bad utf8: db.pages!14525 / 2935
    bad utf8: db.pages!15412 / 4059
    bad utf8: db.pages!15599 / 3036
    bad utf8: db.pages!19658 / 310
    bad utf8: db.pages!19693 / 9485
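For readers who do not have the utf8 package from Wub at hand, the kind of scan involved can be approximated in plain Tcl. The following is a rough sketch, not the implementation of `utf8::findbad`: the proc name `findBadUtf8` is hypothetical, the "-1 when nothing is wrong" return convention is an assumption modelled on the `$point >= 0` test in the script above, and the check is simplified (it does not reject every overlong or surrogate form). It assumes Tcl 8.5 or later for the `u` flag of `binary scan`.

```tcl
# Hypothetical stand-in for a "find first bad UTF-8 byte" scan.
# Takes a byte string and returns the index of the first byte that cannot
# start or continue a well-formed sequence, or -1 if none is found.
proc findBadUtf8 {bytes} {
    set codes {}
    binary scan $bytes cu* codes          ;# list of unsigned byte values
    set n [llength $codes]
    set i 0
    while {$i < $n} {
        set b [lindex $codes $i]
        # Decide how many continuation bytes the lead byte requires.
        if {$b < 0x80} {
            set need 0
        } elseif {$b >= 0xC2 && $b <= 0xDF} {
            set need 1
        } elseif {$b >= 0xE0 && $b <= 0xEF} {
            set need 2
        } elseif {$b >= 0xF0 && $b <= 0xF4} {
            set need 3
        } else {
            return $i                     ;# stray continuation or invalid lead
        }
        # Every continuation byte must exist and look like 10xxxxxx.
        for {set k 1} {$k <= $need} {incr k} {
            set c [lindex $codes [expr {$i + $k}]]
            if {$c eq "" || ($c & 0xC0) != 0x80} {
                return $i
            }
        }
        incr i [expr {$need + 1}]
    }
    return -1
}

puts [findBadUtf8 [encoding convertto utf-8 "caf\u00e9"]]   ;# -1
puts [findBadUtf8 [binary format H* 636166e92078]]          ;# 3
```

Note that double-encoded text is itself well-formed UTF-8 (ü becomes the bytes C3 83 C2 BC), so a byte-level scan like this one finds only truncated or otherwise malformed sequences, not double encoding.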
LV Any way for the above code to display a bit of context - or is there some option in the various web browsers to display what character of the page is being displayed? It's just tough to figure out what needs to be fixed with the info here. And is that utf8 package available here on the wiki some place?

jdc I updated the script so it generates wiki markup you can paste here for easier access. An example:

timeentry at position 18117:

    pace behavior, I suppose I could hold off on the 00
DKF - The above pages are:

- tkWorld 0.2 fixed
- Tclworld fixed
- Oratcl Logon Dialog fixed
- Traffic lights fixed
- iFile: a little file system browser fixed
- iRead: a Gutenberg eBook reader fixed
- Steve Redler IV fixed
- A triangle toy fixed
- timeliner fixed
- XO fixed
- Extending the eTcl console fixed
- colorChooser for pocketPC/etcl fixed
- Wiki UTF-8 problem test fixed
- Newton-Raphson Equation Solver fixed