set features {
de {ä ö ü ß ß ch ck sch ei en ge ung w}
en {th y sh ch in ing ed en ns rs w with from and}
es {á é ión dad es ll ch qu j ya os as üe con ya}
fr {é é ê è ch ei eu iè oi x y qu au ou de d' et ir te}
is {á í ú ó ý é þ æ ð ö}
ga {an á bh bh dh gc iai iai mb mh n- ó uai uai ú}
it {è è ù da " e " re di il gl io ione ioni cc cch z tt una qu}
ms {ah ak j jan ke ngan pu se uan ang ber ku}
nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
pt {lhã lhõ lha lhe lh nha nhã nho nhõ nhe nt ndo de ão ões ss rr os as ch ça ção ções ico ns apro}
}
# LES gets carried away and suggests:
# pt {" de " " do " " da " " os " " as " " e " " que " " em " " nos " " nas " " na " "de " "te " " se " " às " " aos " " com " lhã lhõ lha lhe lh nha nhã nho nhõ nhe ndo ão õe ãe ssa sse ssi sso ssu rra rre rri rro rru ch ça ção ções ico}
# ''(removed: de nt os as ns apro ss ões)''

The identification code itself is fairly straightforward:

proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            # count every occurrence of the pattern, ignoring case
            incr score [regexp -all -nocase $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    # best-scoring language first
    return [lsort -decreasing -integer -index 1 $res]
}

But most important is the testing. I collected sample phrases from food packages, which are often quite multilingual in Germany, and from other sources. I kept adding sample strings and tuning the feature set above until the target language came first most of the time (short strings can't always be identified correctly):

KPV: If you need examples in various unusual languages, you could use Google's advanced search page and specify different languages to search for.

set ntests 0
set nproblems 0
set score 0
foreach {language string} {
de {Dies ist ein Beispiel für einen deutschen Satz}
de {Knusprige Weizenflocken mit Schokoladengeschmack}
de {Waffeln mit feiner Haselnusscremefüllung}
de {Vor dem Öffnen bitte schütteln}
de {Trocken lagern, vor Licht schützen}
de {Sicherheitsinformationen für Netzkabel und Zubehör}
de {Auf dieses Produkt wird eine zwölfmonatige Garantie gegeben}
en {This is an example for an English sentence}
en {Wafers filled with hazelnut creme}
en {Store in a dry place, protect from light}
en {Safety precautions for power cords and accessories}
en {This product is warranted for the period of twelve months}
en {Worldwide telephone numbers}
en {For continuous quality improvement, calls may be monitored}
es {Él que no espera vencer, ya está vencido}
es {Copos de trigo tostados con chocolate}
es {Barquillos rellenos de crema de avellanas}
es {Precauciones de seguridad para cables de alimentación y accesorios}
es {Este producto está garantizado por un período de doce meses}
es {Esta garantía no cubre ninguno de los siguientes casos}
es {En caso necesario, la lista de nuestros Servicios Autorizados
está disponible}
fr {Voilà un autre exemple pour une phrase francaise}
fr {Gaufrettes fourrées à la noisette}
fr {Agiter avant d'ouvrir}
fr {Conserver au réfrigerateur une fois ouvert et consommer dans les
jours qui suivent}
fr {Protéger contre la lumière}
fr {A consommer de préférence avant fin:}
fr {Précautions de sécurité concernant les cordons d'alimentation}
fr {Cet appareil est couvert par une garantie de douze mois}
ga {Eolas an Chuairteora do Láithreáinin Oidhreachta}
ga {Tá roinnt bríonna leis an bhfocal Dúchas}
ga {Tugann na milliúin daione cuairt ar ár n-ionaid gach bliain}
ga {Tá na hAmanna oscailte sa bhfoilseachán seo i gceart ag am priondála}
ga {Ionad Cuairteora Pháirc an Fhionnuisce}
ga {Tógadh an caisleán le linn na mblianta 1870-73}
ga {Deirtear gur tógadh an foirgneamh is sine anseo 400 bliain ó shin}
it {Questo è un altro esempio per una frase italiana}
it {Fiocchi di frumento al cioccolato}
it {Wafers ripieni di crema alla nocciola}
it {Agitare prima di aprire}
it {Da consumare preferibilmente entro il:}
it {Una volta aperto tenere in frigo e consumare entro qualche giorno}
it {Precauzioni relative alla sicurezza per i cavi di alimentazione}
it {Questo prodotto è garantito per un periodo di dodici mesi}
it {Se non è vero, è ben trovato}
ms {Pendahuluan cetak yang dibarahui}
ms {Dia pun hendak ikut saya ke kedai}
ms {Orang itu pun membuat kerjanya dengan cepat}
ms {Pukul sembilan setengah malam}
ms {Jepun pun kalah dalam pertandingan bolasepak Piala Merdeka}
ms {Meja ini baik, tetapi meja itu pun baik juga}
ms {Cik Pun pun turut serta dalam pertandingan itu}
nl {Het is tijd om op te staan, vandaag is het zaterdag}
nl {Wafeltjes met hazelnootcrèmevulling}
nl {Tegen licht beschermen. Tenminste houdbaar tot einde:}
nl {Dit produkt is gegarandeerd voor een periode van twaalf maanden}
nl {De garantie is alleen geldig wanneer de garantiekaart volledig is ingevuld}
nl {In probleemgevallen kunt U nadere informatie verkrijgen}
nl {Reparaties onder garantie moeten door servicecentra worden uitgevoerd}
pt {Alguns quatrilhões de ítens de informação, formando amostragens de}
pt {Essas pulsações eram armazenadas em mnemocircuitos idênticos}
pt {Na praça da cidade, a fila tinha se formado às cinco da manhã com os}
pt {sentido humorístico. Tinha um fim. Estava armado da lista telefônica.}
pt {aproximar da Estação Ether e poderei aproveitar a caminhada para}
} {
    incr ntests
    # flatten the ranked pairs: language0 score0 language1 score1 ...
    set res [join [lang'identify $string $features]]
    if {[lindex $res 0] ne $language} {
        puts "$string\n $res **** should have been: $language"
        incr nproblems
    } elseif {[llength $res]==2} {
        # only one candidate: credit its full score
        incr score [lindex $res 1]
    } elseif {[lindex $res 1] > [lindex $res 3]} {
        # credit the lead over the runner-up
        incr score [expr {[lindex $res 1]-[lindex $res 3]}]
    } else {
        puts "$string\n $res **** ambiguous: $language"
        incr nproblems
    }
}
puts "score: [expr {30*$score/$ntests}] Passed:[expr {$ntests-$nproblems}]/$ntests"

GS (040612) Some hints to go further at the Gertjan van Noord web site [1].
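
The test loop above rewards a correct identification with the lead of the winner over the runner-up, or with the full score when only one language matched. As a rough sketch, that margin could be pulled out into a helper of its own. The name lang'margin is hypothetical and not part of the original code, and unlike the loop, which reports a tie as a problem, this sketch simply returns 0 for one:

```tcl
# Hypothetical helper (not in the original page): margin between the
# best and the second-best score. $ranked is a list of {language score}
# pairs, highest score first, in the shape lang'identify returns.
proc lang'margin {ranked} {
    switch -- [llength $ranked] {
        0       {return 0}
        1       {return [lindex $ranked 0 1]}
        default {return [expr {[lindex $ranked 0 1] - [lindex $ranked 1 1]}]}
    }
}

# Ranked results quoted in the discussion on this page:
puts [lang'margin {{ms 5} {en 4} {ga 4} {nl 2} {it 1}}]  ;# lead of 5 over 4
puts [lang'margin {{ms 9}}]                               ;# sole candidate
puts [lang'margin {{en 2} {ms 2} {nl 1}}]                 ;# tie at the top
```

A small margin is a hint that a string is too short, or that the feature lists of the two leading languages overlap too much.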
slebetman: The opening line of this common Malay rhyme fails the test and is identified as English, with a result of {en 2} {nl 1}:

set teststring {dua tiga kucing berlari}

Adding "ber" and "ku" to the list of ms identifiers improves the match, giving {en 2} {ms 2} {nl 1}. However, the full rhyme passes the original test with {ms 5} {en 4} {ga 4} {nl 2} {it 1}. Here's the full rhyme:

set teststring {
dua tiga kucin berlari,
mana nak sama si kucing belang,
dua tiga boleh ku cari,
mana nak sama si adik seorang.
}

The addition of "ber" and "ku" improves the result further, to {ms 9}. So I added them to the features list above.

You can get UTF-8 samples of a boatload of languages here: http://unicode.org/udhr/
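
The experiment with "ber" and "ku" can be repeated for any candidate pattern before it goes into the real list. The following is only a sketch under stated assumptions: the feature set named demo is a deliberately tiny, made-up one (not the tuned set above), and the proc repeats the scoring idea of lang'identify so that the snippet runs on its own. It relies on the feature list also being a valid dict, so that dict lappend (Tcl 8.5 or later) can extend one language in place:

```tcl
# Same scoring idea as lang'identify above, repeated here so that the
# snippet is self-contained.
proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    return [lsort -decreasing -integer -index 1 $res]
}

# Made-up miniature feature set, for illustration only.
set demo {
    en {in ing th}
    ms {ah ak ke ang}
}
set line {dua tiga kucing berlari}

puts "before: [lang'identify $line $demo]"

# Try two extra patterns for ms and score the same line again.
dict lappend demo ms ber ku
puts "after:  [lang'identify $line $demo]"
```

With this miniature set, ms does not appear at all in the first result and draws level with en in the second, which mirrors the effect described above for the full feature list.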


