    set features {
        de {ä ö ü ß ß ch ck sch ei en ge ung w}
        en {th y sh ch in ing ed en ns rs w with from and}
        es {á é ión dad es ll ch qu j ya os as üe con ya}
        fr {é é ê è ch ei eu iè oi x y qu au ou de d' et ir te}
        is {á í ú ó ý é þ æ ð ö}
        ga {an á bh bh dh gc iai iai mb mh n- ó uai uai ú}
        it {è è ù da " e " re di il gl io ione ioni cc cch z tt una qu}
        ms {ah ak j jan ke ngan pu se uan ang ber ku}
        nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
        pt {lhã lhõ lha lhe lh nha nhã nho nhõ nhe nt ndo de ão ões ss rr os as ch ça ção ções ico ns apro}
    }
    # LES gets carried away and suggests:
    # pt {" de " " do " " da " " os " " as " " e " " que " " em " " nos " " nas " " na " "de " "te " " se " " às " " aos " " com " lhã lhõ lha lhe lh nha nhã nho nhõ nhe ndo ão õe ãe ssa sse ssi sso ssu rra rre rri rro rru ch ça ção ções ico}
    # ''(removed: de nt os as ns apro ss ões)''

The identification code itself is fairly straightforward:
    proc lang'identify {string features} {
        set res {}
        foreach {language regexps} $features {
            # Score = total number of matches of all of this language's patterns
            set score 0
            foreach regexp $regexps {
                incr score [regexp -all -nocase $regexp $string]
            }
            # Languages without a single match are left out of the result
            if {$score} {lappend res [list $language $score]}
        }
        # Best-scoring language first
        return [lsort -decreasing -integer -index 1 $res]
    }

But most important is the testing. I collected sample phrases from food packages, which are often pretty multilingual in Germany, and from other sources. I kept adding sample strings and tuning the above feature set until the target language came first most of the time (short strings can't always be identified correctly):

KPV: If you need examples in various unusual languages, you could use Google's advanced search page and specify different languages to search for.
    set ntests 0
    set nproblems 0
    set score 0
    foreach {language string} {
        de {Dies ist ein Beispiel für einen deutschen Satz}
        de {Knusprige Weizenflocken mit Schokoladengeschmack}
        de {Waffeln mit feiner Haselnusscremefüllung}
        de {Vor dem Öffnen bitte schütteln}
        de {Trocken lagern, vor Licht schützen}
        de {Sicherheitsinformationen für Netzkabel und Zubehör}
        de {Auf dieses Produkt wird eine zwölfmonatige Garantie gegeben}
        en {This is an example for an English sentence}
        en {Wafers filled with hazelnut creme}
        en {Store in a dry place, protect from light}
        en {Safety precautions for power cords and accessories}
        en {This product is warranted for the period of twelve months}
        en {Worldwide telephone numbers}
        en {For continuous quality improvement, calls may be monitored}
        es {Él que no espera vencer, ya está vencido}
        es {Copos de trigo tostados con chocolate}
        es {Barquillos rellenos de crema de avellanas}
        es {Precauciones de seguridad para cables de alimentación y accesorios}
        es {Este producto está garantizado por un período de doce meses}
        es {Esta garantía no cubre ninguno de los siguientes casos}
        es {En caso necesario, la lista de nuestros Servicios Autorizados está disponible}
        fr {Voilà un autre exemple pour une phrase francaise}
        fr {Gaufrettes fourrées à la noisette}
        fr {Agiter avant d'ouvrir}
        fr {Conserver au réfrigerateur une fois ouvert et consommer dans les jours qui suivent}
        fr {Protéger contre la lumière}
        fr {A consommer de préférence avant fin:}
        fr {Précautions de sécurité concernant les cordons d'alimentation}
        fr {Cet appareil est couvert par une garantie de douze mois}
        ga {Eolas an Chuairteora do Láithreáinin Oidhreachta}
        ga {Tá roinnt bríonna leis an bhfocal Dúchas}
        ga {Tugann na milliúin daione cuairt ar ár n-ionaid gach bliain}
        ga {Tá na hAmanna oscailte sa bhfoilseachán seo i gceart ag am priondála}
        ga {Ionad Cuairteora Pháirc an Fhionnuisce}
        ga {Tógadh an caisleán le linn na mblianta 1870-73}
        ga {Deirtear gur tógadh an foirgneamh is sine anseo 400 bliain ó shin}
        it {Questo è un altro esempio per una frase italiana}
        it {Fiocchi di frumento al cioccolato}
        it {Wafers ripieni di crema alla nocciola}
        it {Agitare prima di aprire}
        it {Da consumare preferibilmente entro il:}
        it {Una volta aperto tenere in frigo e consumare entro qualche giorno}
        it {Precauzioni relative alla sicurezza per i cavi di alimentazione}
        it {Questo prodotto è garantito per un periodo di dodici mesi}
        it {Se non è vero, è ben trovato}
        ms {Pendahuluan cetak yang dibarahui}
        ms {Dia pun hendak ikut saya ke kedai}
        ms {Orang itu pun membuat kerjanya dengan cepat}
        ms {Pukul sembilan setengah malam}
        ms {Jepun pun kalah dalam pertandingan bolasepak Piala Merdeka}
        ms {Meja ini baik, tetapi meja itu pun baik juga}
        ms {Cik Pun pun turut serta dalam pertandingan itu}
        nl {Het is tijd om op te staan, vandaag is het zaterdag}
        nl {Wafeltjes met hazelnootcrèmevulling}
        nl {Tegen licht beschermen. Tenminste houdbaar tot einde:}
        nl {Dit produkt is gegarandeerd voor een periode van twaalf maanden}
        nl {De garantie is alleen geldig wanneer de garantiekaart volledig is ingevuld}
        nl {In probleemgevallen kunt U nadere informatie verkrijgen}
        nl {Reparaties onder garantie moeten door servicecentra worden uitgevoerd}
        pt {Alguns quatrilhões de ítens de informação, formando amostragens de}
        pt {Essas pulsações eram armazenadas em mnemocircuitos idênticos}
        pt {Na praça da cidade, a fila tinha se formado às cinco da manhã com os}
        pt {sentido humorístico. Tinha um fim. Estava armado da lista telefônica.}
        pt {aproximar da Estação Ether e poderei aproveitar a caminhada para}
    } {
        incr ntests
        # Flatten {{lang score} ...} into "lang score lang score ..."
        set res [join [lang'identify $string $features]]
        if {[lindex $res 0] ne $language} {
            puts "$string\n $res **** should have been: $language"
            incr nproblems
        } elseif {[llength $res]==2} {
            # Only one language matched: its score counts in full
            incr score [lindex $res 1]
        } elseif {[lindex $res 1] > [lindex $res 3]} {
            # Clear winner: count the margin over the runner-up
            incr score [expr {[lindex $res 1]-[lindex $res 3]}]
        } else {
            puts "$string\n $res **** ambiguous: $language"
            incr nproblems
        }
    }
    puts "score: [expr {30*$score/$ntests}] Passed:[expr {$ntests-$nproblems}]/$ntests"
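The test loop above depends on the shape of the value that lang'identify returns: a list of {language score} pairs, best first, which `join` flattens so that indices 0, 1 and 3 are the winner, its score and the runner-up's score. Here is a minimal sketch of that, using one of the English samples. The `miniFeatures` table is an assumption for illustration only: it holds just the de and en rows copied from the features table, and the proc is repeated so the snippet runs on its own.

```tcl
# lang'identify as given on this page, repeated so the sketch is self-contained.
proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    return [lsort -decreasing -integer -index 1 $res]
}

# Assumption: a two-language excerpt of the features table, for illustration only.
set miniFeatures {
    de {ä ö ü ß ß ch ck sch ei en ge ung w}
    en {th y sh ch in ing ed en ns rs w with from and}
}

set sample {This product is warranted for the period of twelve months}
set ranked [lang'identify $sample $miniFeatures]

puts $ranked          ;# {en 6} {de 2}
puts [join $ranked]   ;# en 6 de 2
```

English collects three hits for "th", one for "ed" and two for "w"; German only gets the two "w" hits, so the margin that the test loop would add to the score here is 4.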
GS (040612): Some hints for going further can be found on Gertjan van Noord's web site [1].
slebetman: The opening line of this common Malay rhyme fails the test and is identified as English, with a score of {en 2} {nl 1}:
    set teststring {dua tiga kucing berlari}

Adding "ber" and "ku" to the list of ms identifiers improves the match, with a score of {en 2} {ms 2} {nl 1}. However, the full rhyme passes the original test with a score of {ms 5} {en 4} {ga 4} {nl 2} {it 1}. Here's the full rhyme:
    set teststring {
        dua tiga kucing berlari,
        mana nak sama si kucing belang,
        dua tiga boleh ku cari,
        mana nak sama si adik seorang.
    }

The addition of "ber" and "ku" improves the result further, with a score of {ms 9}. So I added them to the features list above.
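This kind of before/after comparison can be tried without editing the features table itself. The following is only a sketch: `lang'addFeatures` is a hypothetical helper (not part of the code above) that returns a copy of a feature table with extra patterns appended to one language, and `base` is an assumed three-language excerpt holding the ms row as it was before "ber" and "ku" were added. It needs Tcl 8.5 or later for `dict` and `{*}`.

```tcl
# lang'identify as given on this page, repeated so the sketch is self-contained.
proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    return [lsort -decreasing -integer -index 1 $res]
}

# Hypothetical helper: return a copy of the feature table with the given
# patterns appended to one language's list. The caller's table is untouched
# because the argument is passed by value.
proc lang'addFeatures {features language args} {
    dict lappend features $language {*}$args
    return $features
}

# Assumption: excerpt of the table, with the ms row as it was before the change.
set base {
    en {th y sh ch in ing ed en ns rs w with from and}
    ms {ah ak j jan ke ngan pu se uan ang}
    nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
}

set line {dua tiga kucing berlari}
set before [lang'identify $line $base]
set after  [lang'identify $line [lang'addFeatures $base ms ber ku]]

puts "before: $before"   ;# before: {en 2} {nl 1}
puts "after:  $after"    ;# ms now appears, with a score of 2
```

Trying candidate patterns on a copy like this makes it easy to rerun the whole sample set before committing a change to the real table, since a pattern that helps one language can also add hits for strings in the others.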
You can get UTF-8 samples of a boatload of languages here: http://unicode.org/udhr/