Null or inconsistent search results using Khmer script
Closed, Resolved (Public)

Description

This idea was submitted to the current Inspire Campaign, which focuses on new readers. It describes an issue that prevents users in Cambodia from reliably searching Wikipedia in their native written language, Khmer. Here is an excerpt from the idea page itself (bolded text is from me) describing the issue in more detail:

For example, a word meaning 'eat' is: ញ៉ាំ pronounced nyarm.
It can be written with the following keystrokes. Note that the script on the left looks the same regardless. Note also that uppercase is achieved by SHIFT + keystroke, so the symbol " is created by SHIFT + '
ញ៉ាំ J"am -- Note that only this first spelling gets any results on Wikipedia, for the definition in the sister project Wiktionary.
ញុាំ JuaM
ញាុំ JauM
ញំុា JMua
ញំាុ JMau
Even though the scripts on the left look the same, if they are pasted into Wikipedia Search as search terms, each version will generate completely different results. I am guessing that this is because Wikipedia indexes and searches based on the Unicode sequence, not the resulting script.
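
To make that guess concrete, here is a minimal Python sketch that prints the code point sequence behind each spelling. The escape sequences are an assumption reconstructed from the keystrokes listed above, and `SPELLINGS` and `describe` are illustrative names rather than anything from the search code.

```python
import unicodedata

# Code point sequences assumed from the keystrokes listed above.
SPELLINGS = {
    'J"am': "\u1789\u17c9\u17b6\u17c6",
    "JuaM": "\u1789\u17bb\u17b6\u17c6",
    "JauM": "\u1789\u17b6\u17bb\u17c6",
    "JMua": "\u1789\u17c6\u17bb\u17b6",
    "JMau": "\u1789\u17c6\u17b6\u17bb",
}


def describe(text):
    """Return the code points of `text` as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


for keys, text in SPELLINGS.items():
    print(keys, describe(text))

# Count how many distinct strings remain after standard NFC normalization.
nfc_forms = {unicodedata.normalize("NFC", t) for t in SPELLINGS.values()}
print(len(nfc_forms), "distinct strings after NFC")
```

The last four spellings contain the same code points in a different order, so a search that compares code point sequences treats them as different words. The final count shows whether standard normalization alone would merge them.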

To provide some concrete examples with links to search results on Khmer Wikipedia (km.wikipedia.org):

  • Search using ញ៉ាំ (one way to write the word for "eat"): search 1, providing appropriate results
  • Search using ញាុំ (another way to write the word for "eat"): search 2, providing one search result not relevant to the meaning of the search term
  • Search using ញំាុ (another way to write the word for "eat"): search 3, a null result.

Event Timeline

Adding Trey (@TJones) to this task in case any related search work might be useful in ascertaining the cause of the problem or what approaches might help address it.

Screenshot of the task description of this very task; Firefox 58 on Fedora 27:

Screenshot from 2018-01-25 18-45-42.png (293×342 px, 25 KB)

Developer Tools output for <p> element in the "Fonts" section says (apart from "Lato" font for Latin script):

Khmer OS System system
Used as: "Khmer OS System"

$:acko\> rpm -qa | grep khmer
khmeros-fonts-common-5.0-23.fc27.noarch
khmeros-base-fonts-5.0-23.fc27.noarch

Yeah, this is definitely relevant to CirrusSearch. I did a quick review of the Wikipedia page on the Khmer script, and dug into one of the sources (Huffman), and "dictionary order" is... complicated. By chance I happened to be sitting 10 feet from @Aklapper when he commented on this, and on his computer they do all look the same (see screenshot above). On my computer (Mac, with a few Khmer fonts installed), they look different!

Screen Shot 2018-01-25 at 6.45.46 PM.png (185×138 px, 15 KB)

The Google Noto Khmer fonts do even weirder things, like not accepting some of the orderings and so not combining the characters:

Screen Shot 2018-01-25 at 6.49.35 PM.png (203×202 px, 22 KB)

The solution, if people writing in Khmer do not use any canonical ordering of the characters, would be to re-order them according to some standard and index and search with that. That is very much complicated by the fact that Khmer doesn't require spaces between words. This seems to be more than we can handle straightforwardly. I'll look to see if there are any good tools out there.

I'm on OS X 10.10.5, and I've discovered that the rendering of Khmer script varies by font, but even more so the support for the advanced properties of Khmer fonts varies widely by application.

I found and installed the "khmeros" fonts @Aklapper has (thanks!!), which can be found here.

In TextEdit various fonts look like this (click for larger images):

Screen Shot 2018-01-26 at 11.06.46 AM.png (270×711 px, 50 KB)

While in Chrome they look like this:

Screen Shot 2018-01-26 at 11.06.33 AM.png (473×863 px, 74 KB)

Firefox looks more or less like Chrome, but with poorer line spacing. Safari looks like TextEdit, and both are unusable.

The Google Noto fonts do reasonably well, but the "au" order isn't rendered quite right, while the "ua" order is. The Khmer OS fonts look good.

So, anyone who is not already familiar with Khmer computing should find some combination of fonts and applications that allows them to render the examples above correctly. For OS X, the Khmer OS fonts and Chrome seem to do the right thing (and after disabling the fonts that don't work, Phab is now showing things correctly!!).

(Not any closer to a solution, but at least now I have a better handle on the problem.)

Might the solution be as simple as implementing a Khmer spell checker in the contribution text area of the wiki that detects Khmer script, and also a Khmer spell checker in the search box?

Even a spell checker of average quality might be better than none.

Another or complementary option might be to create an algorithm to extend all search terms into all possible spelling sequences, and then combine all of their results into the one results page. This algorithm could be based off the Unicode rendering rules (if such a thing exists) so that only versions that render the same are included in the extended list. By 'render the same' I mean to use the fact that only some character sequences function correctly. Incorrect character sequences create placeholders or a sequence of characters that is visually different.

ប្រើ (bjr;) looks different to បើ្រ (b;jr), and ើប្រ (;bjr) so would not need to be included in extended search terms. This is because the user could visually see that they have the incorrect key-press sequence by looking at the represented character sequence.
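
A rough Python sketch of the expansion idea, for a single syllable only. It assumes the syllable has already been split into a base consonant and its marks, and it does not attempt the "renders the same" filter described above; `expand_syllable` is a hypothetical helper name.

```python
from itertools import permutations


def expand_syllable(base, marks):
    """Return every ordering of `marks` after `base`, as a sorted list."""
    return sorted({base + "".join(p) for p in permutations(marks)})


# The base consonant and the three marks from the u/a/M spellings above
# (code points assumed from the listed keystrokes).
variants = expand_syllable("\u1789", "\u17bb\u17b6\u17c6")
print(len(variants))
```

A syllable with n distinct marks expands to n! orderings, and a multi-syllable query multiplies those counts together, so the number of extended search terms grows quickly.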

The way the Unicode works with Khmer script is rather brilliant because it manages things like longer descenders when the collection of symbols goes lower or higher than a typical collection. So I'm guessing it has rules to follow, and these rules might help in determining comparable Unicode value sequences.

ង្ក្រ compared to ក្រ

The spellchecker idea is an interesting one. I don't know what support there is for a Khmer spellchecker, but perhaps people who compute in Khmer already have it active on their computers. I don't know if we could supply a spellchecker, though. And of course people are free to ignore the spellchecker—I ignore mine all the time.

The possibly good news is that the icu_tokenizer, already used on Khmer-language projects, seems to do a reasonable job of tokenizing Khmer, at least into syllables. Interestingly, it seems to ignore nonsensical characters: for ើប្រ, it just ignores the leading " ើ", which seems to have nothing to properly attach to.

... create an algorithm to extend all search terms into all possible spelling sequences, and then combine all of their results into the one results page. This algorithm could be based off the Unicode rendering rules (if such a thing exists) so that only versions that render the same are included in the extended list.

The more direct approach would be to re-order the tokens into some canonical order—like moving all dependent vowels, subscript consonants, and other diacritics to the end of the word—before indexing them. Then all variants would be interchangeable for search purposes. I don't know what the canonical order would be yet, or whether a good one exists, but my current analyzer analysis tools would make it easy to evaluate—and I could do it offline by re-ordering them with a stand alone tool.
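
As an illustration of what re-ordering into a canonical order could look like, here is a minimal Python sketch. It is not the re-ordering algorithm written up on MediaWiki: the syllable pattern, the class order in `_sort_key`, and the names `SYLLABLE`, `UNIT`, and `reorder` are all assumptions chosen for the example, and it works per syllable rather than per word.

```python
import re

COENG = "\u17d2"  # KHMER SIGN COENG, which introduces a subscript consonant

# One base character followed by any run of subscript consonants and marks.
SYLLABLE = re.compile(
    "([\u1780-\u17b3])((?:\u17d2[\u1780-\u17a2]|[\u17b6-\u17d1\u17d3\u17dd])*)"
)
# A single re-orderable unit: COENG plus consonant, or one mark.
UNIT = re.compile("\u17d2[\u1780-\u17a2]|[\u17b6-\u17d1\u17d3\u17dd]")


def _sort_key(unit):
    """Assumed class order: subscripts, register shifters, vowels, other signs."""
    if unit[0] == COENG:
        return (0, "")  # equal keys, so the stable sort keeps typed order
    if unit in ("\u17c9", "\u17ca"):
        return (1, unit)
    if "\u17b6" <= unit <= "\u17c5":
        return (2, unit)
    return (3, unit)


def reorder(text):
    """Rewrite each Khmer syllable in `text` with its marks in sorted order."""
    def fix(match):
        units = UNIT.findall(match.group(2))
        return match.group(1) + "".join(sorted(units, key=_sort_key))
    return SYLLABLE.sub(fix, text)


# Two of the u/a/M spellings (code points assumed from the keystrokes above).
print(reorder("\u1789\u17c6\u17bb\u17b6") == reorder("\u1789\u17bb\u17b6\u17c6"))
```

Applying the same function at index time and at query time would make the variants interchangeable for search. Note that this sketch also merges orderings that render differently, such as the two ប្រើ spellings above, so a real implementation would need to decide whether that is acceptable.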

The tokenization was the scariest part, so doing the next step of the analysis is definitely tractable.

@Eltimbalino, do you have any more examples you can provide? A few alternate orderings that render the same (like variants of ញ៉ាំ), and alternate orderings that don't work (like ប្រើ) would be useful for basic testing of any approach.

EBjune triaged this task as Medium priority. Feb 6 2018, 6:27 PM
EBjune subscribed.

We need to get a sense of the frequency of this issue, and whether there is a canonical order we can compute.

Anyone here familiar with Khmer (maybe @Eltimbalino?) who can help me with some of the harder corner cases I'm encountering while trying to normalize Khmer syllables?

I've got some automatically re-ordered syllables up for review on MediaWiki. I particularly need help on the first three groups, "???", "Questionably Reordered Syllables", and "Visible Duplicates". Any advice on the others would be great, but those are the ones I am most unsure about.

I've also asked for help on Khmer Wikipedia and Wiktionary.

I've added a preliminary write up of what's been going on so far, including a high-level version of my re-ordering algorithm, on MediaWiki.

Next steps include:

  • Get a sense of the scope of the typo-caused incorrect syllable boundary problem. The complexity of fixing it and the scope of the problem will help determine whether it's worth it to try to handle these cases.
  • Test the effects of re-ordering on tokenization and matching of re-ordered words. I'll look at re-ordering using the command line prototype both before tokenization and after, and see how big a difference it makes on the results. (My guess is that pre-tokenization re-ordering will be much better, but it's complex enough that it's worth it to see how big a difference it makes.)

I also have some additional help/review coming from a couple of sources, so there may be some updates to the re-ordering algorithm and examples.

Further good news: based on my samples, less than 0.2% of syllables need to be re-ordered in Wikipedia and Wiktionary article text, so the problem is important to fix, but not as widespread as it could be. (I should check on queries, too, though.)

Of the syllables that need to be fixed, 0.4% have syllable boundary issues (these are the typos that my algorithm is messing up). That is 0.4% of less than 0.2%, or under 0.001% of all syllables (approximately 0.00%), so that is not a huge problem.

I did a review of Khmer Wikipedia queries for comparison. There's a higher rate of syllables to re-order (1.3% vs <0.2%) and a similarly higher rate of syllable boundary errors, though it is still very low (<0.005% of all syllables detected). There's also the usual collection of queries in miscellaneous scripts, junk queries, and porn queries. One unexpected result is that only about half of all queries have Khmer characters in them, and about half have predominantly Latin characters in them.

In Cambodia, a lot more people use the Internet via mobile devices than via a computer. For a long time it was rather difficult to get Khmer characters and keyboards working on Android devices. During this time, an ad hoc transliteration into Latin characters was common on the Internet. I thought that it had recently gotten better, but after reading your comment I'm not so sure.

@Eltimbalino, can you give me a sample of the transliteration? Is it somewhat phonetic, or is it more like the JMua examples above?

It may not be as bad as you might think. I pulled a sample of 400 queries and looked at them more carefully:

  • 202 are all or mostly Khmer characters
  • 3 are all punctuation
  • 3 are numbers
  • 1 is Bengali
  • Queries that are all or mostly Latin characters break down as:
    • 111 are www-something, xxx-something, or a variant of xnxx (This fits a common pattern of accidental on-wiki searches that look like they were intended as URLs or as Google searches, especially porn.)
    • 32 are English
    • 17 are names
    • 3 are a single letter
    • 1 is French
    • 1 is Turkish
    • 26 are ones I can't easily categorize

Among the 26, there are three that might be names, two that are just double letters, 14 that have lots of x's (a pattern I see everywhere among queries that get few or no results), and 7 that I would normally classify as "junk" because they don't look like words (but neither does JMua). Apparent junk queries often don't have enough vowels to be phonetic words—like jpmkhfvj or grgrrgg—though they do rarely turn out to be something with unexpected meaning.
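
The first cut of a breakdown like the one above (Khmer vs. Latin vs. everything else) can be approximated automatically. This is only a sketch: the "all or mostly" rule is assumed to mean a simple majority of letters, `dominant_script` is a hypothetical helper, and the finer categories (names, English, junk) still need a human reviewer.

```python
def dominant_script(query):
    """Classify a query as 'khmer', 'latin', or 'other' by counting letters."""
    # Characters in the main Khmer block, U+1780 to U+17FF.
    khmer = sum(1 for ch in query if "\u1780" <= ch <= "\u17ff")
    # Unaccented Latin letters only; accented letters are not counted here.
    latin = sum(1 for ch in query if ch.isascii() and ch.isalpha())
    if khmer == 0 and latin == 0:
        return "other"  # punctuation, numbers, other scripts
    return "khmer" if khmer >= latin else "latin"


for sample in ("\u1789\u17c9\u17b6\u17c6", "jpmkhfvj", "123", "..."):
    print(repr(sample), dominant_script(sample))
```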

Now, a lack of Khmer-character support might still be affecting the usage of Wikipedia, because the transliterated queries probably don't get any results. If that's the only input method you have, then searching on Khmer Wikipedia isn't going to be very useful, so it'd make sense to stop using it.

I've finished my analysis of the effects of pre-tokenization (harder) vs post-tokenization (less hard) re-ordering using my command line re-ordering tool. The difference is pretty big, so I guess I'm going to have to do it the hard(er) way! More details on MediaWiki.

Incidentally, it seems like it would make sense to map Khmer numerals (០១២៣៤៥៦៧៨៩) to Arabic numerals (0123456789) for indexing, too.

Next Steps:

  • Create a character filter plugin to re-order Khmer syllables and test adding it to the Khmer analysis chain.
  • Create a character filter for Khmer-to-Arabic numerals.
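
The numeral mapping itself is a fixed ten-entry table. Here is a minimal Python sketch of it, for illustration only: it is not the character filter plugin, and `KHMER_TO_ARABIC` and `map_khmer_digits` are illustrative names.

```python
# Khmer digits occupy U+17E0 (zero) through U+17E9 (nine).
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
KHMER_TO_ARABIC = str.maketrans(KHMER_DIGITS, "0123456789")


def map_khmer_digits(text):
    """Replace Khmer digits with ASCII digits and leave everything else alone."""
    return text.translate(KHMER_TO_ARABIC)


print(map_khmer_digits("\u17e2\u17e0\u17e1\u17e8"))
```

Because the mapping is one character to one character, it does not change string length or offsets, which keeps it simple to apply before tokenization.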

Hi Trey,

I've grabbed some samples from different places so that you get variety.
There's no standard for their transliteration into Latin characters, so it
is a strange sort of phonetic where ideas about how Latin characters sound
vary a lot. I can't blame them, English is very far from phonetic and their
pronunciation of English makes establishing gut-level rules impossible.
You'll also see that they don't mind stacking a few consonants before the
first vowel, and in combinations that are not used in English.

I hope this helps,

Tim.

p.s. Also, I definitely agree with your intention to filter Khmer numerals
into Arabic ones.

Kmean bong duch kmean dong ham.
oun eong vor eong vor tang tekpeak bong som bek bong som beak teang oun min
teun tream klun pheab jichab oun eoch rab rong jun dum bey oun bong ton tov
na puv na
Tos deasdong nis kteach ktum rol they eong vol thay nov kom dol thay oun
los seay ban te pel vele min eoy bong rong cham yuban te.
Mnos srey mneak nis nong tov rhot bonghu tekphek tam but chom reang chrang
teang ho tekpheak beasdong eongvor sreak kom thon beak ban te meas ma ros
kmean bong duch kman dong ham pdem ji min chung ro pros krob beal veale oun
trovka oun nas bong tov tov tov peal oun slab peal oun min jihab pros oun
min jichab teat te kom ton tov peal oun nov ros pros oun each jichab cham
oun slab cham bong tov chus.
oun eong vor eong vor tang tekpeak bong som bek bong som beak teang oun min
teun tream klun pheab jichab oun eoch rab rong jun dum bey oun bong ton tov
na puv na
Tos deasdong nis kteach ktum rol they eong vol thay nov kom dol thay oun
los seay ban te pel vele min eoy bong rong cham yuban te.

sdap jos plaeng khmer sraek jae srey cherk–tomninh snaeha
yub yun sngat sone oun nirk bong tae–yubyun sngatson
20 chnam knong kuk dauy dai srey la-or–20 chnam knong kuk
moha-songkran chran chnam mouy tov–battambong
ter beysach kromom rir avey, chleuy auy trong mork–beysach kromom
tous oun jea srey, beysach kor dauy, kor snae min prae thauy krauy, jet
mouy tlerm mouy louhs krea avasan–besach kromom
bomnach jet snae stir tae laeb baan
bong daeh muat stung–for mak (oun soam doss dai)–the story for mak
touhs jea bong deung tha oun sa-op bong yang na kor dauy kor bong nov tae
prouy tha oun baan kae
somraek sat kreal vea deal jae sdey munus doch kakey, srey chet
sava–tomninh snaeha
soriya yeang lea nopea kiev kjey bopha reek tmey, nea o akara, khnom jea
yuthajun sraek klean aha…bann kun bopha dak tean dauy smoss–bopha reach tmey

Neary krup roup prathna jang ban
Snaeha soksann sopheak mung’kul
Jea snae tae muoy kmean ak-ku-sal
Snaeha scob scall lus dol morana.

Beu mien nisai som joup kom baek
Proml’khet kom jrek oy snae bress cha
Beu ban jea joup joup muoy sang’kha
Jeas phut tuk’kha reung rav kou kam.

Kou snae knyom euy soum jea kou korp
Reung rav phet kbat jronaen pro-jam
Sava nin’tea nireas mien kam
Sach’ja jorng jam kom oy keut mien.

Soum oy chet borng douch jea chet oun
Snae smoss luss soun kmean avey tun trien
Chet muoy tlem muoy pdou pdach krup jeat
Soum juop kom kleat muoy veh’lea leuy.

happy birthday bong joli oun son jun por bong oy mean som nang la or sok ka
pheab la or hey neng rok to tul tean mean ban phong der na jas hey ber bong
pong jong ban a vey som oy ban som rach doch bom

ja!!!!!! orkun Lita pa'oun srey bong som aoy Lita toh toul ban seuch kdey
sok ning sech kdey jom reun toh reang tov

kyom som joun por oy joli mean sok kapheap laor, mean somrors kan tae saart
neng kirt avay oy ban douch bomnong pratna na

pourk yeung ban jaek today na paoun srey
kir kei jom knea nov mok wat ounalorm heuy norm knea york tov jek tam plov
neng mok vaing .

thnai ti 2 khae 1 neng mean jaek bai kajob mdang teat , paoun jang

Thanks, @Eltimbalino. That is a lot more phonetic than the residue of the queries I looked at, so I don't think many people are using the transliterated Khmer on-wiki.

p.s. Also, I definitely agree with your intention to filter Khmer numerals into Arabic ones.

That's good to hear—thanks!

I've moved this to "Waiting" while I wrap up some work on other open tasks.

Ok, here we are a year later. Sorry for the significant delay. Too many other projects have pushed ahead of this one. I'm working on this again and hope to have it done by the end of the calendar year.

Change 647814 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra@master] Create new extra plugin for Khmer syllable reordering

https://gerrit.wikimedia.org/r/647814

Change 647814 merged by jenkins-bot:
[search/extra@master] Create new extra plugin for Khmer syllable reordering

https://gerrit.wikimedia.org/r/647814

Change 659369 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Khmer Syllable Reordering in Analysis Chain

https://gerrit.wikimedia.org/r/659369

The full write-up of the analysis of the new reordering plugin is on MediaWiki: Khmer Reordering Analysis Analysis.

Change 659369 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Khmer Syllable Reordering in Analysis Chain

https://gerrit.wikimedia.org/r/659369