add <langconvert> parser tag
Open, Needs Triage, Public

Description

For the Balinese palm-leaf project grant I'm working on, I will need the ability to transliterate text in Balinese script to Latin script. The output will look something like this Palmleaf.org page.

The transliteration rules will be implemented in the new Balinese LanguageConverter class (under review). However, the existing LanguageConverter facilities are not sufficient, because I don't want to convert whole pages into either Balinese or Latin script. Rather, I want the Latin transliteration to supplement the Balinese original and appear below it. This means that I need a way to convert particular chunks of wikitext from one Balinese variant to another, and insert the result in a flexible manner.

To do this, I propose adding a <langconvert> tag to CoreParserTags.php to allow flexible access to LanguageConverter. It takes two attributes: from (language variant from) and to (language variant to). For example, <langconvert from="sr-Latn" to="sr-Cyrl">zdravo</langconvert> would return "здраво" (Latin Serbian to Cyrillic Serbian).
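
As a rough illustration of the proposed behaviour (not the actual MediaWiki implementation, which is PHP and delegates to the LanguageConverter classes), the following Python sketch expands a `<langconvert>` tag using a toy character table. The names `CONVERTERS`, `LATIN_TO_CYRILLIC`, and `expand_langconvert` are hypothetical, the table covers only the letters of the example word, and leaving unknown variant pairs unconverted is an assumption.

```python
import re

# Toy table: only the letters needed for "zdravo". A real converter
# covers the whole alphabet, digraphs, and exceptions.
LATIN_TO_CYRILLIC = {"z": "з", "d": "д", "r": "р", "a": "а", "v": "в", "o": "о"}

# Hypothetical registry of converters, keyed by (from, to) variant pair.
CONVERTERS = {
    ("sr-Latn", "sr-Cyrl"): LATIN_TO_CYRILLIC,
    ("sr-Cyrl", "sr-Latn"): {v: k for k, v in LATIN_TO_CYRILLIC.items()},
}

# Matches <langconvert from="..." to="...">body</langconvert>.
TAG_RE = re.compile(
    r'<langconvert\s+from="([^"]+)"\s+to="([^"]+)">(.*?)</langconvert>',
    re.DOTALL,
)


def expand_langconvert(wikitext: str) -> str:
    """Replace each <langconvert> tag with its converted body."""

    def replace(match: re.Match) -> str:
        source, target, body = match.groups()
        table = CONVERTERS.get((source, target))
        if table is None:
            # Assumption: an unknown variant pair leaves the text as-is.
            return body
        # Characters missing from the table pass through unchanged.
        return "".join(table.get(ch, ch) for ch in body)

    return TAG_RE.sub(replace, wikitext)


print(expand_langconvert(
    '<langconvert from="sr-Latn" to="sr-Cyrl">zdravo</langconvert>'
))
```

Text outside the tag is left alone, which is the point of the proposal: only the marked chunk is converted, so the original and its transliteration can sit side by side on the same page.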

Event Timeline

Change 627938 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/core@master] implement #transliterate parser function

https://gerrit.wikimedia.org/r/627938

Change 636108 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/core@master] Implement Balinese language converter

https://gerrit.wikimedia.org/r/636108

Change 627938 merged by jenkins-bot:
[mediawiki/core@master] Implement <langconvert> tag

https://gerrit.wikimedia.org/r/627938

kamholz renamed this task from add #transliterate parser function to add <langcovnert> parser tag. Dec 15 2020, 8:40 PM
kamholz updated the task description. (Show Details)
Reedy renamed this task from add <langcovnert> parser tag to add <langconvert> parser tag. Dec 16 2020, 2:50 AM
Johan subscribed.

Added to https://meta.wikimedia.org/wiki/Tech/News/2020/52 – please let me know if there are any mistakes in the text.

Looks great, thanks!

Change 651011 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Parser test for Balinese language conversion

https://gerrit.wikimedia.org/r/651011

Change 636108 merged by jenkins-bot:
[mediawiki/core@master] Implement Balinese language converter

https://gerrit.wikimedia.org/r/636108

Please (also) support BCP 47 conformant language codes such as sr-Latn or sr-Cyrl, rather than only sr-el and sr-ec, in the <langconvert> tag.

BCP 47 conformant language codes are already required in HTML attributes such as lang:

<span lang="sr-Cyrl">здраво</span>

Currently the HTML lang attribute requires a BCP 47 conformant language code, while the langconvert element requires MediaWiki's internal language codes:

<span lang="sr-Cyrl"><langconvert from="sr-el" to="sr-ec">zdravo</langconvert></span>

It would be better to always use BCP 47 conformant language codes:

<span lang="sr-Cyrl"><langconvert from="sr-Latn" to="sr-Cyrl">zdravo</langconvert></span>

I agree that this would be better. Unfortunately SrConverter internally uses sr-ec and sr-el rather than BCP 47 compliant codes, so the correct converter has to be identified via those codes. There is a mechanism for converting from sr-ec to sr-Cyrl, but in this case we'd have to go the other way, and I'm not aware of any such conversion mechanism built into the core classes (Language, LanguageCode, LanguageFactory, LanguageConverter).

The correct fix is probably to change the internal codes in SrConverter and any other converters that use non-standard codes, but those codes are also used in other places such as i18n, so I assume that wouldn't be an easy change. I'm open to other ways to do this, but I'd need to understand all of the implications, and I don't know enough about the possible impacts to judge that well right now.

Can we open a new phab task for this? I apologize for not noticing/flagging this earlier. There are a number of tasks already in phab to deprecate and remove the old mediawiki codes (including sr-ec, sr-el, etc) and it would be a significant step backwards to have the old names written into article wikitext, which would require manually updating all that wikitext in the future.

I *think* there's already functionality to convert from BCP-47 codes to internal mediawiki ones, since that is needed by the REST APIs (for example) which use BCP-47 codes in their HTTP standard language request headers. But if not, I'm happy to write that function for use.

EDIT: LanguageConverter::validateVariant() is the existing method that accepts BCP 47 codes and converts them to internal codes. Ideally you'd have code like this:

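// Resolve the user-supplied code to MediaWiki's internal variant code.
$internalCode = $converter->validateVariant( $givenCode );
// Flag input that is not the canonical BCP 47 spelling of that variant.
if ( LanguageCode::bcp47( $internalCode ) !== $givenCode ) {
   // error or warning or something, at least a tracking category
}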
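
The round-trip check described above can be sketched in standalone Python. This is only an illustration of the idea, not MediaWiki code: the alias table is hypothetical and limited to codes mentioned in this thread, and `validate_variant`, `to_bcp47`, and `check_variant` are made-up stand-ins for LanguageConverter::validateVariant() and LanguageCode::bcp47().

```python
from typing import Optional, Tuple

# Hypothetical alias table, restricted to codes mentioned in this thread.
INTERNAL_TO_BCP47 = {
    "sr-ec": "sr-Cyrl",
    "sr-el": "sr-Latn",
    "zh-sg": "zh-Hans-SG",
}
# Reverse lookup, keyed case-insensitively.
BCP47_TO_INTERNAL = {v.lower(): k for k, v in INTERNAL_TO_BCP47.items()}


def validate_variant(given: str) -> Optional[str]:
    """Return the internal code for either spelling, or None if unknown."""
    lowered = given.lower()
    if lowered in INTERNAL_TO_BCP47:
        return lowered
    return BCP47_TO_INTERNAL.get(lowered)


def to_bcp47(internal: str) -> str:
    """Return the canonical BCP 47 spelling of an internal code."""
    return INTERNAL_TO_BCP47[internal]


def check_variant(given: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (internal code, warning); warning is None for canonical input."""
    internal = validate_variant(given)
    if internal is None:
        return None, "unknown variant"
    canonical = to_bcp47(internal)
    if canonical != given:
        # The code resolves, but was not written in canonical BCP 47 form.
        return internal, "non-canonical code; should be " + canonical
    return internal, None


print(check_variant("sr-Cyrl"))
print(check_variant("sr-ec"))
```

With this shape, old internal codes keep working in existing wikitext while still being flagged (for instance via a tracking category), which makes a later migration to BCP 47 codes easier.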

Change 651011 merged by jenkins-bot:
[mediawiki/core@master] Parser test for Balinese language conversion

https://gerrit.wikimedia.org/r/651011

There's an issue with this extension tag. Sometimes it ignores -{H|rule}- rules when converting, regardless of whether they are placed inside or outside the tag.

Besides, why is it required to use the strict BCP47 name? For example, zh-hans-sg is an acceptable name, but users are more used to zh-sg.

For zh-SG the script is ambiguous, because https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry contains no Suppress-Script definition for zh or SG. Therefore zh-Hans-SG has to be used.

However, forms like zh-sg are used frequently. For example, the URL parameters variant and uselang take zh-sg rather than zh-hans-sg.

The URL parameters uselang and variant accept both forms: uselang=zh-sg and variant=zh-sg, as well as uselang=zh-hans-sg and variant=zh-hans-sg.

The reason the tag sometimes ignores -{H|rule}- rules when converting is given in T302158#7724040.