
Integrate new section mapping database
Closed, ResolvedPublic

Description

A new approach was used to generate section mappings in T293511. As a result a database was produced capturing those mappings.

This ticket is about integrating this database into Section Translation so that the tool can make use of the new mappings. This is expected to improve the way equivalent sections are identified.

The integration should be done in a way that allows the data from the different approaches (previous CX translations, MT, and this new approach) to be updated independently. That is, we should be able to re-generate the database created in T293511 next year and replace its old contents without removing the mappings coming from CX translations in the process.
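For illustration only, one way to keep those updates independent is to tag each mapping with where it came from; the origin column and table name below are hypothetical and not part of any existing schema:

  -- Hypothetical sketch: refresh only the research-based mappings.
  DELETE FROM section_titles WHERE origin = 'research';
  -- ...then bulk-insert the freshly re-generated T293511 mappings with
  -- origin = 'research', leaving the 'cx' and 'mt' rows untouched.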

After the integration we may want to re-check the problematic cases captured in T283817 to update their status.

Event Timeline

Pginer-WMF triaged this task as Medium priority.Apr 27 2022, 8:40 AM
Pginer-WMF created this task.
santhosh changed the task status from Open to In Progress.May 10 2022, 11:31 AM
santhosh claimed this task.

Change 790654 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Add script to extract the section title mappings from ML based db

https://gerrit.wikimedia.org/r/790654

Change 790654 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Add script to extract the section title mappings from ML based db

https://gerrit.wikimedia.org/r/790654

Change 792598 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Section alignment db: Size optimizations

https://gerrit.wikimedia.org/r/792598

After running the above script to extract the section titles for the language pairs that are used by CX (we have 733 language pairs with at least 10 translations), our section alignment database is 472 MB. It used to be ~55 MB.

This is a huge SQLite file to keep in a git repository. We also cannot push it to Gerrit because of the 100 MB per-file limit (we get the error "Object too large (495,554,560 bytes), rejecting the pack. Max object size limit is 104,857,600 bytes.").

This SQLite database has a single table with source language, target language, source title and target title columns. Using it as part of the Node.js-based cxserver has been quite easy and performant enough so far; even with this large size, it is fast enough.
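For illustration, a lookup against that table could look like the following; the column names follow the schema quoted later in this task (including its frequency column), and "History" is just a hypothetical section title:

  -- Hypothetical lookup: known French equivalents of the English section
  -- title "History", most frequently seen mapping first.
  SELECT target_title, frequency
  FROM titles
  WHERE source_language = 'en'
    AND target_language = 'fr'
    AND source_title = 'History'
  ORDER BY frequency DESC;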

A simple option I can think of is splitting this db into smaller db files based on source language and conditionally loading the right one from the code that reads this database for section suggestions. There are 6,692,506 records in total; for source language en we have 1,452,320 records, and for es 529,625 records. We extracted this from the 2.5 GB SQLite db file that the research team provided.
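For what it's worth, such a per-source-language split could be done entirely in SQLite along these lines; this is only a sketch of the idea and the file name is made up:

  -- Sketch: copy the English-source rows into their own smaller db file.
  ATTACH DATABASE 'cx-section-titles-en.db' AS en_db;
  CREATE TABLE en_db.titles AS
      SELECT * FROM titles WHERE source_language = 'en';
  DETACH DATABASE en_db;
  -- The reading code would then open only the file matching the requested
  -- source language.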

@Ladsgroup, do you have any advice here? (Happy to explain the context if it is not clear.) Is there a hassle-free solution for placing this db somewhere?

Hi!
Yeah, putting it in Gerrit wouldn't work. It's going to bloat the whole repo and its history, and cloning it in the future can cause all sorts of issues for both the server (from Apache to Gerrit's db) and the client. If it's data, it belongs in a database, not a VCS.

We don't have SQLite databases in production, but we can set up a dedicated MariaDB database in the misc cluster (probably m5 or something like that) really quickly and give you credentials to connect to it from the cx service. Uploading the data can be done either by simply sftp-ing it to mwmaint or by something else like running a MediaWiki script that populates the db. Worst case, hand it over to us and we do the initial dance, but you will need to build something for future updates.

It's small enough for us to be able to accommodate your db request quickly. We can also have a quick call to go through the details.

m5 should be fine, but connections would need to go through the proxy (i.e. connect to m5-master.eqiad.wmnet).
What's the amount of read/writes you'd expect?

Thanks @Ladsgroup and @Marostegui. I have placed the SQL dump of the titles table from the SQLite database here: https://people.wikimedia.org/~santhosh/titles.tar.gz (114 MB)
The whole SQLite db is also available at https://people.wikimedia.org/~santhosh/cx-section-titles-aligned.db.tar.gz (117.8 MB)

There won't be any write operations from cxserver to this table; cxserver uses it read-only. Data population happens through scripts, which we might run once a year or so to update the data.
The read traffic is very low at this point, as the section translation feature that will use this db is deployed only to small wikis for logged-in users: roughly 100 translations per week, though we expect this to grow as we expose the tool to more users. The data is used to suggest sections missing in articles and invite users to translate them; this is shown in a dashboard. Here is an example API call that uses this data: https://cxserver.wikimedia.org/v2/suggest/sections/Plato/en/fr
This is accessed only when a logged-in user on a small set of configured wikis accesses the Special:CX page, so traffic is very low at this point in time.

We will need some changes in the cxserver code to access this db, plus proxy configuration to connect. We will take care of that once the DB is set up.

I certainly can put the data there, but I'll let Manuel create the database (or we can do it together, so I'd be more confident doing it next time), and we would give you credentials so your tool can connect to it and query it.

Give us:

  • db username
  • db name
  • db grants

@Ladsgroup, the db name can be cxserverdb. The others can be whatever comes out of setting up the DB in production.

@KartikMistry I've created the db on m5; you can query it on m5-master.eqiad.wmnet. The username is cxserver, and I put the password in your home directory on deploy1002.

I haven't moved the data there yet, will do it ASAP.

What are the grants needed for this?

I added the GRANT; it's SELECT only for now.
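For reference, a SELECT-only grant along those lines would look roughly like this; the host pattern is a placeholder, and the actual statement lives in puppet rather than here:

  -- Sketch: read-only access for the cxserver user on the cxserverdb database.
  -- '10.%' is only a placeholder for the allowed client hosts; the user itself
  -- is created separately with a password.
  GRANT SELECT ON cxserverdb.* TO 'cxserver'@'10.%';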

Let's also add them to puppet in the m5.sql file (if that wasn't done already).

I'll do it once I've moved the data.

Change 798661 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Add cxserverdb grant

https://gerrit.wikimedia.org/r/798661

Change 798661 merged by Ladsgroup:

[operations/puppet@production] mariadb: Add cxserverdb grant

https://gerrit.wikimedia.org/r/798661

Before I move on to inserting the data:

CREATE TABLE titles
            (source_language TEXT,
            target_language TEXT,
            source_title TEXT,
            target_title TEXT,
            frequency INTEGER);

This schema makes creating the index impossible:

root@db1107.eqiad.wmnet[cxserverdb]> CREATE INDEX title_source_target
    ->         ON titles (source_language, target_language);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes

We probably should use something smaller than TEXT. varbinary(255) is more than enough. Would that work?

(And/or we could use the lang code, which is much smaller and reduces the size of the db drastically, but that's your call.)

Oh it's already lang code, let's just go with varbinary(63), that's way more than enough.
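A MariaDB-friendly version of the schema along those lines might look roughly like this; the varbinary(63) language codes and the varbinary(255) titles follow the comments above, but this is a sketch rather than the exact production DDL:

  -- Sketch: varbinary(63) language codes and varbinary(255) titles per the
  -- discussion above; not necessarily the exact production DDL.
  CREATE TABLE titles (
      source_language VARBINARY(63)  NOT NULL,
      target_language VARBINARY(63)  NOT NULL,
      source_title    VARBINARY(255) NOT NULL,
      target_title    VARBINARY(255) NOT NULL,
      frequency       INT,
      INDEX title_source_target (source_language, target_language)
  );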

The data is now there; a few insertions failed due to encoding issues, but it should be done now.

Thanks, @Ladsgroup!

Change 799338 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] WIP: Use hosted mysql database for section suggestion

https://gerrit.wikimedia.org/r/799338

Change 799338 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use hosted mysql database for section suggestion

https://gerrit.wikimedia.org/r/799338

Change 801663 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2022-05-31-111430-production

https://gerrit.wikimedia.org/r/801663

Change 801663 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2022-05-31-123738-production

https://gerrit.wikimedia.org/r/801663

Change 792598 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Section alignment db: Size optimizations

https://gerrit.wikimedia.org/r/792598

Change 805726 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2022-06-15-074244-production

https://gerrit.wikimedia.org/r/805726

Change 805726 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2022-06-15-074244-production

https://gerrit.wikimedia.org/r/805726