Replica database schema (tables and indexes)
Many of the indexes on these tables are compound indexes designed to optimize the runtime performance of the MediaWiki software rather than to be convenient for ad hoc queries. For example, a naive query by page_title such as SELECT * FROM page WHERE page_title = 'NOFX'; will be slow, because the only index that includes page_title is a compound index starting with page_namespace. Adding page_namespace to the WHERE clause improves the query speed dramatically: SELECT * FROM page WHERE page_namespace = 0 AND page_title = 'NOFX';
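You can check whether a query will be able to use an index before running it by prefixing it with EXPLAIN, a standard MariaDB statement; the key column of its output shows which index, if any, would be used (the exact index name shown will depend on the table definition):

```sql
-- The compound index cannot be used without its leading column:
EXPLAIN SELECT * FROM page WHERE page_title = 'NOFX';
-- "key" is NULL here, meaning a full table scan.

-- Supplying the leading column (page_namespace) allows the index to be used:
EXPLAIN SELECT * FROM page WHERE page_namespace = 0 AND page_title = 'NOFX';
-- "key" now names the compound (page_namespace, page_title) index.
```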
Stability of the MediaWiki database schema
maintenance/tables.sql shows the HEAD of the MediaWiki schema. Extra tables may be available due to additional extensions set up in production. Also, some tables may have been redacted or filtered because they contain private data, such as user passwords or private IP addresses. Aside from that, while we try to synchronize production with development HEAD, changes to the database structure may be applied in advance of (or, more commonly, lag behind) their publication. The reason for this is that schema changes are continuously applied to the production databases and, due to the amount of data involved, can take anywhere from a few hours to a few months (for the more complex cases) to be finalized.
Core tables such as revision, page, user, and recentchanges rarely change, but cloud maintainers cannot guarantee they will never change, as they have to follow the production changes. While we are happy for people to set up scripts and tools on top of the database copies (Wiki Replicas), expect the schema to change every now and then. If you cannot make small tweaks from time to time to adapt to the latest schema changes, using the API instead of the database internals is suggested, as API changes come with stronger stability guarantees and a proper lifecycle and deprecation policy. That is not true for MediaWiki database internals, although compatibility views can sometimes be set up to require only minimal changes.
Tables for revision or logging queries involving user names and IDs
The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when suppressed" (see also phab:T68786 for more detail). For these types of queries, use the corresponding revision_userindex or logging_userindex views instead. On those views, rows where the column would otherwise have been nulled are elided, which allows the indexes to be used.
Example query that will use the appropriate index (in this case on the rev_actor column):
SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_actor=1234;
Example query that fails to use an index, because the underlying table does not have one:
SELECT rev_id, rev_timestamp FROM revision WHERE rev_actor=1234;
Use the indexed views whenever possible; the difference in query performance can be dramatic.
The majority of the user_properties table has been deemed sensitive and removed from the Wiki Replica databases. Only the disableemail and nickname properties are available.
Some of the standard MediaWiki tables that are in use on Wikimedia wikis are not available. The following tables are missing or empty:
There is a table with automatically maintained meta information about the replicated databases: meta_p.wiki. See Quarry #4031 for an up-to-date version.
The database host containing the meta_p database is: meta.analytics.db.svc.wikimedia.cloud.
MariaDB [meta_p]> DESCRIBE wiki;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |
| is_sensitive     | decimal(1,0) | NO   |     | 0       |       |
+------------------+--------------+------+-----+---------+-------+
MariaDB [meta_p]> select * from wiki limit 1 \G
*************************** 1. row ***************************
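As a sketch of how meta_p.wiki can be used (the dbname value 'enwiki' here is just an illustrative example), you can look up the public URL and replica slice for a given wiki:

```sql
-- Run against meta.analytics.db.svc.wikimedia.cloud
SELECT dbname, lang, name, url, slice
FROM meta_p.wiki
WHERE dbname = 'enwiki';
```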
If there is a network or Wiki Replica infrastructure problem, a production problem, maintenance (scheduled or unscheduled), excessive load, or production or user queries blocking the replication process, the Wiki Replicas can "lag" behind the production databases.
To identify lag, see the replag tool, or run the following query yourself on the database host you are connected to:
MariaDB [heartbeat_p]> SELECT * FROM heartbeat;
+-------+----------------------------+--------+
| shard | last_updated               | lag    |
+-------+----------------------------+--------+
| s1    | 2018-01-09T22:47:05.001180 | 0.0000 |
| s2    | 2018-01-09T22:47:05.001190 | 0.0000 |
| s3    | 2018-01-09T22:47:05.001290 | 0.0000 |
| s4    | 2018-01-09T22:47:05.000570 | 0.0000 |
| s5    | 2018-01-09T22:47:05.000670 | 0.0000 |
| s6    | 2018-01-09T22:47:05.000760 | 0.0000 |
| s7    | 2018-01-09T22:47:05.000690 | 0.0000 |
| s8    | 2018-01-09T22:47:05.000600 | 0.0000 |
+-------+----------------------------+--------+
8 rows in set (0.00 sec)
This table is based on the pt-heartbeat tool, not on SHOW MASTER STATUS. This produces very accurate results even if replication is broken, because lag is measured directly against the original master, not against the replica's direct master.
- shard: s1-s8, each of the production masters. The wiki distribution can be seen at: https://noc.wikimedia.org/db.php
- last_updated: Once every second, a row on the master is written with the master's local date. Here you see its value once replicated. As it is updated only every second, it has a measurement error of [0, 1+] seconds.
- lag: The difference between the current date and the last_updated column (timestampdiff(MICROSECOND,`heartbeat`.`heartbeat`.`ts`,utc_timestamp())/1000000.0). Again, note that updates to this table happen only once per second (this can vary in production), so most decimals are meaningless.
To directly query the replication lag for a particular wiki, use requests like:
MariaDB [fawiki_p]> SELECT lag FROM heartbeat_p.heartbeat JOIN meta_p.wiki ON shard = SUBSTRING_INDEX(slice, ".", 1) WHERE dbname = 'fawiki';
+------+
| lag  |
+------+
|    0 |
+------+
1 row in set (0.09 sec)
Please note that a few seconds or minutes of lag is considered normal, due to the filtering process and the hops made before data reaches the public hosts.
User-created databases can be created on a shared server: tools.db.svc.wikimedia.cloud. Database names must start with the name of the credential user followed by two underscores and then the name of the database: <credentialUser>__<DBName> (e.g. "s51234__mydb").
The credential user is not your user name. It can be found in your $HOME/replica.my.cnf file. The name of the credential user looks something like 'u1234' for a user and 's51234' for a tool account. You can also find the name of the credential user using a live database connection:
SELECT SUBSTRING_INDEX(CURRENT_USER(), '@', 1);
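Without a live connection, the credential user can also be read from the credentials file directly. A minimal sketch in Python, assuming the standard MySQL client INI format with a [client] section (the path and quoting are based on typical replica.my.cnf files):

```python
import configparser
import os

def credential_user(path=os.path.expanduser("~/replica.my.cnf")):
    """Return the 'user' value from the [client] section of a replica.my.cnf file."""
    config = configparser.ConfigParser()
    config.read(path)
    # Values in replica.my.cnf are typically quoted, so strip surrounding quotes.
    return config["client"]["user"].strip("'\"")
```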
Privileges on the database
Users have all privileges and have access to all grant options on their databases. Database names ending with _p are granted read access for everyone. Please create a ticket if you need more fine-grained permissions, such as sharing a database between only two users, or other special permissions.
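For example, to make a database publicly readable, include the _p suffix in its name when creating it (the credential user s51234 below is illustrative, taken from the naming example above):

```sql
-- Readable by everyone because the name ends in _p
CREATE DATABASE s51234__mydata_p;
```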
Steps to create a user database on tools.db.svc.wikimedia.cloud
To create a database on tools.db.svc.wikimedia.cloud:
- Become your tool account.
maintainer@tools-login:~$ become toolaccount
- Connect to tools.db.svc.wikimedia.cloud with the replica.my.cnf credentials:
mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
You could also just type:
sql tools
- In the mysql console, create a new database (where CREDENTIALUSER is your credential user, which can be found in your ~/replica.my.cnf file, and DBNAME is the name you want to give to your database; note that there are two underscores between CREDENTIALUSER and DBNAME):
MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME;
You can then connect to your database using:
$ mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud CREDENTIALUSER__DBNAME
or:
$ sql tools
MariaDB [(none)]> USE CREDENTIALUSER__DBNAME;
Assuming that your tool account is called "mytool", this is what it would look like:
maintainer@tools-login:~$ become mytool
tools.mytool@tools-login:~$ mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.wikimedia.cloud
MariaDB [(none)]> select substring_index(current_user(), '@', 1) as uname;
+---------------+
| uname         |
+---------------+
| u123something |
+---------------+
1 row in set (0.00 sec)
MariaDB [(none)]> create database u123something__wiki;
Note: Some projects, such as Django applications, can throw an exception like MySQLdb._exceptions.OperationalError: (1709, 'Index column size too large. The maximum column size is 767 bytes.') when migrations are run using the setup above. In most cases this can be fixed by altering the database charset to utf8. To avoid the problem, create the database with an explicit charset instead:
MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME CHARACTER SET utf8;
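If the database already exists, its default charset can be changed afterwards. Note that this sketch only changes the database default; tables created before the change keep their original charset and may need a separate ALTER TABLE ... CONVERT TO CHARACTER SET:

```sql
-- Change the default charset of an existing database
-- (CREDENTIALUSER__DBNAME is a placeholder, as above)
ALTER DATABASE CREDENTIALUSER__DBNAME CHARACTER SET utf8;
```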
ToolsDB Backups and Replication
We maintain two copies of the user and tool databases in ToolsDB, with a MySQL primary-replica setup. However, we do not take offline backups of any of the databases in ToolsDB. ToolsDB users can back up their data using mysqldump:
$ umask o-r  # dump should not be public (unless the database is)
$ mysqldump --defaults-file=~/replica.my.cnf --host=tools.db.svc.wikimedia.cloud credentialUser__DBName > ~/DBname-$(date -I).sql
$ umask 0022  # restore default umask
Note that we don't recommend storing backups permanently on NFS (/data/project, /home, or /data/scratch on Toolforge) or on any other Cloud VPS hosted drive. True backups should be kept offsite.
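A dump made this way can later be restored by feeding it back to the mysql client; the date-stamped file name below follows the mysqldump example above and is only illustrative:

```
$ mysql --defaults-file=~/replica.my.cnf -h tools.db.svc.wikimedia.cloud credentialUser__DBName < ~/DBname-2018-01-09.sql
```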
There are some ToolsDB databases which are not replicated for various reasons including size and access patterns:
Users were warned about this limitation; details can be seen at task T127164.