
[Epic] Extension:JADE scalability concerns
Closed, Resolved · Public

Description

We realized that there are certain usage patterns which would cause Extension:JADE to create unacceptable numbers of pages. Since we create a JADE-namespace page for each revision being judged, we cannot allow a judgment for every revision. This behavior would quickly swell the page table to many times its current size, and would at least double the revision table size.

Some quick estimates (detailed in comments below) show that creating a JADE entry for rule-based, automated events such as assigning an "autopatrolled" flag is an anti-pattern that we must discourage at this stage. The volume of these patrolling tags can quite easily equal the number of revisions being created. This has already become a problem for the logging table (T184485: Stop logging autopatrol actions) so autopatrolled events are being discarded there as of a few months ago.

Expected usage at human scales will eventually result in about 0.5M new pages created per year, if all existing workflows can be migrated to create judgments.

Our working conclusion is that we need to rely on social agreements to not do silly things with JADE until the technical limitations are overcome. There is strong precedent for this approach, for example https://en.wikipedia.org/wiki/Wikipedia:Bot_policy, which already covers the situation we're looking at. Generally, bots are not allowed to create millions of unneeded pages, and that's exactly the situation we're concerned about here. The remaining work is for our team to write a clear statement about what currently counts as harmless vs. harmful JADE usage, and to circulate it among the communities and tool authors that might integrate with JADE or otherwise use the new namespace.

In the long run, we do hope to overcome the storage limitations, so that we can allow bots and other interesting information sources to populate JADE in ways that make sense, in addition to the humans that we've had in mind as our primary users.

Current thoughts about scalability are summarized here:
https://etherpad.wikimedia.org/p/JADE_scalability_FAQ

Conclusion

We're going to restrict deployment to exclude the biggest wikis. Our margin of safety currently excludes the following wikis:

  • enwiki
  • wikidatawiki
  • commonswiki
  • dewiki
  • frwiki

Event Timeline


Estimating impact:

select count(*) from logging where log_type = 'patrol' and log_timestamp > 20180101000000;

For enwiki, 108,274 patrol actions, or 108274 / 13534614 seconds elapsed this year ≈ 0.008 per second.
Wikidata is similar, with 117,508 patrol actions logged this year.

Article history (wp10) assessment templates are transcluded on 39,515 pages.

JADE pages will be created on the same wiki as editing activity.

Conservatively, if all current patrolling and assessment activity is migrated to JADE, we'll add roughly 250k rows to the page and revision tables over the next year. For comparison, wikidata currently includes 51M pages.
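
As a rough cross-check (my arithmetic, not from the original estimate), annualizing the year-to-date enwiki count lands in the same ballpark:

select 108274 / 13534614 * 86400 * 365 as enwiki_patrols_per_year;
-- ≈ 252k/year, consistent with the ~250k figure above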

The thing about enwiki is that it has NPP but not RCP (new page patrolling, but not recent changes patrolling), which means this count covers only new pages on enwiki, not edits. It's safe to assume the total number of patrol actions is around one order of magnitude higher for enwiki (not for Wikidata, which already has RCP).

Thanks, so NPP patrolling isn't logged... Here's a reasonable upper bound for enwiki, if every new page were patrolled:

select count(*) from revision where rev_parent_id = 0 and rev_timestamp > 20180101000000;

1,336,138 so far this year, or about 0.1 per second.
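
For context, the per-second figure just divides by elapsed time; this assumes the query ran in mid-June, about 14M seconds into 2018 (the exact timestamp is my assumption):

select 1336138 / 14000000 as new_pages_per_second;
-- ≈ 0.095, i.e. roughly 0.1 per second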

Pulling a normal ratio of RCP patrolling to NPP patrolling from frwiki,

select count(*) as all_patrol_count,
       sum(case when log_params rlike '.*"previd";s:1:"0".*' then 1 else 0 end) as npp_count
from logging
where log_action = 'patrol' and log_timestamp > 20180101000000;
+------------------+-----------+
| all_patrol_count | npp_count |
+------------------+-----------+
|           103554 |      6440 |
+------------------+-----------+

Extrapolating to enwiki, that would mean only about 6% of patrolling activity is being logged there, so scaling up the logged count gives more than the upper bound of all new revisions found above.
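
Worked out explicitly (my arithmetic; the constants come from the queries above):

select 6440 / 103554 as npp_share; -- ≈ 0.062: on frwiki only ~6% of patrol actions are NPP
select 108274 / (6440 / 103554) as est_enwiki_all_patrols; -- ≈ 1.74M: enwiki's NPP-only count scaled by the frwiki ratio, above the 1.34M upper bound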

Wikidata wouldn't survive a year of this worst-case load. It has received 200M edits in the past 12 months, so in the worst case we would have created 400M new JADE: pages, one for each diff judgment and one for each revision judgment.

I think this confirms that we shouldn't deploy to production with the current schema. It might be helpful to engage with tool authors to learn more about their expected use cases as well.

@Ladsgroup has suggested a workaround where we collect judgments on a per-page basis, and it is starting to look good to me.

Some negatives to the per-page approach:

  • Slightly incompatible with ORES, which is per-revision. For example, fetching an ORES+JADE response by revision ID will require an index to look up the judgment page from the revision ID (see the sketch after these lists).

Positives:

  • Shouldn't make edit race conditions any worse since edits to a page will happen sequentially.
  • Still mostly adheres to the "everything is a wiki page" philosophy. An index adds overhead, but there are precedents such as the pagelinks table.
  • Worst-case impact is that we inflate the page table by no more than 2x (a JADE page for every other page), and the revision table by a constant factor that depends on the number k of reviewers per revision (k JADE revisions for every other revision).
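
To make the index idea concrete, here's a minimal sketch of the revision-to-judgment-page lookup mentioned under the negatives above. The table and column names are hypothetical, not an agreed schema; in the worst case this table gains one row per judged revision:

create table jade_revision_judgment (
  jrj_rev_id int unsigned not null primary key, -- the judged revision
  jrj_page_id int unsigned not null -- the JADE page collecting judgments for that revision's page
);
-- resolving an ORES-style per-revision request to the per-page judgment:
select jrj_page_id from jade_revision_judgment where jrj_rev_id = 12345678;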

We also have a scalability problem if it becomes popular to create JADE meta-judgments of other JADE revisions. It makes sense to allow this behavior, and I don't expect it to be used for ordinary judgments, only for contentious ones. Out-of-control meta-JADE looks like an infinite loop of a bot judging its own JADE revisions, and would result in unlimited spam until the bot is blocked or disabled.

Just to record this: we also discussed that humans wouldn't be able to cause much of a scalability problem unassisted, but if we end up with many bots editing JADE, we could easily have problems. We have to rely on social conventions to prevent disaster scenarios, at least until the technical loopholes are closed.

In the per-page schema proposed above, the page-revision index would grow at a worrying rate, up to one index entry per revision added to the wiki.

I don't think we should be designing for the worst-case scenario here. There are many situations where content creation patterns are controlled in order to preserve the infra -- e.g. sensitive templates are protected and bots are not allowed to mass-purge pages.

If we're to follow the advice that everything is a wiki page, we don't have many options other than trusting humans not to take it out of control and regulating bot behavior.

Adding onto @Halfak's comments, I agree that social convention seems to be the best way to protect against runaway JADE usage. Specifically: don't automate bulk actions in JADE. For now, we want it to be a repository of human opinions, so integrations should be limited to API actions that directly convey human judgments. For example, don't use JADE to tag pages' worth of articles as "patrolled" just because a patroller saw an ORES score and a title and decided the changes were low-risk.

That said, there's no reason to keep this restriction in the future if the technical problems can be resolved. JADE would also be useful as a repository of human, assisted-human, and bot judgments.

@Ladsgroup, it would be great if you could weigh in with your concerns if you still think we shouldn't deploy.

awight renamed this task from Scalability concerns creating a page per revision to Extension:JADE scalability concerns due to creating a page per revision.Jun 13 2018, 4:57 PM
awight updated the task description.
Halfak updated the task description.
Joe triaged this task as Medium priority.Jun 18 2018, 10:47 AM

I would at least think we should exclude bots from editing/creating such pages.

Absolutely! We're planning to rely on the usual social agreements about bot editing, since it would be impossible to enforce a rule like this using technical means. For reference, the main use case for JADE will be API integration with third-party patroller tools like Huggle.

I would like this to wait for a review by the DBA and Traffic teams.

If this is really necessary, please help us by giving a timeline for review once this is possible. Personally, I love getting early feedback, such as @Ladsgroup starting this thread by noting that certain usage patterns could be problematic, but I believe we've already addressed these (by relying on the usual social mechanisms). There's no potential for our deployment to suddenly break the databases or the site. The only "attack" vector would be an out-of-control bot creating pages via the MediaWiki API, which is exactly the same known threat as for any other wiki namespace, and there are simple defenses such as blocking the naughty bot.

Since most concerns pivot around the question of scalability, especially of the page table in the case of massive (over-)use, I'm asking myself:

What level of adoption do you hope for? For 1000 article edits, how many JADE pages should ideally be created? How many would be too many? How many would be too few for your purpose?

For the record, it would be possible to use MCR to store judgments about revisions. The idea would be to have a JADE slot on each page, which would contain all judgments about revisions of that page. Changes to the judgment of any revision would show up in the history of the page itself. Judgments about other things, like users, would have to live somewhere else.

This would break if the page has lots of revisions and lots of them have lots of judgments: the content blob of the JADE slot would then grow too large. This problem would, however, be self-contained; it would have no impact on overall DB performance.

With this scheme, judging every edit once would double the number of rows in the revision table, but would not impact the page table.

Since most concerns pivot around the question of scalability, especially of the page table in the case of massive (over-)use, I'm asking myself:

What level of adoption do you hope for? For 1000 article edits, how many JADE pages should ideally be created? How many would be too many? How many would be too few for your purpose?

We're looking for a few specific things, and also hoping for emergent properties that we can't anticipate.

At the simplest level, very small numbers of judgments, even a single page or a few dozen pages, might be enough to bring our attention to systematic issues. For example, "ha" is a real word in Italian and some other languages, but otherwise its presence indicates that the author is doing something bad. A handful of ad-hoc reports from itwiki were enough to identify this issue.

At the next level of scale, maybe a few hundred (n.b. this is outside my expertise) JADE pages will be enough to automatically compensate for less obvious bias issues.

Emergent stuff would follow some other rules, probably starting with popular or contentious articles. Even one or two JADE entries on a page should be enough to cause interesting effects, like changes in patrolling behavior. A really exciting situation would be if editors used JADE to resolve some kind of debate, which again takes probably a handful of JADE entries. In this case, the discussion might all take place on one JADE page with multiple judgments and endorsements, and its talk page.

To try to answer your question of how many JADE pages would be needed per article edits, my current thinking is that the proportion of JADE pages created is actually dependent on the accuracy of ORES for the wiki, and on the scale of vandalism. The real limiting factor is the productivity of human patrollers. How many JADE pages we *want* is independent of these ratios, and could be very low and still be useful.

For the record, it would be possible to use MCR to store judgments about revisions. The idea would be to have a JADE slot on each page, which would contain all judgments about revisions of that page. Changes to the judgment of any revision would show up in the history of the page itself. Judgments about other things, like users, would have to live somewhere else.

This would break if the page has lots of revisions and lots of them have lots of judgments: the content blob of the JADE slot would then grow too large. This problem would, however, be self-contained; it would have no impact on overall DB performance.

With this scheme, judging every edit once would double the number of rows in the revision table, but would not impact the page table.

Strange--we looked at MCR (discussion here) and the biggest issue (which I think isn't captured on the talk page) was that the judgments would have to be submitted along with the revision itself, which isn't our use case. Maybe you're suggesting that we would do something like:

  • rev1 - content slot edit
  • rev2 - content slot edit
  • rev3 - JADE slot edit about rev1
  • rev4 - content slot edit

Yes, that's what I'm suggesting: make the JADE edit a separate edit on the page.

You are correct about the limitations, as described in your comment: all judgments for all revisions of the page would live in a single blob.

Suppression would work by suppressing the content of the edit that created the judgment, after reverting it. Per-slot suppression is currently not implemented, since reverting and suppressing edits seems sufficient.

Yes, that's what I'm suggesting: make the JADE edit a separate edit on the page.

Cool, we hadn't actually looked at that possibility. Another reason we avoided MCR was that it was immature at the time we were evaluating, and I was concerned that recent change patrolling and AbuseFilter might not be well-integrated. We'll give it another round of discussion.

I was concerned that recent change patrolling and AbuseFilter might not be well-integrated. We'll give it another round of discussion.

Revision patrolling should Just Work (tm).

AbuseFilter integration is a deployment blocker. Non-main slots will not be deployed until we know that AbuseFilter works reasonably well on them. Making AbuseFilter fully aware of MCR will need more thought. It's not clear what that means, yet - e.g. do we need per-slot rules, or is it OK for all rules to apply to all slots?

Some thoughts about MCR:

  • We want this new structured space to be available for both collaborative auditing and for patrolling, and these activities will require some sort of collaboration space such as talk pages. Having a talk page per type of judgment allows very specific conversations. I'm not sure whether the article's talk page would be the right place for these discussions? Maybe sometimes...
  • There would need to be an index where JADE judgments can be looked up per-revision. This is more or less going to become necessary anyway, so not a big deal.
  • If we deploy using the JADE namespace, future migration into MCR will mean breaking all URLs or replacing them with a redirect. This redirect isn't trivial, since we can't anticipate in which revision judgments will be stored.
  • MCR storage would let us support high-volume use cases, which is nice.

I'm not sure whether the article's talk page would be the right place for these discussions?

I think the right place to discuss judgments of edits to a page is that page's talk page. I can't imagine a case where this would not be appropriate.

This redirect isn't trivial, since we can't anticipate in which revision judgments will be stored.

In the current revision. The JADE slot in the page's current revision corresponds to the current revision of a JADE page. Except that it contains the judgments for all revisions of the page.

Anyway, I didn't intend to derail this discussion.... is this the right place to discuss MCR as an alternative?

Anyway, I didn't intend to derail this discussion.... is this the right place to discuss MCR as an alternative?

I think so, at least I'm considering it as a solution to the scalability issue.

Meanwhile, the team discussed the MCR proposal, and we don't like that it would force a schema change in which we collect all judgments on every revision of a page into one document. That document would grow with the number of revisions, as you said, and would make it hard to see conversations focusing on a single revision, etc. We think it would be equivalent to implement this single-page schema in the JADE namespace, which avoids the early-adopter risks of MCR.

Another potential annoyance is that JADE can be applied to any namespace, and we're not sure whether MCR would be available in, for example, the Talk namespace. It would also be difficult to use JADE to judge JADE content if we were using MCR.

Back to giving a reasonable estimate, now that we're only planning for human patrolling actions. frwiki has 212k NPP and RCP patrol logs from the past 12 months, against 11M revisions, a rate of about 2% patrol actions per revision. This probably doesn't scale linearly with wiki size, since it's limited by human labor, but let's make the conservative assumption that we might see a jump in the number of patrollers. Looking at our biggest project, Wikidata, and assuming continued acceleration of 1.27x year-over-year, we'll see 250M revisions created in the next 12 months. At frwiki's patrol rate, that would mean 5M new JADE pages on Wikidata, if we had the patroller labor. More realistically, there are only 254k patrol logs from the last 12 months, or 0.13% of revisions, so human labor does indeed seem to be the true limit.
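
The same numbers, worked through (arithmetic only; all constants are from this comment):

select 212000 / 11000000 as frwiki_patrol_rate; -- ≈ 0.019, i.e. about 2%
select 250000000 * 212000 / 11000000 as wikidata_worst_case_pages; -- ≈ 4.8M, the ~5M above
select 254000 / 200000000 as wikidata_actual_patrol_rate; -- ≈ 0.0013, i.e. 0.13%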

Using the human labor assumption, the upper bound on JADE impact is 200-300k new pages and revisions per year, for our biggest wikis.

I'm looking at two more data sources that we may decide to integrate with: PageTriage and FlaggedRevs. For the purposes of this discussion, I'll note that we might capture another 1,200 review actions per day on a few large wikis where these tools are used extensively (enwiki and dewiki, respectively), amounting to roughly 500k additional pages and revisions per year on each of those wikis.
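
Annualized (my arithmetic), 1,200 review actions per day works out to roughly the half-million figure cited above:

select 1200 * 365 as reviews_per_year; -- = 438,000, i.e. roughly 0.5M new pages (plus their revisions) per year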

Vvjjkkii renamed this task from Extension:JADE scalability concerns due to creating a page per revision to ojbaaaaaaa.Jul 1 2018, 1:06 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
ArielGlenn renamed this task from ojbaaaaaaa to Extension:JADE scalability concerns due to creating a page per revision.Jul 1 2018, 6:47 AM
ArielGlenn lowered the priority of this task from High to Medium.
ArielGlenn updated the task description.
ArielGlenn added a subscriber: Aklapper.

CC @Fjalapeno, I'd be interested in your thoughts about the potential for a flood of data here.

Hi @jcrespo @BBlack, nudging per T183381#4296475 and here, we're hoping to deploy a new extension whose impact is limited to about 0.5M additional pages created per year, on large wikis, assuming the most optimistic, uncontrollable uptake scenario. I'd love to hear DBA and Traffic perspectives on the proposal.

See mw:JADE for context. The potential benefit is high, as the new data being created will be used for retraining AIs, reducing redundancy in patrol and other existing workloads, and research.

From the database point of view, I would like to completely block this deployment outside of testwiki. We can consider deploying its data to a separate set of database servers, but not the main metadata ones (s*).

"we need to rely on social agreements to not do silly things" does not work at all, based on past experiences with similar cases with ORES (evaluating very old revisions) and translation (this turned out ok, but mosty because being on x1 made thins way easier to clean up after the "social contract" failed).

We were already going to expand x1 due to the reading lists needs; maybe, if you can wait some months, we can deploy it there, where we will have the extra resources needed?

@jcrespo Thanks for the reply!

"we need to rely on social agreements to not do silly things" does not work at all, based on past experiences with similar cases with ORES (evaluating very old revisions)

I think we can put this concern to rest. JADE and ORES are diametrically opposite systems and furthermore will not be coupled in any way at this point. JADE is a type of wiki page (content handler) that humans will be editing, and ORES is a completely automated score computation and caching service.

There are no custom queries on the database, simply explicit or implicit calls to mw-core WikiPage::doEditContent() and WikiPage::getContent().

The risks are as simple as with deploying any other new namespace.

and translation (this turned out OK, but mostly because being on x1 made things way easier to clean up after the "social contract" failed).

I'd like to learn more about what happened with translation, but at a quick glance I don't see parallels.

Both incidents https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation and T163344: Do a root-cause analysis on CX outage during dc switch and get it back online seem to have been caused by complex db queries, although I appreciate why you say the "social contract" failed here, because we had created a DoS vector that was accidentally exploited by a user's unusual behavior. There is no DoS vector in JADE that isn't already present in storing pages to the other wiki namespaces.

Can you explain more about the potential for a new namespace to cause db issues?

Sorry, you didn't understand what I meant. For ORES it was T159753, and for translation, T183485; both assumed that people would "do normal things with new functionality". Based on that, we no longer allow content to be stored on metadata servers: content should be stored on content servers OR separated to a different set of servers. New functionality should come with the budget to store such content. As I said, my compromise would be to use the x1 cluster, as we have already budgeted its expansion, and it has less of a tax in terms of redundancy. We are still recovering from issues like T177702, where a bad architecture resulted in a wiki DoS, and an extension that scales rows based on revisions is a really, really, really bad idea, as shown by my previous 3 examples. If you don't need to join, just use x1 or content storage.

@jcrespo I see, well in this case content storage is exactly what we're planning to use. Is there anything special to do in order to set that up? For example, the judgment about https://en.wikipedia.org/?diff=12345678 will be made on the same wiki, in https://en.wikipedia.org/wiki/Judgment:Diff/12345678

The scaling of rows with revisions is more a mathematical point of interest than anything else in our proposal. The same goes for the fact that JADE can be used to make a judgment about JADE, which looks like it invites an infinite loop. Neither of these is actually an issue when the pages are created at human scales, because the limiting factor will be human labor, as I showed in comment T196547#4301801

Ok, that is much better, but I guess it still would double the revision table (or the 5 new tables that are to replace revision). Could those metadata be stored on x1 instead of the main metadata servers (s*) to reduce the overhead footprint? Metadata servers are duplicated around 20 times, while x1 only 4-6 times, and it still has (or will soon have) way more available resources.

We already have major issues with the scalability of the revision table, and doubling its content will only make it worse. Partitioning it on a separate table (and hopefully, host) would help delay those.

There is also a need to implement strict filters regarding bots. Social rules ("you shouldn't do X") are not enough to limit bot changes. Those violations will happen, as I mentioned, per my previous examples.

we do hope to overcome the storage limitations

With which resources? This was not part of our budget for this fiscal year, so where do I get the extra server resources we would need for the (non-trivial) extra QPS and storage? We can get extra resources if the request is fair, reasonable, and well thought out, but hardware purchases are planned almost a year in advance!

Even if that were not an issue (maybe those resources were requested), I see no research on the amount of extra CPU/QPS/IOPS and storage needed for this extension.

There's a lot to go through in this thread. We won't be doubling the revision table; my current estimate for the upper bound of activity is 0.5M additional pages and revisions per year on the largest wikis, and only hundreds or thousands of additional pages on the smaller wikis. If you want to store revisions from this namespace on x1, that sounds like a reasonable precaution to me. Where is this sharding configured? Is it okay for us to continue to use wiki pages and revisions, or would we have to use a custom table?

The doubling estimate and "we do hope to overcome the storage limitation" are both in the context of a future use case that we are not supporting for several years, by which point I hope to have improved the schema and had time to budget the storage. This is the case where we accept automated submissions from sources such as integration with the "autopatrolled" algorithm. We've dropped this plan for now, and will actively discourage the behavior as spam-vandalism.

API server resources will scale linearly with the number of pages added, probably with a very small overhead relative to normal ApiEditPage. I hadn't thought to estimate these resources; would you be able to point to existing numbers for the edit API?

To be clear, we don't expect any surge of users once this extension is enabled. Each phase of deployment is quite controlled, and will consist of integrating a specific workflow on a wiki or number of wikis. It's quite hard to create Judgment pages manually, and we don't see much risk of that happening on its own, so each workflow is a very intentional step that we can roll back if problems arise.

I think the proposed plan has deep architectural problems at the storage layer, so we should discuss in depth the possibilities for moving forward. I don't have any problem with the functionality itself; it is the proposed way of implementing it that we should try to agree on. I propose organizing a video meeting to discuss this better.

I think the proposed plan has deep architectural problems at the storage layer, so we should discuss in depth the possibilities for moving forward. I don't have any problem with the functionality itself; it is the proposed way of implementing it that we should try to agree on. I propose organizing a video meeting to discuss this better.

Thanks for the suggestion—I tentatively put a meeting on our calendars for this Friday (the 13th :-)

@daniel Surprisingly, there is interest in this going through TechCom after all. I've been distilling the discussion into this page, https://etherpad.wikimedia.org/p/JADE_scalability_FAQ, but I'd like to ask you: what format would be best for presenting the issues to the committee?

Here are the notes from our meeting, plus some more discussion afterwards:
https://etherpad.wikimedia.org/p/JADE_scalability_minutes_2018-07-17

Halfak renamed this task from Extension:JADE scalability concerns due to creating a page per revision to Extension:JADE scalability concerns.Aug 7 2018, 5:14 PM
awight renamed this task from Extension:JADE scalability concerns to [Epic] Extension:JADE scalability concerns.Sep 11 2018, 10:10 PM
awight added a project: Epic.
awight moved this task from Review to Non-Epic on the Machine-Learning-Team (Active Tasks) board.

This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision table size >= 100GB. That is enwiki and wikidata. We're also staying off of dewiki and frwiki because they're close enough to consider within the "unsafety" margin.

This was addressed for now, by an agreement between our team and SRE to not install JADE on wikis with revision table size >= 100GB. That is enwiki and wikidata. We're also staying off of dewiki and frwiki because they're close enough to consider within the "unsafety" margin.

There are some other big wikis (commons) where this is also a concern and some other agreements were made in order to be on the safe side: T200297#4493122
Especially point #5 at that link

Thanks!

There are some other big wikis (commons) where this is also a concern and some other agreements were made in order to be on the safe side: T200297#4493122
Especially point #5 at that link

Good point! Plus, my team decided to also exclude dewiki and frwiki. I've updated the task description to reflect this.