
wdqs-updater crashing when update size is too large
Closed, Resolved · Public

Description

Around 11:30 UTC on Nov 23 2018, the wdqs updater crashed on all wdqs nodes. The updater is restarted by systemd a few times, but enters a failed state once the start limit is reached. The logs don't provide much information (see below).

Restarting the updater manually does not solve the issue; it goes back to the failed state after a number of attempts.

11:40:15.580 [update 8] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/5ff5d1308dcd060222a9dcea0a29de7f5828ed2c: unknown value http://www.wikidata.org/value/7e281616976c7de150357c18e76abfd1 for http://www.wikidata.org/entity/Q22089547
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
11:40:39.662 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
11:40:39.666 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
11:40:40.070 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Found left off time from the updater
11:40:40.071 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Found start time in the RDF store: 2018-11-23T11:25:15Z
11:40:40.144 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Creating consumer wdqs1005
11:40:40.729 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Subscribed to 6 topics
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.page-undelete-0 to (timestamp=1542972315000, offset=140144)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.revision-create-0 to (timestamp=1542972315000, offset=56945679)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.page-undelete-0 to (timestamp=1542972315000, offset=6575)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.revision-create-0 to (timestamp=1542972315000, offset=982852035)
11:40:40.730 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic codfw.mediawiki.page-delete-0 to (timestamp=1542972315000, offset=409711)
11:40:40.731 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Set topic eqiad.mediawiki.page-delete-0 to (timestamp=1542972315000, offset=7485389)
11:40:42.883 [main] INFO  o.w.q.rdf.tool.change.KafkaPoller - Found 927 changes
11:41:04.045 [update 0] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/f0241d2920c31202a1c72a60f2e154b44ffb7447: unknown value http://www.wikidata.org/value/054f90e140d6e4478605edbffbcb192e for http://www.wikidata.org/entity/Q2841515
11:41:04.434 [update 6] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Weird reference http://www.wikidata.org/reference/290275f5e6979162473e25b67ddf983812d47fc5: unknown value http://www.wikidata.org/value/054f90e140d6e4478605edbffbcb192e for http://www.wikidata.org/entity/Q388110

Event Timeline

Out of memory error on a bind:

Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.StringBuilder.append(StringBuilder.java:202)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.regex.Matcher.appendReplacement(Matcher.java:883)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at java.lang.String.replace(String.java:2240)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bind(UpdateBuilder.java:32)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindStatements(UpdateBuilder.java:70)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:303)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:213)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Update.run(Update.java:133)
Nov 23 11:49:57 wdqs1005 wdqs-updater[13325]:         at org.wikidata.query.rdf.tool.Update.main(Update.java:76)
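The call chain above ends in `String.replace`, which builds the entire bound query in a single `StringBuilder`. A minimal sketch of the pattern (method and placeholder names here are guesses from the stack trace, not the actual `UpdateBuilder` source):

```java
public class BindSketch {
    // String.replace builds the whole result in one contiguous StringBuilder,
    // so binding a very large statements block needs heap for roughly
    // template + value at once -- with a multi-hundred-MB batch this alone
    // can exhaust a 2G heap, matching the OOM in Matcher.appendReplacement.
    static String bind(String template, String name, String value) {
        return template.replace("%" + name + "%", value);
    }

    public static void main(String[] args) {
        String query = bind("INSERT DATA { %statements% }", "statements", "<s> <p> <o> .");
        System.out.println(query);
    }
}
```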

Mentioned in SAL (#wikimedia-operations) [2018-11-23T12:04:32Z] <gehel> manually increasing wdqs-updater heap to 4G - T210235

Increasing the heap stops the updater from crashing, but Blazegraph refuses updates > 200M (we probably don't want to increase this limit).

Change 475455 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: reduce batch size to 300 temporarily

https://gerrit.wikimedia.org/r/475455

Change 475455 merged by Gehel:
[operations/puppet@production] wdqs: reduce batch size to 300 temporarily

https://gerrit.wikimedia.org/r/475455

Reducing the batch size seems to work; updates are being processed again.

Mentioned in SAL (#wikimedia-operations) [2018-11-23T14:02:23Z] <gehel> restor wdqs-updater heap to 2G - T210235

Gehel triaged this task as High priority.Nov 26 2018, 1:25 PM
Gehel added a subscriber: Smalyshev.

The immediate issue is solved. I'd like a review from @Smalyshev to see if we have a good way to prevent this from happening again without increasing complexity too much.

All I can think of is a retry mechanism with smaller batch sizes when the update queries get too large, but that seems like a lot of additional complexity for a very specific error scenario. We can probably do better.
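For reference, the retry idea could look roughly like this (a hypothetical sketch, not the actual Updater code; `sendUpdate` and the too-large exception are made up):

```java
import java.util.List;
import java.util.function.Consumer;

public class RetrySketch {
    // Hypothetical: thrown when the serialized update exceeds what the store accepts.
    static class UpdateTooLargeException extends RuntimeException {}

    /** Recursively split a rejected batch in half until every piece is accepted.
     *  Returns the number of successful requests actually sent. */
    static <T> int syncWithSplit(List<T> batch, Consumer<List<T>> sendUpdate) {
        if (batch.isEmpty()) {
            return 0;
        }
        try {
            sendUpdate.accept(batch);
            return 1; // the whole batch fit in one request
        } catch (UpdateTooLargeException e) {
            if (batch.size() == 1) {
                throw e; // a single change is still too large: give up
            }
            int mid = batch.size() / 2;
            return syncWithSplit(batch.subList(0, mid), sendUpdate)
                 + syncWithSplit(batch.subList(mid, batch.size()), sendUpdate);
        }
    }

    public static void main(String[] args) {
        // Simulate a backend that rejects batches bigger than 3 changes.
        List<Integer> changes = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        int requests = syncWithSplit(changes, b -> {
            if (b.size() > 3) throw new UpdateTooLargeException();
        });
        System.out.println("requests: " + requests);
    }
}
```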

This is weird; bind shouldn't consume that much space... I'll need to look at the code in detail, but something looks wrong there. Update batches should not be that big.

Smalyshev renamed this task from wdqs-updater crashing on all wdqs servers to wdqs-updater crashing when update size is too large.Nov 28 2018, 7:53 PM
Smalyshev lowered the priority of this task from High to Medium.Dec 11 2018, 1:04 AM

Given that this seems to be very rare, dropping the priority a bit.

Also got this one today:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
        at java.lang.StringBuffer.append(StringBuffer.java:310)
        at java.lang.StringBuffer.append(StringBuffer.java:97)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
        at java.lang.StringBuffer.append(StringBuffer.java:338)
        at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
        at java.util.regex.Matcher.replaceAll(Matcher.java:955)
        at java.lang.String.replace(String.java:2240)
        at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
        at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
        at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
        at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
        at org.wikidata.query.rdf.tool.Update.run(Update.java:163)
        at org.wikidata.query.rdf.tool.Update.main(Update.java:88)

We probably need some special handling for huge updates.

We have a limit of 20M per update, and we usually update about 500 to 700 entities, so we expect the average item's data size to be no more than around 30k. However, we have things like:

-rw-r--r-- 1 smalyshev wikidev 763K Dec 17 08:40 Q56939386.ttl
-rw-r--r-- 1 smalyshev wikidev 634K Dec 17 08:40 Q56901541.ttl
-rw-r--r-- 1 smalyshev wikidev 634K Dec 17 08:40 Q56881481.ttl
-rw-r--r-- 1 smalyshev wikidev 621K Dec 17 08:40 Q56900584.ttl
-rw-r--r-- 1 smalyshev wikidev 620K Dec 17 08:40 Q56895083.ttl
-rw-r--r-- 1 smalyshev wikidev 618K Dec 17 08:40 Q56895307.ttl
-rw-r--r-- 1 smalyshev wikidev 618K Dec 17 08:40 Q56900617.ttl
-rw-r--r-- 1 smalyshev wikidev 615K Dec 17 08:40 Q56895281.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56897805.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56902115.ttl
-rw-r--r-- 1 smalyshev wikidev 614K Dec 17 08:40 Q56897887.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56901716.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56899853.ttl
-rw-r--r-- 1 smalyshev wikidev 613K Dec 17 08:40 Q56898200.ttl
-rw-r--r-- 1 smalyshev wikidev 610K Dec 17 08:40 Q56900456.ttl
-rw-r--r-- 1 smalyshev wikidev 610K Dec 17 08:40 Q56897358.ttl

Looks like many scientific article entries are enormous. Most regular entries are still under 10k. We may want to track entity sizes, btw; I wonder if it's easy to do (maybe extract it somehow from the HTTP response?).
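On tracking entity sizes: a cheap first approximation might be the standard `Content-Length` header of the entity-data response, when the server sends one (hypothetical sketch; the header is absent on chunked responses, so a fallback is needed):

```java
import java.util.List;
import java.util.Map;

public class EntitySizeSketch {
    /** Return the declared body size from response headers, or -1 if unknown. */
    static long contentLength(Map<String, List<String>> headers) {
        List<String> values = headers.getOrDefault("Content-Length", List.of());
        return values.isEmpty() ? -1L : Long.parseLong(values.get(0));
    }

    public static void main(String[] args) {
        // Roughly the size of one of the large .ttl files above, in bytes.
        Map<String, List<String>> headers = Map.of("Content-Length", List.of("781312"));
        System.out.println(contentLength(headers));
    }
}
```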

I assume when we encounter such data we'd have to chunk the update.
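Chunking along those lines means cutting the statement list into pieces whose serialized size stays under the POST limit. A simplified sketch with made-up names (not the merged code):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {
    /**
     * Split serialized statements into chunks whose combined length stays
     * under maxPostSize, so each POST to the triple store fits its buffer.
     * A single oversized statement still gets a chunk of its own.
     */
    static List<List<String>> chunk(List<String> statements, int maxPostSize) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentSize = 0;
        for (String st : statements) {
            if (!current.isEmpty() && currentSize + st.length() > maxPostSize) {
                chunks.add(current);          // flush the full chunk
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(st);
            currentSize += st.length();
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk(List.of("aaaa", "bbbb", "cc", "dddddd"), 8));
    }
}
```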

Change 480697 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Enable chunking of updates to not overrun POST buffer in Blazegraph

https://gerrit.wikimedia.org/r/480697

Change 480697 merged by jenkins-bot:
[wikidata/query/rdf@master] Enable chunking of updates to not overrun POST buffer in Blazegraph

https://gerrit.wikimedia.org/r/480697

Change 482815 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Move max POST size configuration to RdfRepository.

https://gerrit.wikimedia.org/r/482815

Change 482902 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Extract the classification of statements in its own class.

https://gerrit.wikimedia.org/r/482902

Change 482815 merged by jenkins-bot:
[wikidata/query/rdf@master] Move max POST size configuration to RdfRepository.

https://gerrit.wikimedia.org/r/482815

Change 482902 merged by jenkins-bot:
[wikidata/query/rdf@master] Extract the classification of statements in its own class.

https://gerrit.wikimedia.org/r/482902