
Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS
Open, High, Public

Description

Problem statement:

We are experiencing severe performance issues with the process that keeps Wikidata and the triple store behind WDQS in sync. These performance issues cause edits on Wikidata to be throttled. While reviewing the way we do updates on the store, we decided to move most of the synchronization/reconciliation process out of the triple store, with the objective of sending only the minimal amount of information needed to mutate the graph with a set of trivial operations (ADD/REMOVE triples). This is where blank nodes are problematic (to dig further into why they are problematic I suggest reading the proposal on TurtlePatch, which is an attempt to formalize a patching format for RDF backends).

Where blank nodes are currently used

In Wikibase we use blank nodes for two purposes:

  • to denote the existence of a value (ambiguously named "unknown value" in the UI) (originally discussed in T95441)
  • OWL constraints for the wdno property

For the SomeValue use case we seem to use the blank node only as a way to filter such values.
For the OWL constraints it's unclear whether they are actually used or useful.

Suggested solution

One option is to do blank node skolemization as explained in RDF 1.1 3.5 Replacing Blank Nodes with IRIs.

@prefix genid: <http://www.wikidata.org/.well-known/genid/> .

wd:Q3 a wikibase:Item ;
    wdt:P2 genid:a8d14fa93486370345412093add8f50c .
wds:Q3-45abf5ca-4ebf-eb52-ca26-811152eb067c a wikibase:Statement ;
    ps:P2 genid:a49fd4307e7deef3b569568be8019566 ;
    wikibase:rank wikibase:NormalRank .

This way such triples would remain "reference-able", allowing the WDQS backend to be patched with simple INSERT DATA / DELETE DATA statements, without having to query the graph.
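For illustration (a minimal sketch, reusing the hypothetical genid IRI from the example above and assuming the wd:/wdt: prefixes predefined in WDQS), removing such a triple becomes a trivial, directly addressable mutation:

PREFIX genid: <http://www.wikidata.org/.well-known/genid/>

# the skolem IRI can be named directly, which is impossible with a blank node
DELETE DATA {
  wd:Q3 wdt:P2 genid:a8d14fa93486370345412093add8f50c .
}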

Problems induced with the approach in WDQS
  • Queries using isBlank() will be broken
    • Mitigate the issue by introducing a new function wikibase:isSomeValue() so that queries relying on isBlank() can be rewritten (see the sketch after this list).
  • Conflating classic IRIs with SomeValue IRIs (use of isURI/isIRI)
    • Queries using isIRI/isURI risk conflating SomeValue IRIs with ordinary IRIs and would have to be reviewed.
  • Consumers of WDQS results expecting blank nodes in results:
    • will have to change to understand the skolem IRIs
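A rough sketch of the isBlank() rewrite mentioned in the mitigation bullet above, using the proposed (not yet existing) wikibase:isSomeValue() function and wdt:P570 ("date of death") as an arbitrary example property:

# before: relies on SomeValue being a blank node
SELECT ?person WHERE {
  ?person wdt:P570 ?dod .
  FILTER(isBlank(?dod))
}

# after skolemization: same query using the proposed function
SELECT ?person WHERE {
  ?person wdt:P570 ?dod .
  FILTER(wikibase:isSomeValue(?dod))
}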
Migration plan
  1. Introduce a new wikibase:isSomeValue() function to ease the transition
  2. Start using stable and unique labels for blank nodes in wikibase RDF Dumps
  3. Do blank node skolemization in the WDQS update process [BREAKING CHANGE]
  4. Skolemize blank nodes in the RDF Dump [BREAKING CHANGE]
NOTE: step 4 is not strictly required to address the work regarding the performance of the update process. It is added because there were some concerns about adding another difference between the dump format and WDQS.

There are more detailed discussions around this topic here as well.

Event Timeline


CCing @mkroetzsch and @Denny for input on the RDF model – they probably have some use cases in mind.

queries based on the number of unknown values on a particular property? Examples would help here I think.

Note that people who are counting unknown values with wdt:P106 ?blank already only count best-rank statements; and if they count all statements via p:P106 ?statement. ?statement ps:P106 ?blank, then they should still get the correct count. So I’m not sure how bad the collapsing of unknown values would be in practice.

The “find entities which share the same value” example is a very good point, though. That might be dangerous.
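(For concreteness, a sketch of the two counting patterns meant here, using the wdt:/p:/ps: prefixes predefined in WDQS:)

# counts best-rank ("truthy") unknown-value statements only
SELECT (COUNT(*) AS ?count) WHERE {
  ?person wdt:P106 ?value .
  FILTER(isBlank(?value))
}

# counts all unknown-value statements, whatever their rank
SELECT (COUNT(*) AS ?count) WHERE {
  ?person p:P106 ?statement .
  ?statement ps:P106 ?value .
  FILTER(isBlank(?value))
}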

I did some tests and isBlank is a lot faster (I suppose because this information is inlined as opposed to the IRI that has to be fetched from its dictionary).

If we go with an wikibase:isUnknownValue function, then its implementation might also be able to be very efficient, also looking only at the inlined part of the information? Not sure. (I’m assuming that these wdunk: nodes would be inlined as a special “unknown value” type and the “entity ID + UUID” part, similar to how entity IDs, I believe, are inlined as their type plus the numeric ID part. Hopefully the UUID isn’t too long to be inlined? But maybe even if it is, the wikibase:isUnknownValue function wouldn’t need to load the not-inlined part? I don’t know enough Blazegraph internals.)

Hi,

Using the same value for "unknown" is a very bad idea and should not be considered. You already found out why. This highlights another general design principle: the RDF data should encode meaning in structure in a direct way. If two triples have the same RDF term as object, then they should represent relationships to the same thing, without any further conditions on the shape of that term. Otherwise, SPARQL does not work well. For example, the property paths you can write with * have no way of performing extra tests on the nodes you traverse, so the meaning of a chain must not be influenced by the shape of the terms on a property chain, if you want to use * in queries in a meaningful way.

This principle is also why we chose bnodes in the first place. OWL also has a standard way of encoding the information that some property has an (unspecified) value, but the encoding of this looks more like what we have in the case of negation (no value) now. If we had used this, one would need a completely different query pattern to find people with unspecified date of death and for people with specified date of death. In contrast, the current bnode encoding allows you to ask a query for everybody with a date of death without having to know if it is given explicitly or left unspecified (you don't even have to know that the latter is possible). This should be kept in mind: the encoding is not just for "use cases" where you are interested in the special situation (e.g., someone having unspecified date of death) but also for all other queries dealing with data of some kind. For this reason, the RDF structure for encoding unspecified values should as much as possible look as the cases where there are values.
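(A small sketch of the kind of query meant here, under the current bnode encoding, with wdt:P570 as the date-of-death property:)

# matches people with a date of death, whether the date is given explicitly
# (a literal) or left unspecified (currently a blank node) -- same pattern either way
SELECT ?person ?dateOfDeath WHERE {
  ?person wdt:P570 ?dateOfDeath .
}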

I am not aware of any other option for encoding "there is a value but we know nothing more about it" in RDF or OWL besides the two options I mentioned. The proposal to use a made-up IRI instead of a bnode gives identity to the unknown (even if that identity has no meaning in our data yet). It works in many unspecified-value use cases where bnodes work, but not in all. The three main confusions possible are:

  1. confusing a placeholder "unspecified" IRI with a real IRI that is expected in normal cases (imagine using a FILTER on URL-type property values),
  2. believing that the data changed when only the placeholder IRI has changed (imagine someone deleting and re-adding a qualifier with "unspecified" -- if it's a bnode, the outcome is the same in terms of RDF semantics, but if you use placeholder IRIs, you need to know their special meaning to compare the two RDF data sets correctly)
  3. accidental or deliberate uses of placeholder IRIs in other places (imagine somebody puts your placeholders as value into a URL-type property)

Case 3 can probably be disallowed by the software (if one thinks of it).

Another technical issue with the approach is that you would need to use placeholder IRIs also with datatype properties that normally require RDF literals. RDF engines will tolerate this, and for SPARQL use cases it's not a huge difference from tolerating bnodes there. But it does put the data outside of OWL, which does not allow properties to be for literals and IRIs at the same time. Unfortunately, there is no equivalent of creating a placeholder IRI for things like xsd:int or xsd:string in RDF (in OWL, you can write this with a class expression, but it will be structurally different from other cases where this data is set).

For the encoding of OWL negation, I am not sure if switching this (internal, structure) bnode to a (generated, unique) IRI would make any difference. One would have to check with the standard to see if this is allowed. I would imagine that it just works. In this case, sharing the same auxiliary IRI between all negative statements that refer to the same property should also work.

So: dropping in placeholder IRIs is the "second best thing" to encode bnodes, but it gives up several advantages and introduces some problems (and of course inevitably breaks existing queries). Before doing such a change, there should be a clearer argument as to why this would help, and in which cases. The linked PDF that is posted here for motivation does not speak about updates, and indeed if you look at Aidan's work, he has done a lot of interesting analysis with bnodes that would not make any sense without them (e.g., related to comparing RDF datasets; related to my point 2 above). I am not a big fan of bnodes either, but what we try to encode here is what they have genuinely been invented for, and any alternative also has its issues.

Please don't think or refer to the blank nodes as just "unknown values".

The term used by the Wikibase software is "somevalue". The blank nodes are now commonly used where the information *is* known, but does not have a Wikidata item. This is represented by giving the statement the magic "somevalue" status, plus adding a P1932 "stated as" qualifier to give the (known) information as a text string.
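(A hedged sketch of that pattern as a WDQS query, using P123 "publisher" purely as an example property:)

# statements whose main value is "somevalue" (a blank node today) but whose
# known name is recorded in a "stated as" (P1932) qualifier
SELECT ?item ?statedAs WHERE {
  ?item p:P123 ?statement .
  ?statement ps:P123 ?value ;
             pq:P1932 ?statedAs .
  FILTER(isBlank(?value))
}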

The fact that the UI reports the value as "unknown" is already a menace, an undesirable misrepresentation of how the value is being used. Please don't compound this by letting the characters "unk" or "unknown" anywhere near the RDF data model and the sparql interface.

Please don't think or refer to the blank nodes as "unknown values".

I fully agree. The use of the word "unknown" in the UI was a mistake that stuck. The intention was always to mean "unspecified" without any epistemic connotation. That is: an unspecified value only makes a positive statement ("there is a value for this property") and no negative one ("we [who exactly?] do not know this value").

Example of a Listeria tracking page, counting how many blank nodes are being used this way for the properties used on a particular set of items (in this case: a particular set of books, where the publisher (known) may not yet have an item, or at least not yet a matched item): https://www.wikidata.org/wiki/Wikidata:WikiProject_BL19C/titles_stmts

Yes, at the end of the day it's just using

FILTER(isBlank(?stmt_value)) .

and counting statements, so any of the routes above would work.

But please let's call them "blank values" rather than "unknown values", with functions called wikibase:isBlankValue() or wikibase:isSomeValue() rather than wikibase:isUnknownValue(). Thanks!

Why would we call them “blank values” if we’re transitioning away from blank nodes as the underlying mechanism?

Thanks for all the feedback.
I'll discard the "constant" option.

A note on the motivations:
We plan to redesign the update process as a set of trivial mutations to the graph. As far as I can see, updating a graph that contains blank nodes cannot be a "trivial operation"; citing
http://www.aidanhogan.com/docs/blank_nodes_jws.pdf (page 10, Issues with blank nodes):

Given a fixed, serialised RDF graph (i.e., a document), labelling of blank nodes can vary across parsers and across time. Checking if two representations originate from the same data thus often requires an isomorphism check, for which in general, no polynomial algorithms are known.

By making some assumptions about the Wikibase RDF model, I believe that generating a diff between two entity revisions should be relatively easy even if blank nodes are involved. The problem is applying this diff to the RDF backend: if it involves blank nodes it cannot be a set of trivial mutations (here trivial means using INSERT DATA / DELETE DATA statements). E.g. if the diff indicates that we need to remove:

wd:Q2 wdt:P576 _:genid1

because DELETE DATA is not possible with blank nodes we have to send something like

DELETE { ?s ?p ?o }
WHERE {
  wd:Q2 wdt:P576 ?o .
  FILTER(isBlank(?o))
  ?s ?p ?o
}

Which will delete all blank nodes attached to wd:Q2 by wdt:P576. I haven't checked but I hope that at most one blank node can be attached to the same subject/predicate, if not this makes the sync algorithm a bit more complex.

dcausse renamed this task from Wikibase RDF dump: stop using blank nodes for encoding unknown values and OWL constraints to Wikibase RDF dump: stop using blank nodes for encoding SomeValue and OWL constraints. Feb 17 2020, 1:29 PM
dcausse updated the task description.

I haven't checked but I hope that at most one blank node can be attached to the same subject/predicate, if not this makes the sync algorithm a bit more complex.

At least currently, this is not the case. I added a second “partner: unknown value” statement to the sandbox item, and now wd:Q4115189 wdt:P451 ?v produces two blank nodes as result.

Once we stop using blank nodes for OWL constraints, though, I believe you can at least assume that blank nodes are never the subject of a triple – would that help? (I feel like this ought to eliminate the need for a full isomorphism check from your quote.)

I haven't checked but I hope that at most one blank node can be attached to the same subject/predicate, if not this makes the sync algorithm a bit more complex.

At least currently, this is not the case. I added a second “partner: unknown value” statement to the sandbox item, and now wd:Q4115189 wdt:P451 ?v produces two blank nodes as result.

Thanks for checking. This makes the diff process and the update query a bit more complex, as now we need to track the number of blank nodes attached to a particular subject/predicate. As for the update query, I believe this is still possible with:

DELETE { ?s ?p ?o }
WHERE {
  SELECT ?s ?p ?o {
    wd:Q4115189 wdt:P451 ?o .
    FILTER(isBlank(?o))
    ?s ?p ?o
  } LIMIT 1 # number of blank nodes to delete
}

But overall this makes updating a triple with a blank node a completely separate operation that cannot be batched with plain INSERT DATA or DELETE DATA statements.

Once we stop using blank nodes for OWL constraints, though, I believe you can at least assume that blank nodes are never the subject of a triple – would that help? (I feel like this ought to eliminate the need for a full isomorphism check from your quote.)

Indeed, this plus the fact that for SomeValue all blank nodes are unique: currently, even the same "SomeValue" statement used as wdt and ps gets two different blank nodes.
From the point of view of a "simple diff operation" this is a fortunate situation, as it makes the update process simpler in the scenario where we decline this task and stick with blank nodes. If we decide to move forward with placeholder IRIs, the objects of the wdt and ps predicates of the same statement will become identical for SomeValue.

To move this forward I propose the following plan:

  1. add a wikibase:isSomeValue custom function, configurable to work as a proxy to isBlank() or to STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) (see the sketch after this list), and announce it
  2. instead of changing the RDF representation generated by Wikibase, add a new option to the updater/munger to transform (on the fly) blank nodes into placeholder IRIs
  3. set up a test instance of the query service using this proposal and ask for feedback
  4. if no major blockers are encountered we can announce that the RDF representation is about to change
  5. start emitting deprecation warnings when seeing isBlank
  6. after a deprecation period activate placeholder IRIs everywhere
  7. change the wikibase RDF representation
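To illustrate step 1 above: where the custom function is not available, a roughly equivalent test could be written as a plain prefix filter (a sketch only; the exact placeholder prefix is still being discussed, the one below is the .well-known/genid prefix from the task description):

# portable fallback, equivalent in spirit to wikibase:isSomeValue(?o)
FILTER( STRSTARTS( STR(?o), "http://www.wikidata.org/.well-known/genid/" ) )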

Well, I’d like to see what the IRIs for unknown value in qualifiers and references look like before we move ahead with this plan.

I’m also not yet sold on the rename from “unknown value” to “some value” in this more user-facing location. @Jheald, I’m aware that the snak type is also used to encode “we know the value but can’t represent it”, but do you have a source for how common this is?

(Also, the snak type is somevalue as one word, so to me isSomevalue would make more sense than isSomeValue.)

Well, I’d like to see what the IRIs for unknown value in qualifiers and references look like before we move ahead with this plan.

Sure, I tried to add some but somehow could not find my way around the UI; could you try to update the sandbox item so that we can have a look?

@Lucas_Werkmeister_WMDE The qualifier "stated as" (p1932) is currently used on 6.6 million statements. I couldn't get a query to complete to count how many of those statements have an object that's a blank node. My guess might be on the order of about 10,000 but that's just a number pulled out of the air, not based on anything. Could be a *lot* more, if this mechanism has been used eg for scientific papers with unmatched editors, publishers, etc.

(Maybe it will be easier to count under a new approach?)

The number of cases of “we know the value but can’t represent it” may soon be much bigger on Commons though, where the pattern is being used as part of an idiom for creators that don't have a Wikidata item, but are known -- including creators known only by their wiki user-names. The number of those cases -- eg self-taken pictures, self-made diagrams etc -- would probably go into the millions, once it's systematically applied.

@Lucas_Werkmeister_WMDE thanks!

Indeed, this becomes a bit more challenging as the statement identifier alone cannot be used to identify a bnode under a particular statement. I'll continue discussing this specific issue in T245541 to limit noise on this ticket.

@Jheald about blank node usage: in T239414 we investigated how blank nodes are currently used and extracted some numbers here: P9859 (count per predicate where a blank node is used as an object).

Sadly such counts won't be faster using this new proposed approach.

@Jheald about blank node usage: in T239414 we investigated how blank nodes are currently used and extracted some numbers here: P9859 (count per predicate where a blank node is used as an object).

Sorted TSV version: P10531 – the most common properties (apart from the owl:complementOf construct) are described by source (78k), publisher (58k), date of death (14k), given name (13k), and then the first qualifier use, end time (10k).

I had no luck investigating the qualifiers of those properties (assuming that some of the “unknown value” publishers, for instance, may specify the value in some qualifier, be it named as or something else) – T246238 will hopefully shed more light on this.

I've done a lot of work with GLAM data that often includes "unknown" for creator.
Getty ULAN has a whole slew of "unknowns" http://vocab.getty.edu/doc/#ULAN_Hierarchy_and_Classes (note: the counts are several years old, I imagine there are a few more thousands of those now):

  • 500355043 Unidentified Named People includes things like "the master of painting X"
  • 500125081 Unknown People by Culture includes things like "unknown Egyptian" (to be used in situations like "unknown creator, but Egyptian culture"). We've modeled those as gvp:UnknownPersonConcept and groups (schema:Organization) but users still think of them as "persons".
  • Further, there are things like "unknown but from the circle of Rembrandt" or "unknown but copy after Rembrandt" etc, about 20 varieties of them, see

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Visual_arts/Item_structure#Attribution_Qualifiers and https://www.wikidata.org/wiki/Wikidata:Property_proposal/Attribution_Qualifier

Despite the special value "unknown", actual WD usage shows there are 62k creator|author using the item wd:Q4233718 Anonymous: https://w.wiki/JVr.

I think the two special values are unfortunate because:

  • they introduce special patterns that someone writing a query needs to cater for. Eg I couldn't remember the Novalue syntax to compare the query above to one that uses Novalue
  • they don't reflect the real-life complexity needed in some cases
  • they can't be fitted easily in faceted search interfaces or semantic search UIs: one needs special coding for these special values.

Coming from CIDOC CRM, I also used to worry about the ontological impurity of "makes two unrelated unknown values equal" and "find entities which share the same value". But in practical terms, people would like to be able to search for "anonymous" and "unknown Egyptian" and are smart enough to understand that even if "anonymous" may have the most items in a collection, that doesn't make him the most prolific creator of all times.

Cheers!

In order to make it possible to update the graph without querying, you could probably adapt/tailor the com.bigdata.rdf.store.AbstractTripleStore.Options.STORE_BLANK_NODES Blazegraph option.

@Luitzen thanks for bringing this up but I haven't included this in the possible solutions because:

  • this feature does not seem to be fully integrated/finished/tested; while I was able to tell Blazegraph to store some specific bnode ids, I was never able to fully control what the id was. Sesame did seem to still generate its own id depending on the API being used (see https://jira.blazegraph.com/browse/BLZG-1915)
  • blank nodes are not allowed in DELETE / DELETE DATA SPARQL statements (even in Blazegraph with this option enabled), so I fear that low-level Blazegraph integration would have to be done to benefit from this option.
  • it's blazegraph specific

You should be aware that the functions isIRI or isLiteral (depending on property type) and datatype can also be used, and probably are used, to test whether a value is somevalue or a real value.

isLiteral should still work, right? Blank nodes aren’t literals, the replacement IRIs won’t be literals either, no change.

isIRI and datatype is a good point, though – such queries will have to be updated.

Yes, isLiteral should still work for properties where the real values are literals. Without knowing the internal workings of Blazegraph I would guess that it is more efficient than STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) . Maybe that could be used in some way?

Yes, isLiteral should still work for properties where the real values are literals. Without knowing the internal workings of Blazegraph I would guess that it is more efficient than STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) . Maybe that could be used in some way?

What we will implement internally for the isSomeValue function won't do exactly STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) but will use Blazegraph vocabulary and inlining facilities; not sure if this answers your question though.

What we will implement internally for the isSomeValue function won't do exactly STRSTARTS( STR(?o), 'http://www.wikidata.org/prop/somevalue/' ) but will use Blazegraph vocabulary and inlining facilities; not sure if this answers your question though.

Yes, thank you. I was wondering if it is better (faster) to use isLiteral than wikibase:isSomeValue where possible.

BTW, isNumeric can also be used to test if a value is numeric or a blank node, and lang can be used to test if a value is a monolingual text or a blank node. These should also still work.
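(A quick sketch, using WDQS's predefined prefixes, of which value tests should behave the same after the change and which need review, per the discussion above:)

FILTER(isLiteral(?value))   # still fine: real values stay literals, SomeValue becomes an IRI
FILTER(isNumeric(?amount))  # still fine for quantity values
FILTER(LANG(?text) != "")   # still fine for monolingual text values

FILTER(isIRI(?value))       # needs review: SomeValue will now also be an IRI
FILTER(DATATYPE(?value) = xsd:dateTime)  # needs review, as discussed above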

Many queries use the optimizer hint hint:Prior hint:rangeSafe true. when e.g. comparing date or number values with constants in a filter, as suggested at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#Fixed_values_and_ranges. Is there a risk that such queries will fail or give wrong results when somevalue becomes an IRI, and thus the values will be of different types?

Many queries use the optimizer hint hint:Prior hint:rangeSafe true. when e.g. comparing date or number values with constants in a filter, as suggested at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#Fixed_values_and_ranges. Is there a risk that such queries will fail or give wrong results when somevalue becomes an IRI, and thus the values will be of different types?

I cannot tell for sure; anything that involves query optimization via hints is by nature extremely fragile. But I believe that these kinds of queries will remain as dangerous as they were before the switch.

Multichill subscribed.

This needs community consensus before moving forward.

@Multichill the discussion seems to have stalled. Thanks to Peter the pros and cons have been well summarized now. I also understand that part of the misunderstanding of this change was the lack of clarity on the motivations as to why we require a breaking change like that. I hope it has been addressed in the linked discussion.
Do you have additional comments to make here? Thanks!

I don't understand why it was considered necessary to make a breaking change to the RDF dump to improve WDQS performance when there is a solution that does not make a breaking change to the dump.

dcausse renamed this task from Wikibase RDF dump: stop using blank nodes for encoding SomeValue and OWL constraints to Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS. Apr 30 2020, 5:12 PM
dcausse updated the task description.

I don't understand why it was considered necessary to make a breaking change to the RDF dump to improve WDQS performance when there is a solution that does not make a breaking change to the dump.

It was not considered "necessary"; it was considered "preferable" in discussions we had while drafting this 4-step plan. The sole reason was to limit the divergences between WDQS results and the dumps (current divergences are listed here).
I'm perfectly fine with dropping this step (assuming others agree) if it causes too much annoyance, and only deprecating blank nodes at the WDQS level (I've updated the description of this ticket to reflect the discussions we had on the wiki page).

My view is that fewer breaking changes are to be preferred, and breaking changes in fewer "products" are to be even more preferred. So, again, I wonder why there is a breaking change proposed for the RDF dump instead of no breaking changes or limiting breaking changes to the WDQS only.

My view is that fewer breaking changes are to be preferred, and breaking changes in fewer "products" are to be even more preferred. So, again, I wonder why there is a breaking change proposed for the RDF dump instead of no breaking changes or limiting breaking changes to the WDQS only.

While backward compatibility is important, it isn't the only consideration. As pointed out by @dcausse on wiki, having divergence between Wikidata and WDQS is also problematic. We already have a small number of documented divergences and increasing this number is also problematic. Given the current discussion, it seems that keeping as much backward compatibility as possible at the cost of divergence between Wikidata and WDQS is the way to go.

Given the current discussion, it seems that keeping as much backward compatibility as possible at the cost of divergence between Wikidata and WDQS is the way to go.

I strongly disagree – we should apply the change to Wikibase as well. Increasing long-term divergence between the query service and the RDF output or dumps will make working with both of them harder.

If divergence between Wikidata and WDQS is bad, then this proposed change has another bad feature as it turns the some value snaks into something that is less like an existential. And this proposed change is for both the RDF dump and the WDQS.
And then there is the problem of the proposed change requiring changes to SPARQL queries - not just a change, but a change from how SPARQL queries are written in just about any other context.

And then there is the problem of the proposed change requiring changes to SPARQL queries - not just a change, but a change from how SPARQL queries are written in just about any other context.

In what other context do you write SPARQL queries about Wikibase SomeValue snaks?

Is anyone proposing a change to Wikibase (or Wikidata)?

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata. There are people (myself included) who use the RDF dump without using the WDQS (much).

Is anyone proposing a change to Wikibase (or Wikidata)?

Yes – the goal is that the RDF in the query service, the RDF dumps, and the output of Special:EntityData all change. (Special:EntityData isn’t explicitly mentioned in the task description, but I assume it should change together with the dumps.) Not all at the same time, but in the end they should be consistent again, at least with regards to their handling of SomeValue snaks and OWL constraints (notwithstanding other differences).

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata.

Can you clarify what you mean here by “Wikidata”?

The difference is not with other SPARQL queries in the WDQS but against SPARQL queries in general (including SPARQL queries that use Wikidata URLs).

Of course, there are already are a few important differences between WDQS queries and SPARQL queries against most other RDF KBs.

I would view the proposed change as having the negative outcome that the RDF dump moves further from Wikidata.

Can you clarify what you mean here by “Wikidata”?

From https://www.wikidata.org/wiki/Wikidata:Main_Page: the free knowledge base with 84,918,558 data items that anyone can edit.
So I don't count the RDF dump or WDQS, but I do count Wikibase and its data model.

@Multichill the discussion seems to have stalled. Thanks to Peter the pros and cons have been well summarized now. I also understand that part of the misunderstanding of this change was the lack of clarity on the motivations as to why we require a breaking change like that. I hope it has been addressed in the linked discussion.
Do you have additional comments to make here? Thanks!

See the recent comments. You need to get community consensus before doing any (major) changes.

Is anyone proposing a change to Wikibase (or Wikidata)?

Yes – the goal is that the RDF in the query service, the RDF dumps, and the output of Special:EntityData all change.

Absolutely, in this context a change of the RDF dump implies a change on wikibase output for Special:EntityData and RDF formats.

If divergence between Wikidata and WDQS is bad, then this proposed change has another bad feature as it turns the some value snaks into something that is less like an existential. And this proposed change is for both the RDF dump and the WDQS.

Quoting RDF 1.1 Concepts and Abstract Syntax - 3.5 Replacing Blank Nodes with IRIs:

This transformation does not appreciably change the meaning of an RDF graph, provided that the Skolem IRIs do not occur anywhere else. It does however permit the possibility of other graphs subsequently using the Skolem IRIs, which is not possible for blank nodes.

One could also argue that this change may lead to a positive outcome, as it allows other graphs to subsequently use these skolem IRIs.

Additionally "unskolemizing" these IRIs is a trivial step that could be added to any import process reading the wikibase RDF output and willing to switch back to blank nodes.

If 'unskolemizing' is a trivial step then that should be implemented by WDQS, instead of pushing it to every consumer (including indirect consumers) of Wikidata information, if this change is simply a change to make WDQS work faster.

If, on the other hand, there are other reasons to make a breaking change to the Wikidata RDF dump then there should be a proposal to make such changes independent of making WDQS faster.

If 'unskolemizing' is a trivial step then that should be implemented by WDQS, instead of pushing it to every consumer (including indirect consumers) of Wikidata information, if this change is simply a change to make WDQS work faster.

WDQS does need to "skolemize", not "unskolemize", but in the end this is the same discussion as pondering whether or not we want WDQS to stay close to the Wikibase RDF output by moving the "skolemization" before the RDF output is generated.

Yes, this change is only meant to make the update process faster by removing the complexity and cost induced by tracking blank nodes. Since all edits to Wikidata now depend directly on the efficiency of this process, we believe it is worth this breaking change.

I was completely unaware that WDQS is so integrated into the inner workings of Wikidata. Where is this described? Was this mentioned in the announcement of the proposed change?

In any case there appears to be a reasonable path forward that makes fewer breaking changes.

I was completely unaware that WDQS is so integrated into the inner workings of Wikidata. Where is this described? Was this mentioned in the announcement of the proposed change?

Details on the motivations and the context around this change were clearly lacking in the initial announcement; this is something we will be careful about the next time we communicate about this.
The way WDQS integrates with Wikidata editing/tooling workflows is a bit out of our control, and I'm not sure that comprehensive and exhaustive documentation about it exists (far from ideal, but searching for "wdqs lag" in the Wikidata namespace might give some sense of the problems it can cause for contributors).

Based on a quick look at various Phabricator tickets and other information it appears to me that the only connection between the WDQS and Wikidata edit throttling is that a slowness parameter for the WDQS is used to modify a Wikidata parameter that is supposed to be checked by bots before they make edits. Further, it appears that the only reason for this connection is to slow down Wikidata edits so that the WDQS can keep up - the WDQS does not feed back into Wikidata edits, even edits by bots. So this connection could be severed by a trivial change to Wikidata and the only effect would be that the WDQS KB might lag behind Wikidata, either temporarily or permanently, and queries to the WDQS might become slow or even impossible without improvements to the WDQS infrastructure. I thus view it misleading to state in this Phabricator ticket that "performance issues [of the WDQS] cause edits on wikidata to be throttled", which gives the impression that the WDQS forms a part of the Wikidata editing process or some other essential part of Wikidata itself.

There needs to be a very strong rationale to make breaking changes to the Wikidata RDF dump. Just improving the performance of the WDQS is not enough for me.

I thus view it misleading to state in this Phabricator ticket that "performance issues [of the WDQS] cause edits on wikidata to be throttled", which gives the impression that the WDQS forms a part of the Wikidata editing process or some other essential part of Wikidata itself.

Including WDQS lag in the Wikibase maxlag has been done for reasons; challenging those reasons is out of scope for this ticket and questions should be asked on T221774. De facto this makes WDQS an essential part of Wikidata itself, and it is one of the reasons our team's work on redesigning the update process between WDQS and Wikidata has been prioritized.

Gehel triaged this task as High priority. Sep 15 2020, 8:01 AM

For reasons that I believe have to do with additional data not changing already inferred facts (AKA monotonicity), certain OWL constructs MUST be expressed as blank nodes. I think it's a great idea to remove blank nodes wherever possible in Wikidata, but if you want the data to be conformant with OWL (and thus work in e.g. Protégé, OWLAPI, and some other tools), I believe you are stuck using blank nodes for some OWL expressions.

For reasons that I believe have to do with additional data not changing already inferred facts (AKA monotonicity), certain OWL constructs MUST be expressed as blank nodes. I think it's a great idea to remove blank nodes wherever possible in Wikidata, but if you want the data to be conformant with OWL (and thus work in e.g. Protégé, OWLAPI, and some other tools), I believe you are stuck using blank nodes for some OWL expressions.

This was indeed brought up in earlier discussions. My understanding is that the Wikidata RDF representation includes some OWL expressions as part of its RDF but is not explicitly defined as being OWL compliant (at least not explicitly, per https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format).
The removal of blank nodes will follow https://www.w3.org/TR/rdf11-concepts/#section-skolemization, and a user wanting strict OWL semantics from the RDF produced by Wikibase can easily get it by transforming the skolem IRIs (.well-known IRIs) back into blank nodes.
This probably deserves its own discussion, but I'm curious about OWL use cases that rely on the existing OWL constraints present in the Wikidata RDF. If you are aware of such use cases I'd love to hear them.

@ericP Wikidata doesn't use OWL axioms. It uses blank nodes only for the special values "unknown" and "no value".

Started some documentation about the change at https://www.mediawiki.org/wiki/Wikidata_Query_Service/Blank_Node_Skolemization, comments/suggestions are welcome.

It’s probably worth mentioning in that documentation that this change applies not just to the query service but also to the RDF dumps and Special:EntityData. Otherwise, it looks good to me :)

It’s probably worth mentioning in that documentation that this change applies not just to the query service but also to the RDF dumps and Special:EntityData. Otherwise, it looks good to me :)

Good point, I added some notes about this, thanks!