Help:CirrusSearch elasticsearch replicas

{{Cloud VPS nav}}
'''Cloud Elastic''' is a replica of the [[mw:Extension:CirrusSearch|CirrusSearch]] Elasticsearch indices made available to Wikimedia Cloud Services applications (both Cloud VPS and Toolforge). These servers are not generally accessible from the internet at large; they are only accessible to applications running inside Cloud Services. Applications can use the full power of the [https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl.html Elasticsearch search APIs] to query the search indices in ways that CirrusSearch does not expose directly on the wikis themselves.


=== Accessing ===
There are three clusters, named ''chi'', ''psi'' and ''omega''. ''chi'' contains approximately the 200 largest wikis. ''psi'' and ''omega'' contain equal splits of the remaining smaller wikis. The assignment of wikis to clusters is fixed and is not expected to change.

{|
|-
! Cluster Name
! URL
|-
| chi
|<code>http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/</code>
|-
| psi
|<code>http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8643/</code>
|-
| omega
|<code>http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8443/</code>
|}


Clusters can be accessed through each other using the Elasticsearch [https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-cross-cluster-search.html cross cluster search] syntax. For example, ''labswiki'' (wikitech's internal database name), which lives on the omega cluster, can be queried through the chi cluster with:

<code>curl -XGET 'http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/omega:labswiki/_search?q=example'</code>

A plausible method to programmatically connect to the right cluster is to fetch the <code>/_aliases</code> endpoint from each cluster. The cluster that contains indices for a wiki will have an alias matching the internal database name of the wiki. This alias points to all related indices, such as ''labswiki_content_<ts>'' and ''labswiki_general_<ts>''. There are additional single-index aliases that map from a generic name, like ''labswiki_content'', to the exact index used, such as ''labswiki_content_123456789''. Applications should always access indices through aliases to ensure a clean switchover when indices are rebuilt for operational reasons.


=== Indices Available ===

All wikis have two indices, named <code>&lt;dbname&gt;_content</code> and <code>&lt;dbname&gt;_general</code>. The content index contains all of the content namespaces of the wiki; the general index contains everything else. For example, on Wikipedias, articles are found in the content index and talk pages in the general index. Both indices can be queried at once through an alias by providing only the wiki's database name.

The set of indices that exist in a cluster can be queried through the elasticsearch [https://www.elastic.co/guide/en/elasticsearch/reference/6.5/cat-indices.html cat indices] API.

<code>curl -XGET http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/_cat/indices</code>

=== Schema ===

See [[mw:Extension:CirrusSearch/Schema]].


=== Example Use Cases ===

==== Query all indices ====

<code>curl -XGET 'http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/*,*:*/_search?q=example'</code>

==== Query all content indices ====

<code>curl -XGET 'http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/*_content,*:*_content/_search?q=example'</code>

==== Fetch full document for single page by page id ====
<code>curl -XGET http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/enwiki_content/_doc/33179123</code>

==== Fetch full document for single page by title ====
<code>curl -XGET 'http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/enwiki_content/_search?q=title.keyword:Elasticsearch'</code>

==== Fetch full document for page by approximate page title ====

This is the underlying functionality that powers 'go directly to page' of the wiki autocomplete box. Target title: Ñuñoa

<code>curl -XGET 'http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/enwiki_content/_search?q=title.nearmatch=nunoa'</code>

==== Count words in a wiki ====

This demonstrates sending a full JSON query in the GET body, and extracting only part of the result using [https://github.com/stedolan/jq/ jq].

<code>curl -s -XGET -H 'Content-Type: application/json' -d '{"query":{"bool":{"filter":[{"terms":{"namespace":[0]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}' http<nowiki>s</nowiki>://cloudelastic.wikimedia.org:8243/enwiki_content/_search | jq -r .aggregations.word_count.value</code>

(Note that the namespace filter should potentially be adjusted for other wikis.)

== See also ==
* [[Help:Toolforge/Elasticsearch]]: read/write Elasticsearch service for [[Portal:Toolforge|Toolforge]] tools

{{:Help:Cloud Services communication}}


[[Category:Documentation]]
[[Category:Cloud VPS]]
[[Category:Cloud Services]]

Latest revision as of 20:07, 28 November 2023

Cloud Elastic is a replica of the CirrusSearch Elasticsearch indices made available to Wikimedia Cloud Services applications (both Cloud VPS and Toolforge). These servers are not generally accessible from the internet at large; they are only accessible to applications running inside Cloud Services. Applications can use the full power of the Elasticsearch search APIs to query the search indices in ways that CirrusSearch does not expose directly on the wikis themselves.

Accessing

There are three clusters, named chi, psi and omega. chi contains approximately the 200 largest wikis. psi and omega contain equal splits of the remaining smaller wikis. The assignment of wikis to clusters is fixed and is not expected to change.

Cluster Name URL
chi https://cloudelastic.wikimedia.org:8243/
psi https://cloudelastic.wikimedia.org:8643/
omega https://cloudelastic.wikimedia.org:8443/

Clusters can be accessed through each other using the elasticsearch cross cluster search syntax. For example labswiki (wikitech's internal database name), which lives on the omega cluster, can be queried through the chi cluster with:

curl -XGET 'https://cloudelastic.wikimedia.org:8243/omega:labswiki/_search?q=example'
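The cross cluster syntax amounts to prefixing the index name with the name of the cluster that holds it. A minimal Python sketch of building such URLs follows; the helper name `search_url` and its parameters are illustrative assumptions, not part of any Cloud Elastic client, and the ports are taken from the table above.

```python
from urllib.parse import quote, urlencode

# Host and ports as listed in the cluster table of this page.
HOST = "cloudelastic.wikimedia.org"
CLUSTER_PORTS = {"chi": 8243, "psi": 8643, "omega": 8443}


def search_url(via, index, home=None, q=None):
    """Build a _search URL for a request sent to the cluster `via`.

    `search_url` is a hypothetical helper. If `home` names a different
    cluster than `via`, the index is prefixed with "<home>:" so that the
    request is forwarded using cross cluster search.
    """
    target = index if home in (None, via) else f"{home}:{index}"
    # Keep ':', '*' and ',' unescaped; they are part of the index syntax.
    url = f"https://{HOST}:{CLUSTER_PORTS[via]}/{quote(target, safe=':*,')}/_search"
    if q is not None:
        url += "?" + urlencode({"q": q})
    return url


# Rebuilds the URL used in the curl example above.
print(search_url("chi", "labswiki", home="omega", q="example"))
```

The function only builds the URL; the request itself still has to be sent from inside Cloud Services.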

A plausible method to programmatically connect to the right cluster is to fetch the /_aliases endpoint from each cluster. The cluster that contains indices for a wiki will have an alias matching the internal database name of the wiki. This alias points to all related indices, such as labswiki_content_<ts> and labswiki_general_<ts>. There are additional single-index aliases that map from a generic name, like labswiki_content, to the exact index used, such as labswiki_content_123456789. Applications should always access indices through aliases to ensure a clean switchover when indices are rebuilt for operational reasons.
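The alias lookup described above could be sketched in Python roughly as follows. This assumes the usual shape of the Elasticsearch aliases response (a mapping from index name to an object with an "aliases" key); the function names `fetch_aliases` and `cluster_for`, and the sample index name enwiki_content_1, are hypothetical.

```python
import json
from urllib.request import urlopen

# Cluster base URLs as listed in the table on this page.
CLUSTERS = {
    "chi": "https://cloudelastic.wikimedia.org:8243",
    "psi": "https://cloudelastic.wikimedia.org:8643",
    "omega": "https://cloudelastic.wikimedia.org:8443",
}


def fetch_aliases(base_url, timeout=10):
    """Fetch /_aliases from one cluster (only works inside Cloud Services)."""
    with urlopen(f"{base_url}/_aliases", timeout=timeout) as resp:
        return json.load(resp)


def cluster_for(dbname, aliases_by_cluster):
    """Return the name of the cluster that has an alias equal to `dbname`.

    `aliases_by_cluster` maps cluster name to a parsed /_aliases response,
    assumed to look like {"<index>": {"aliases": {"<alias>": {}, ...}}}.
    Returns None when no cluster carries the alias.
    """
    for cluster, aliases in aliases_by_cluster.items():
        for index_info in aliases.values():
            if dbname in index_info.get("aliases", {}):
                return cluster
    return None


# Offline sample data in the assumed response shape (index names are
# illustrative; enwiki_content_1 is made up for this sketch).
sample = {
    "chi": {"enwiki_content_1": {"aliases": {"enwiki": {}, "enwiki_content": {}}}},
    "omega": {
        "labswiki_content_123456789": {
            "aliases": {"labswiki": {}, "labswiki_content": {}}
        }
    },
}
print(cluster_for("labswiki", sample))
```

From inside Cloud Services, the live equivalent would be `cluster_for("labswiki", {name: fetch_aliases(url) for name, url in CLUSTERS.items()})`. Because the assignment of wikis to clusters is not expected to change, the result can reasonably be cached.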

Indices Available

All wikis have two indices, named <dbname>_content and <dbname>_general. The content index contains all of the content namespaces of the wiki; the general index contains everything else. For example, on Wikipedias, articles are found in the content index and talk pages in the general index. Both indices can be queried at once through an alias by providing only the wiki's database name.

The set of indices that exist in a cluster can be queried through the elasticsearch cat indices API.

curl -XGET https://cloudelastic.wikimedia.org:8243/_cat/indices
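Once the index names are available (for instance by requesting the cat indices API with `?format=json&h=index`, which should return a list of objects with an "index" key), they can be grouped per wiki. The sketch below assumes the naming pattern described on this page, `<dbname>_content` and `<dbname>_general` with an optional numeric suffix; the regular expression and the function name `group_by_wiki` are illustrative, and indices that do not fit the pattern are simply skipped.

```python
import re
from collections import defaultdict

# <dbname>_content or <dbname>_general, optionally followed by _<digits>.
INDEX_RE = re.compile(r"^(?P<dbname>.+)_(?P<kind>content|general)(?:_(?P<ts>\d+))?$")


def group_by_wiki(index_names):
    """Group index names by wiki database name and index kind.

    Returns e.g. {"labswiki": {"content": "...", "general": "..."}}.
    Names that do not match the assumed pattern are ignored.
    """
    wikis = defaultdict(dict)
    for name in index_names:
        match = INDEX_RE.match(name)
        if match:
            wikis[match["dbname"]][match["kind"]] = name
    return dict(wikis)


names = [
    "labswiki_content_123456789",
    "labswiki_general_123456789",
    "some_other_index",  # hypothetical name that does not fit the pattern
]
print(group_by_wiki(names))
```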

Schema

See mw:Extension:CirrusSearch/Schema.

Example Use Cases

Query all indices

curl -XGET 'https://cloudelastic.wikimedia.org:8243/*,*:*/_search?q=example'

Query all content indices

curl -XGET 'https://cloudelastic.wikimedia.org:8243/*_content,*:*_content/_search?q=example'

Fetch full document for single page by page id

curl -XGET https://cloudelastic.wikimedia.org:8243/enwiki_content/_doc/33179123

Fetch full document for single page by title

curl -XGET 'https://cloudelastic.wikimedia.org:8243/enwiki_content/_search?q=title.keyword:Elasticsearch'

Fetch full document for page by approximate page title

This is the underlying functionality that powers 'go directly to page' of the wiki autocomplete box. Target title: Ñuñoa

curl -XGET 'https://cloudelastic.wikimedia.org:8243/enwiki_content/_search?q=title.nearmatch=nunoa'

Count words in a wiki

This demonstrates sending a full JSON query in the GET body, and extracting only part of the result using jq.

curl -s -XGET -H 'Content-Type: application/json' -d '{"query":{"bool":{"filter":[{"terms":{"namespace":[0]}}]}},"aggs":{"word_count":{"sum":{"field":"text.word_count"}}},"stats":["sum_word_count"]}' https://cloudelastic.wikimedia.org:8243/enwiki_content/_search | jq -r .aggregations.word_count.value

(Note that the namespace filter should potentially be adjusted for other wikis.)
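For applications that build this request in code, the same query body can be generated with the namespace list as a parameter. The following Python sketch mirrors the JSON used in the curl example; the function names `word_count_query` and `extract_word_count` are assumptions for illustration, and the response at the end is a made-up sample containing only the part that the jq filter reads.

```python
import json


def word_count_query(namespaces=(0,)):
    """Build the aggregation query from the curl example above.

    `namespaces` lists the namespace ids to include; adjust it for wikis
    whose content lives in other namespaces.
    """
    return {
        "query": {"bool": {"filter": [{"terms": {"namespace": list(namespaces)}}]}},
        "aggs": {"word_count": {"sum": {"field": "text.word_count"}}},
        "stats": ["sum_word_count"],
    }


def extract_word_count(response):
    """Equivalent of the jq filter .aggregations.word_count.value."""
    return response["aggregations"]["word_count"]["value"]


# Serialises to the same JSON string that is passed to curl -d above.
print(json.dumps(word_count_query(), separators=(",", ":")))

# Hypothetical response fragment, not real data from any wiki.
sample_response = {"aggregations": {"word_count": {"value": 123.0}}}
print(extract_word_count(sample_response))
```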

See also

Help:Toolforge/Elasticsearch: read/write Elasticsearch service for Toolforge tools

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Stay aware of critical changes and plans
Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)