Search/Trouble

Here's a bunch of ways ES can get into trouble (or has gotten into trouble before) and ways you can help diagnose and/or fix them:

Stuck in red

The clusters can get stuck in the red state if none of the copies of a shard thinks it can safely be promoted to primary. You can unstick them with Elasticsearch's Cluster Reroute API by setting allow_primary. This will likely cause data loss, but that is worth it because we're busted anyway. It will unstick all the stuck shards.

First, check if we have https://github.com/elasticsearch/elasticsearch/issues/4206 merged. If so, use it.

Otherwise, use this (first replace $node with a node name, which you can get from _cluster/state?pretty):

function reroute() {
    # $1 = index name, $2 = shard number; $node must be set to a node name first
    curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
        "commands" : [ {
                "allocate" : {
                    "index" : "'$1'",
                    "shard" : '$2',
                    "allow_primary" : true,
                    "node" : "'$node'"
                }
            }
        ]
    }' > /dev/null
    sleep 1
}
curl -s localhost:9200/_cluster/state?pretty | awk '
    BEGIN {more=1}
    {if (/"UNASSIGNED"/) unassigned=1}
    {if (/"routing_nodes"/) more=0}
    {if (unassigned && /"shard"/) shard=$3}
    {if (more && unassigned && /"index"/) {print "reroute",$3, shard; unassigned=false}}
' > runit
source runit

Note that if you try to assign too many shards to one node too quickly it'll fail. If that happens, pick a new node name and rerun this. It's a pretty blunt instrument.

After assigning the shards you should be able to wait for the cluster to turn green. Once green, you should then reindex the affected wiki. It is OK to use the lazy version of the reindex:

mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki

Stuck in yellow

The cluster can get stuck in yellow because there aren't enough servers to take all the replicas. This could be caused by the disks getting too full or by us requesting more replicas than we have servers. This doesn't _need_ to be fixed right away but should be addressed quickly. The root cause should be found soon and, at this point in the project life cycle, that probably means finding Nik or Chad.
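
To check those two likely causes quickly, the _cat APIs are handy. A minimal sketch (the index name is just a placeholder; grab a real one from _cat/indices):

curl -s 'localhost:9200/_cat/allocation?v'     # disk used vs. available per node
curl -s 'localhost:9200/enwiki_content_123456/_settings?pretty' | grep number_of_replicas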

You can see which wikis have unassigned shards by running this:

curl -s localhost:9200/_cluster/state?pretty | awk '
    BEGIN {more=1}
    {if (/"UNASSIGNED"/) unassigned=1}
    {if (/"routing_nodes"/) more=0}
    {if (more && unassigned && /"index"/) {
        if ($3 != last_index) {
            last_index=$3
            print last_index,count
            count=0
        }
        count=count+1
    }}
    END {print last_index,count}
'

I've also seen a cluster get stuck in yellow (though I suppose red is possible) when there are errors moving shards around. You can check this by tailing the logs in /var/log/elasticsearch/ and looking for exceptions. Figuring out where to go from there might require help from the #elasticsearch IRC channel (on Libera Chat, formerly Freenode), digging through Elasticsearch code, or just lucky instincts. The issue that I saw was from deserializing a Lucene version constant. That implied that my mixed-version cluster wasn't functioning properly, so I forced all data away from the old-versioned nodes.
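
If you end up needing to do the same, one way to force data away from particular nodes is the cluster allocation exclude setting. A sketch (the node names are placeholders; remember to clear the setting once the problem is resolved):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._name" : "elastic1001,elastic1002"
    }
}'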

Reindexing based on a query

Sometimes we'll have a bug which requires reindexing a list of pages that can be easily generated using an Elasticsearch query. This is how you do that:

curl -XPOST testsearch1001:9200/_search?pretty -d'
{
   "filter": {
      "missing": {
          "field": "text_words"
      }
   },
   "size": 100000
}' > bad
perl -ne 'if (/"_index" : "(.*)_[^_]+_[^_]+/) {$wiki = $1} elsif(/"_id" : "(.+)"/) {$fromId = $1 - 1; print "echo $wiki; mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --fromId $fromId --toId $1\n"}' bad | grep -v labswiki > badfix
bash badfix 2>&1 | tee badfix.log

This reindexes the first 100000 bad pages. Just run it over and over again until it stops finding pages to fix. If it never stops finding pages, you likely have some other kind of bug to deal with.

Note that the query used here is "missing": { "field": "text_words" }. Unless you are fixing the same problem you'll have to substitute your own query.
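
For example, if the bug were in a different field you'd swap the filter; this purely hypothetical variant would find pages missing opening_text:

curl -XPOST testsearch1001:9200/_search?pretty -d'
{
   "filter": {
      "missing": {
          "field": "opening_text"
      }
   },
   "size": 100000
}' > bad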

Updates don't go anywhere and Cirrus's jobs look stuck

If the Cirrus jobs crash (php fatal) during execution they can get stuck in showJobs.php's active state even though they aren't really active. It'll look like this:

$ mwscript maintenance/showJobs.php --wiki <somewiki> --group
cirrusSearchLinksUpdate: 510 queued; 281 claimed (281 active, 0 abandoned)

Those jobs aren't really active, but they are sitting around screwing up statistics. First, find and fix the crash problem; try mwlog1001:/a/mw-search/fatal.php. To clean up the statistics, so far I've always deleted all the jobs:

echo "JobQueueGroup::singleton()->get( 'cirrusSearchDeletePages' )->delete(); JobQueueGroup::singleton()->get( 'cirrusSearchLinksUpdate' )->delete(); JobQueueGroup::singleton()->get( 'cirrusSearchUpdatePages' )->delete();" |
  mwscript maintenance/eval.php --wiki <somewiki>

but that can kill in flight Cirrus jobs. The statistics may fix themselves after some time. I'm not sure.

Resharding

If some shards take _forever_ to move from node to node then they are probably too large. Figure out the appropriate number of shards using #If_it_has_been_indexed, update the configuration as discussed in #Estimating_the_number_of_shards_required, sync the configuration in a deployment window, then perform an in place reindex. Once this is done the wiki should have the appropriate number of shards and they should move around the cluster reasonably quickly.
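
For reference, the in place reindex itself is the usual Cirrus maintenance invocation; roughly something like this (a sketch from memory, so check the script's --help before running it):

mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki $wiki \
    --reindexAndRemoveOk --indexIdentifier now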

Crashed node stuck in cluster

I'm not sure how or why, but sometimes when a node crashes it gets stuck in the cluster. Because we run with 2 replicas, one node like this isn't going to cause much trouble, but it should be removed as soon as possible so the cluster can restore full redundancy. If the node is going to come back online and join the cluster again, that will fix the problem. If it isn't, then you need to restart the current master. Look up the current master in the status API, make sure that there are two other master eligible nodes online, ssh into the current master, and restart Elasticsearch.
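
One quick way to see which node is currently master (a sketch; the full cluster state API shows the same thing):

curl -s 'localhost:9200/_cat/master?v'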

Master eligible node crashed and won't come back

Normally if a node disappears from the cluster you don't have to worry about anything beyond capacity. If the node that crashed is master eligible, though, you'll have to nominate another node to be master eligible. This is because the cluster refuses to do anything if it can't find two master eligible nodes online. We normally run with three master eligible nodes so we can suffer one broken master at a time. Anyway, you can nominate another node by editing role::elasticsearch::config in the puppet repository: find the section that defines $master_eligible, remove the old node from the list, and add the new one. Once that update is ready to run, ssh into the machine, sudo puppetd -tv, and perform a #Rolling Restarts on the node.

If you need to nominate a node _right now_ you can edit /etc/elasticsearch/elasticsearch.yml, set node.master to true, and /etc/init.d/elasticsearch restart. If you have time to do a rolling restart (30 minutes), do that instead of a hard restart.
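
Concretely, that looks something like this (a rough sketch, assuming the config and init script paths mentioned above):

# /etc/elasticsearch/elasticsearch.yml -- make sure this line is present and set to true
node.master: true

# then hard restart the service (prefer a rolling restart if you have the 30 minutes)
sudo /etc/init.d/elasticsearch restart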

RemoteTransportException looping between 2 nodes

In some rare cases, when a node leaves the cluster (node crashing or cluster restart), a RemoteTransportException seems to bounce between 2 nodes. This ends up generating a huge stack trace, filling up the logs (/var/log/elasticsearch) and causing a stack overflow. The cluster recovers by itself after the affected node crashes, but the huge logs might need cleanup (see #Cleanup_logs below). This issue is known and documented on GitHub. It will be fixed by upgrading to Elasticsearch > 2.4 (the upgrade is delayed at the moment due to an unrelated issue).

High load

If the Elasticsearch cluster is under high load (not sure how you define "high" at this point) these are likely suspects:

  1. We're performing an "in place" reindex of a big wiki. Those spike the load higher than normal. We can control how many processes we use for this, and if the load is too high we can lower the number. This process can't be paused and restarted, but it is fast. I imagine by the time you notice the load is too high the reindex will be done, so we'll just have to use fewer processes next time.
  2. We're performing the second pass of the first index of a wiki (it is two phases) in too many concurrent threads. This phase queues jobs that are much more work for Elasticsearch than the job queue, so we really should only ever have one process doing it at a time.
  3. We just deployed some code that made queries less efficient. If this is the case we'll need output from the hot threads api: curl -s 'localhost:9200/_nodes/hot_threads?pretty' | less. Reading that requires knowledge of Elasticsearch's implementation but it is a wonderful tool for figuring out what is going on.

If none of the above has happened we'll need the output from the hot threads api even more because it'll provide our only clue for what is going on.
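
It's worth snapshotting that output to a file right away so it can be shared and compared over time; something like:

# Capture hot threads for later analysis (the file name is just a suggestion)
curl -s 'localhost:9200/_nodes/hot_threads?threads=3' > /tmp/hot_threads.$(date +%s)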

First pass indexing or regular reindexing shouldn't cause high load on Elasticsearch as both are much more work on the job queue.

Unbalanced numbers of shards/Constantly moving shards around/One node using much more disk than the others

Elasticsearch really doesn't like it when some nodes have far fewer shards than other nodes. This can happen if you move shards from node to node to node like we do during a rolling restart. The symptoms are that Elasticsearch will show that it is _constantly_ moving shards around and, if you pay close attention, you'll see it swap a shard from Node A to Node B and then from Node B to Node A. Insanity.

You can actually count the number of shards on each server using the count function defined (way) below.
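
If the count function isn't handy, a rough equivalent built on the _cat shards API looks like this (a sketch; it only counts STARTED shards):

function count() {
    # Count STARTED shards per node, most loaded first
    curl -s 'localhost:9200/_cat/shards' | awk '$4 == "STARTED" {print $NF}' | sort | uniq -c | sort -rn
}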

Find the machine with too few shards and start moving large shards off of it. This will get you a list of the shards on the machine:

curl -s localhost:9200/_cluster/state?pretty | awk '
    BEGIN {more=1}
    {if (/"nodes"/) nodes=1}
    {if (/"metadata"/) nodes=0}
    {if (nodes && !/"name"/) {node_name=$1; gsub(/[",]/, "", node_name)}}
    {if (nodes && /"name"/) {name=$3; gsub(/[",]/, "", name); node_names[node_name]=name}}
    {if (/"node"/) {node=$3; gsub(/[",]/, "", node)}}
    {if (/"shard"/) {shard=$3; gsub(/[",]/, "", shard)}}
    {if (more && /"index"/) {
        index_name=$3
        gsub(/[",]/, "", index_name)
        print "node="node_names[node]" shard="index_name":"shard
    }}                               
' | sort | grep $server | less

This will get you a list of large shards:

curl -s localhost:9200/_stats?level=shards > /tmp/stats
jq '.indices | keys[] as $index | {
       index: $index,
       shards: ([.[$index].shards[]] | length),
       average_size: ([.[$index].shards[][].store.size_in_bytes] | add / length / 1024 / 1024 / 1024)
   }
| select(.average_size > 2)' /tmp/stats | jq -s 'sort_by(.average_size)'

Now build the command to make Elasticsearch move the shards:

curl -s -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
    "commands" : [
        { "move" : {
            "index" : "commonswiki_file_1388437679",
            "shard" : 10,
            "from_node" : "elastic1008",
            "to_node" : "elastic1001"
        } },
        { "move" : {
            "index" : "commonswiki_file_1388437679",
            "shard" : 13,
            "from_node" : "elastic1008",
            "to_node" : "elastic1005"
        } },
        { "move" : {
            "index" : "commonswiki_file_1388437679",
            "shard" : 17,
            "from_node" : "elastic1008",
            "to_node" : "elastic1011"
        } },
        { "move" : {
            "index" : "=ruwikisource_general_1388466902",
            "shard" : 0,
            "from_node" : "elastic1008",
            "to_node" : "elastic1010"
        } }
    ]
}' | head -n 20

Now wait for that to finish and then for Elasticsearch to start moving shards back to that machine. The shards it picks to move back _should_ be random so if you repeat this you'll eventually balance the number of shards out. Once Elasticsearch stops swapping shards back and forth you are free to leave this alone.
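
You can watch the shuffling with the cluster health API; relocating_shards drops back to 0 when Elasticsearch is done:

watch 'curl -s localhost:9200/_cluster/health?pretty | grep -E "status|relocating_shards"'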

This is tracked as an issue in Elasticsearch: https://github.com/elasticsearch/elasticsearch/issues/4790

Other useful functions

I'm not 100% sure what all these will be useful for, but they were useful when replication was having trouble:

function initializing() {
    curl -s localhost:9200/_cluster/state?pretty | awk '
        BEGIN {more=1}
        {if (/"nodes"/) nodes=1}
        {if (/"metadata"/) nodes=0}
        {if (nodes && !/"name"/) {node_name=$1; gsub(/[",]/, "", node_name)}}
        {if (nodes && /"name"/) {name=$3; gsub(/[",]/, "", name); node_names[node_name]=name}}
        {if (/"INITIALIZING"/) initializing=1}
        {if (/"routing_nodes"/) more=0}
        {if (/"node"/) {from_node=$3; gsub(/[",]/, "", from_node)}}
        {if (/"shard"/) {shard=$3; gsub(/[",]/, "", shard)}}
        {if (more && initializing && /"index"/) {
            index_name=$3
            gsub(/[",]/, "", index_name)
            print "from="node_names[from_node]" shard="index_name":"shard
            initializing=0
        }}
    '
}
function moving() {
   curl -s localhost:9200/_cluster/state | jq -c '.nodes as $nodes |
       .routing_table.indices[].shards[][] |
       select(.relocating_node) | {index, shard, from: $nodes[.node].name, to: $nodes[.relocating_node].name}'
}

# Find indexes without an alias - meaning they aren't serving any traffic
curl -s localhost:9200/_cluster/state > /tmp/state
jq '.metadata.indices | keys[] as $index | select(.[$index].aliases | length == 0) | $index' /tmp/state | head
# Just get all the index names
jq '.indices | keys[]' /tmp/stats
# Get the largest indexes
curl -s localhost:9200/_stats?level=shards > /tmp/stats
jq '.indices | keys[] as $index | {
       index: $index,
       shards: ([.[$index].shards[]] | length),
       average_size: ([.[$index].shards[][].store.size_in_bytes] | add / length / 1024 / 1024 / 1024),
       total_size: ([.[$index].shards[][].store.size_in_bytes] | add / 1024 / 1024 / 1024)
   }
| select(.total_size > 5)' /tmp/stats | jq -s 'sort_by(.total_size)' 

Cleanup logs

The servers are configured with copytruncate for logrotate, which means Elasticsearch opens its log file only once. To clean up the main log, /var/log/elasticsearch/production-search-eqiad.log, it's very important to use truncate -s 0 /var/log/elasticsearch/production-search-eqiad.log to reclaim disk space. Using rm will have no effect because the file remains open from the OS perspective; if you delete it accidentally you'll have to restart Elasticsearch to actually reclaim that space. Rotated logs can safely be deleted with rm.
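
In short:

# Safe: truncate the live log in place; the file stays open and the space is reclaimed immediately
truncate -s 0 /var/log/elasticsearch/production-search-eqiad.log

# Rotated logs are closed, so plain rm works (the glob here is just illustrative)
rm /var/log/elasticsearch/production-search-eqiad.log.*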