Tool:XTools

Toolforge tools
XTools
Website https://xtools.wmcloud.org/
Description Suite of tools to analyze user and page data on WMF wikis
Keywords xtools, statistics, analytics
Author(s) Matthewrbowker, MusikAnimal, Samwilson
adapted from older version by X! and Hedonil
Maintainer(s) Matthewrbowker, MusikAnimal, Samwilson
Source code GitHub (Mirrored on Phabricator as rXTR)
License GNU General Public License 3.0 or later
Issues Open tasks · Report a bug
Admin log Nova Resource:Tools.xtools/SAL
Nova Resource:Tools.xtools-dev/SAL

This page is for documentation relating to the WMF installations of XTools at xtools.wmcloud.org (production) and xtools-dev.wmcloud.org (staging). For general documentation including installation, configuration, and development of XTools, please see mw:XTools.

There are currently three instances configured: one for staging, one for the main production app server, and one for the API (view details in the OpenStack browser). The prod instances relate to the Toolforge account xtools, and the staging instance relates to xtools-dev; these are where the matching database users come from, and where we send maintainers' emails.

Note to maintainers: we don't back up any server configuration, so please document everything here (until task T170514 is resolved).

Contact

The maintainers can be emailed at tools.xtools@toolforge.org. Note that this means the maintainers of three separate things need to be kept in sync: the VPS account and the two Toolforge accounts.

Production

Production XTools is hosted on a Cloud VPS instance. To log into the server, make sure you've been added as a maintainer of the xtools project. Then connect to the instance with ssh xtools-prod08.xtools.eqiad1.wikimedia.cloud (replacing 08 with the desired instance number) and go to the /var/www directory. Not quite everything in this directory is in the Git repository.

Logs are written to /var/www/var/log/prod.log, but only during a request where an error or high-priority log entry was made. This is why you'll see debug-level log entries in prod.log. You might also need to check /var/log/apache2/error.log for Apache-level errors.
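
Because prod.log mixes debug-level context with the entries that actually triggered the write, it can help to filter for the high-priority lines first. A minimal sketch, assuming the log uses Monolog's default line format ("[timestamp] channel.LEVEL: message"); the function name high_priority_entries is hypothetical and not part of XTools:

```shell
#!/bin/sh
# Print only log lines at ERROR level or above.
# Assumes Monolog's default line format: "[timestamp] channel.LEVEL: message".
high_priority_entries() {
    grep -E '^\[[^]]+\] [A-Za-z0-9_]+\.(ERROR|CRITICAL|ALERT|EMERGENCY):'
}

# Example (on the server):
#   high_priority_entries < /var/www/var/log/prod.log
```

If the log format has been customised, the pattern would need adjusting accordingly.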

OAuth consumer: XTools 1.2

Database: xtools on uqr2bpkaufm.svc.trove.eqiad1.wikimedia.cloud (Trove database)

Replicas: connect as user s53003 (credentials for the tools.xtools-dev account on Toolforge). This user was given more quota on concurrent connections, so it is important to use only this user for production instances.

Web server configuration is all in /etc/apache2/sites-available/xtools.conf. After changing it, you'll need to at minimum run sudo service apache2 reload for the changes to take effect.

There's a /var/www/deploy.sh script that runs every 10 minutes (from www-data's crontab) and if required updates to the latest release. The output of this is emailed to the maintainers.

There is also a dedicated API server, which lives at xtools-prod09.xtools.eqiad1.wikimedia.cloud. All requests to https://xtools.wmcloud.org/api go to this server.

Building a new instance

First create a new instance running on Debian Bullseye. The name should be the same as the old one but with the number incremented, such as xtools-prod08. The main production node should have an instance flavor with 4 VCPUs and at least 8GB of RAM, while the API server should have at least 2 cores and 4GB of RAM. Disk space is less of a concern and 20GB should suffice. All nodes should be in the default and web security groups.
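
The naming rule can be sketched in shell. This assumes instance names end in a two-digit number, as in the examples on this page; next_instance_name is a hypothetical helper. Leading zeros are stripped before the arithmetic because shells would otherwise read 08 as an invalid octal number:

```shell
#!/bin/sh
# Print the next instance name by incrementing the trailing number.
# Assumes the name ends in a two-digit number (e.g. xtools-prod08).
next_instance_name() {
    name=$1
    # Everything before the trailing digits, e.g. "xtools-prod".
    prefix=$(printf '%s' "$name" | sed 's/[0-9]*$//')
    # The trailing digits with leading zeros removed, e.g. "08" -> "8".
    number=$(printf '%s' "${name#"$prefix"}" | sed 's/^0*//')
    printf '%s%02d\n' "$prefix" "$(( ${number:-0} + 1 ))"
}

next_instance_name xtools-prod08
```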

Once the instance has been spawned, you'll need to wait for the first Puppet run before you're able to SSH in, which takes about 10 minutes.

After SSHing into the instance, follow these steps:

  1. Install PHP and Apache, along with some dependencies:
    sudo apt-get update
    sudo apt-get install -y apache2 php7.4 php7.4-cli php7.4-common php7.4-curl php7.4-json php7.4-mysql php7.4-intl php7.4-xml php7.4-mbstring libapache2-mod-php7.4 zip unzip php7.4-zip php7.4-apcu
    sudo a2dismod mpm_event && sudo a2enmod mpm_prefork && sudo a2enmod php7.4
    
  2. Install composer by following these instructions, but make sure to install to the /usr/local/bin directory and with the filename composer, e.g.:
    sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer
    
  3. Clone the repository, first removing the html directory created by Apache:
    cd /var/www && sudo rm -rf html
    sudo git clone https://github.com/x-tools/xtools.git .
    
  4. Run sudo cp .env .env.local and fill in the necessary details, using mw:XTools/Development/Configuration as a guide. For most options you can use the default. In particular, be sure APP_ENV is set to prod (even for a staging server).
  5. Run sudo composer install --no-dev --optimize-autoloader, entering yes if you get a warning about running as root. Moving forward, we won't ever run composer as root, but rather the Apache server user, www-data (see steps #7 and #12 below).
  6. Create the deploy script at /var/www/deploy.sh with the following:
    #!/bin/bash
    
    cd /var/www
    
    git fetch --quiet origin 2>&1
    
    ## Find the highest and current tags
    HIGHEST_TAG=$(git tag | sort --version-sort | tail --lines 1)
    CURRENT_TAG=$(git describe --tags)
    
    ## Exit and say nothing if we're already at the highest tag.
    if [[ "$CURRENT_TAG" == "$HIGHEST_TAG" ]]; then
        # The following line can be temporarily uncommented as-needed
        # to force www-data to clear the production cache:
        # ./bin/console cache:clear --env prod
        exit 0
    fi
    
    ## If there's an update, pull and install it.
    git checkout "$HIGHEST_TAG"
    /usr/local/bin/composer install --no-dev --optimize-autoloader
    ./bin/console cache:clear --env prod
    ./bin/console doctrine:migrations:migrate --env prod --no-interaction
    
  7. Make sure the scripts are executable, and that all the files in the repo are owned by www-data:
    sudo chmod 744 deploy.sh
    sudo chown -R www-data:www-data .
    
  8. Create the web server configuration file at /etc/apache2/sites-available/xtools.conf with the following:
    <VirtualHost *:80>
            DocumentRoot /var/www/public
            ServerName xtools.wmcloud.org
    
            # These requests aren't logged by Apache.
            SetEnvIf Request_URI "(^/robots\.txt$|^\/api\/|^\/images\/|^\/assets\/|^\/fonts\/|^\/i18n\/)" dontlog=yes
    
            # Requests with these user agents are denied.
            SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|msnbot|AhrefsBot|MauiBot|Linespider|Symfony BrowserKit|AppleNewsBot|Go-http-client|CoolToolName|UsedBaseLibrary|Archive Team|WoTBoT|Rustbot|ApeSearch|^curl\/|aiohttp)" bad_bot=yes
    
            LogFormat "%{X-Forwarded-For}i %t \"%r\" %>s \"%{Referer}i\" \"%{User-Agent}i\"" xtools
            CustomLog ${APACHE_LOG_DIR}/access.log xtools expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
            CustomLog ${APACHE_LOG_DIR}/denied.log xtools expr=(reqenv('bad_bot')=='yes')
            CustomLog ${APACHE_LOG_DIR}/attacks.log xtools expr=(reqenv('attacker')=='yes')
            ErrorLog ${APACHE_LOG_DIR}/error.log
    
            AllowEncodedSlashes On
    
            <Directory /var/www/public/>
                 Options Indexes FollowSymLinks
                 AllowOverride All
                 Require all granted
            </Directory>
    
            <Directory /var/www/>
                    Options Indexes FollowSymLinks
                    AllowOverride None
                    Require all granted
                    Deny from env=bad_bot
            </Directory>
    
            Alias /awstatsclasses "/usr/share/awstats/lib/"
            Alias /awstats-icon/ "/usr/share/awstats/icon/"
            Alias /awstatscss "/usr/share/doc/awstats/examples/css"
            ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
            ScriptAlias /awstats/ /usr/lib/cgi-bin/
            <Directory /usr/lib/cgi-bin/>
                    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                    Require all granted
            </Directory>
            
            ErrorDocument 403 "Your access to XTools has been blocked due to apparent abuse or disruptive automation. If you are a bot, please use our public APIs instead, which are optimized for this purpose: https://www.mediawiki.org/wiki/XTools/API For inquiries, please contact tools.xtools@toolforge.org"
            RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
            RewriteRule .* - [R=403,L]
            RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
            RewriteRule .* - [R=403,L]
            
            RewriteEngine On
            RewriteCond %{HTTP:X-Forwarded-Proto} !https
            RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
    
    </VirtualHost>
    
  9. Set up awstats (this step is optional, but can provide useful statistics on which endpoints are hit the most, which browsers are the most popular, etc.):
    sudo apt-get install awstats
    cd /etc/awstats
    sudo cp awstats.conf awstats.xtools.wmflabs.org.conf
    sudo a2enmod cgi
    
  10. Enable the mod_rewrite Apache module, and enable the web server configuration:
    sudo a2enmod rewrite
    sudo a2ensite xtools
    sudo service apache2 reload
    
  11. Add log rotation to Symfony's logs by creating the file /etc/logrotate.d/symfony with:
    /var/www/var/log/*.log {
            su www-data www-data
            daily
            missingok
            rotate 14
            compress
            delaycompress
            notifempty
            create 640 root adm 
            sharedscripts
            postrotate
                    if /etc/init.d/apache2 status > /dev/null ; then \
                        /etc/init.d/apache2 reload > /dev/null; \
                    fi;
            endscript
            prerotate
                    if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
                            run-parts /etc/logrotate.d/httpd-prerotate; \
                    fi; \
            endscript
    }
    
  12. Set up the crontab to run the deploy script every 10 minutes and, if awstats was installed in step 9, to update its statistics every 30 minutes. We'll do this under the www-data user:
    sudo crontab -e -u www-data
    
    Then add this to the bottom of the crontab:
    MAILTO=tools.xtools@toolforge.org
    */10 * * * * /var/www/deploy.sh
    0,30 * * * * sudo /usr/lib/cgi-bin/awstats.pl -config=xtools.wmflabs.org -update > /dev/null
    
  13. Wait for the email indicating composer ran successfully and the cache was cleared. If all goes well, you need only to gracefully (re)start Apache:
    sudo service apache2 graceful
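
The deploy script in step 6 picks the release with a version sort rather than a plain lexical sort, which matters once a version component reaches two digits. A small sketch of that selection, using the short forms of the same sort and tail options; the tag names are made up for illustration and highest_tag is a hypothetical helper:

```shell
#!/bin/sh
# Read tag names on stdin and print the highest one by version order.
# Same idea as the pipeline in deploy.sh (-V is --version-sort, -n is --lines).
highest_tag() {
    sort -V | tail -n 1
}

# Illustrative tags only; on the server the input would come from `git tag`.
printf '%s\n' 3.9.0 3.10.0 3.2.1 | highest_tag
```

With these sample tags a plain lexical sort would rank 3.9.0 last, so the script would never move on to 3.10.0.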
    

Setting up an API server

The API server itself can be built the same as the app server, with some additional proxy settings on the main app server so that all requests to /api go to the API server. You can do this by following these steps:

  1. SSH into the new API server, and ensure the replicas database credentials are for user s51187 (Toolforge account tools.xtools-ec). This user has increased quota for concurrent connections, though less than the one used by the main app server.
  2. Now SSH into the main app server.
  3. Install libxml2-dev:
    sudo apt-get install libxml2-dev
    
  4. Enable the necessary modules (if some are already enabled it will simply make sure they are active):
    sudo a2enmod proxy proxy_http proxy_ajp rewrite deflate headers proxy_balancer proxy_connect proxy_html xml2enc
    
  5. And in /etc/apache2/sites-available/xtools.conf, within the <VirtualHost> block, add this to the bottom:
    ProxyPreserveHost On
    ProxyPass /api http://X.X.X.X:80/api nocanon
    ProxyPassReverse /api http://X.X.X.X:80/api
    
    ...replacing X.X.X.X with the IP of the API server.
  6. Finally, restart Apache with sudo service apache2 graceful

Note that the API server is not accessible at its own domain name.
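
To avoid typos when substituting the address in step 5, the three proxy directives can be generated rather than hand-edited. A rough sketch; render_api_proxy is a hypothetical helper, and 192.0.2.10 is a documentation-only placeholder address, not the real API server:

```shell
#!/bin/sh
# Print the proxy directives from step 5 with X.X.X.X replaced by
# the address given as the first argument.
render_api_proxy() {
    api_ip=$1
    cat <<EOF
ProxyPreserveHost On
ProxyPass /api http://${api_ip}:80/api nocanon
ProxyPassReverse /api http://${api_ip}:80/api
EOF
}

# Placeholder address from the RFC 5737 documentation range.
render_api_proxy 192.0.2.10
```

The output still has to be pasted into the VirtualHost block of xtools.conf by hand.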

Maintenance

Sometimes weird things happen. Here are some common problems and quick solutions:

  • Errors about missing cache files – It's a mystery why this happens, but the quick easy fix is to delete the prod cache directory, and it will rebuild on its own.
    cd /var/www && sudo rm -rf var/cache/prod/ && sudo chown -R www-data:www-data . (the chown is for good measure)
  • ServiceUnavailableHttpException – Something is hammering XTools, hogging up our database quota. This usually only lasts for a minute or two at a time (check the timestamps of the first/last email). If it is persistent, action may be needed:
    1. Check the email for the common user agent.
    2. Grep the Apache logs to make sure there aren't a lot of innocent users with the same UA (sudo grep "Foo" /var/log/apache2/access.log)
    3. cd /var/www/ then update config/request_blacklist.yml, optionally using a URI pattern to help avoid affecting innocent users:
      parameters:
          request_blacklist:
              ...
              some_unique_name:
                  user_agent: "Foo"
                  # Target only frwiki, and when a non-French language is requested (which is common for scraping bots)
                  uri_pattern: "fr\\.wikipedia.*?\\?uselang=(?!fr)"
      
    4. Clear the cache with sudo ./bin/console cache:clear --env=prod
    5. For good measure, ensure everything is still owned by www-data (the webserver): sudo chown -R www-data:www-data .
    6. You can monitor /var/www/var/log/blacklist.log to see when the request blacklist is hit.
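
For step 2 above, counting how many distinct clients sent a given user agent gives a quick sense of whether innocent users would be caught by a block. A minimal sketch, assuming the xtools LogFormat from the Apache configuration on this page, where the first space-separated field is the X-Forwarded-For value; clients_for_ua is a hypothetical helper:

```shell
#!/bin/sh
# Read access-log lines on stdin and print the number of distinct
# first-field values (client addresses) among lines containing the string.
clients_for_ua() {
    grep -F -- "$1" | awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Example (on the server):
#   sudo cat /var/log/apache2/access.log | clients_for_ua "Foo"
```

The match is a plain substring search over the whole line, so a string that also appears in request URLs or referrers would inflate the count.
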
For more extreme cases, add to the user-agent blacklist in the Apache config (/etc/apache2/sites-available/xtools.conf), then reload the config with sudo service apache2 reload.
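
Before adding an entry to the user-agent pattern, the expression can be tried against sample strings. A rough sketch that uses grep -Ei as a stand-in for Apache's case-insensitive SetEnvIfNoCase matching (Apache uses PCRE, so results can differ for more exotic patterns); the pattern below is only a short excerpt of the one in xtools.conf, and is_bad_bot is a hypothetical helper:

```shell
#!/bin/sh
# Short excerpt of the bad_bot pattern from xtools.conf (not the full list).
BAD_BOT_PATTERN='(Googlebot|bingbot|python-requests|^curl/|Java/)'

# Succeed (exit status 0) if the given User-Agent string matches the pattern.
is_bad_bot() {
    printf '%s\n' "$1" | grep -Eiq "$BAD_BOT_PATTERN"
}

is_bad_bot 'curl/8.5.0' && echo 'denied: curl/8.5.0'
is_bad_bot 'Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0' || echo 'allowed: Firefox'
```

Checking a typical browser string alongside the target helps confirm that a new alternative is not broader than intended.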

Staging

SSH to xtools-dev06.xtools.eqiad1.wikimedia.cloud (see notes above in #Production about getting access).

Database: xtools_dev on uqr2bpkaufm.svc.trove.eqiad1.wikimedia.cloud (Trove database)

OAuth consumer: xtools-dev 1.2

Replicas: connect as user s51187 (credentials for the tools.xtools account on Toolforge).


The code on the staging server is kept up to date with the main branch with the following /var/www/deploy.sh script (run every 10 minutes from www-data's crontab):

#!/bin/bash

cd /var/www

## See if there's any update.
GITFETCH=$(git fetch && git diff origin/main 2>&1)
if [ -z "$GITFETCH" ]; then
  exit 0
fi

## If there's an update, pull and install it.
git checkout main
git pull origin main
/usr/local/bin/composer install --no-dev --optimize-autoloader
./bin/console cache:clear --env prod
./bin/console doctrine:migrations:migrate --env prod --no-interaction

Setting up the staging server is the same as production, except you would use a smaller box (m1.small), and use the above deploy script instead of the one that goes off of tags. You also need to update the ServerName in the Apache configuration accordingly.

See also