XmlRcs: Difference between revisions

From Wikitech
Revision as of 20:42, 10 August 2017

XmlRcs is a proxy for EventStreams (Wikimedia's recent changes feed) that exposes it as XML instead of JSON, using a lightweight TCP connection with a few simple commands.

XmlRcs simplifies access to EventStreams for applications that, for whatever reason, can't use the Server-Sent Events protocol or JSON.

Wikimedia has had the IRC feed for a long time. While there are numerous problems with it (e.g. the complex data format with IRC color codes, wikitext notation, and embedding of localised interface messages to encode data), the underlying communication protocol (IRC) is relatively easy to implement in any programming language.

This IRC feed, however, has been deprecated and replaced with RCStream, which was in turn deprecated and replaced with EventStreams, which is supposed to be a more stable platform that makes it easy for programmers to retrieve events from Wikimedia sites in real time. While it may be a better platform in many ways, it does add complexity to the stack. It adds a dependency on third-party technologies, such as WebSockets and JSON. While JSON is an easy data format to decode, WebSockets is quite new and lacks good implementations for popular programming languages and frameworks (such as .NET or Qt). In JavaScript or Python, RCStream's WebSocket can be used directly, but it's hard for developers working in lower-level languages like C or C++.

XmlRcs intends to solve this problem. It introduces a simple and lightweight TCP protocol, using XML packets to encode the event data.

How it works

Flow of XmlRcs

This service is another layer behind the WebSockets server. It's implemented as a Python daemon that converts the WebSockets and JSON data into raw data and puts it in Redis. A C++ daemon then retrieves the data from Redis and acts as a server that clients can connect to in order to subscribe to various feeds.
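The conversion step can be pictured with a short sketch. This is not the actual es2r.py code. It assumes the input is a recentchange event from EventStreams that has already been decoded from JSON into a dict, and the function name event_to_xml is made up for illustration. The Redis hand-off is left out, and how the real daemon handles missing fields is unknown (here they become empty strings).

```python
import xml.etree.ElementTree as ET


def event_to_xml(event: dict) -> str:
    """Render one decoded recentchange event as a single-line <edit> node.

    Hypothetical helper: the field mapping is an assumption based on the
    attributes seen in the XmlRcs output, not on the daemon's source.
    """
    revision = event.get("revision") or {}
    length = event.get("length") or {}
    attrs = {
        "wiki": event.get("wiki"),
        "server_name": event.get("server_name"),
        "revid": revision.get("new"),
        "oldid": revision.get("old"),
        "summary": event.get("comment"),
        "title": event.get("title"),
        "namespace": event.get("namespace"),
        "user": event.get("user"),
        "bot": event.get("bot"),
        "patrolled": event.get("patrolled"),
        "minor": event.get("minor"),
        "type": event.get("type"),
        "length_new": length.get("new"),
        "length_old": length.get("old"),
        "timestamp": event.get("timestamp"),
    }
    # Attribute values must be strings; str(True) gives "True" as in the feed.
    node = ET.Element(
        "edit", {k: "" if v is None else str(v) for k, v in attrs.items()}
    )
    # short_empty_elements=False yields <edit ...></edit> rather than <edit ... />.
    return ET.tostring(node, encoding="unicode", short_empty_elements=False)
```

ElementTree takes care of escaping characters such as & and < in edit summaries, which is the main reason to build the node with a library rather than by string concatenation.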

By default the daemon listens on TCP port 8822 and runs on the server huggle-rc.wmflabs.org. Example usage:

telnet huggle-rc.wmflabs.org 8822
Trying 208.80.155.196...
Connected to huggle-rc.wmflabs.org.
Escape character is '^]'.
S en.wikipedia.org
<ok></ok>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587049" oldid="625934858" summary="cat" title="Dunbar Douglas, 4th Earl of Selkirk" namespace="0" user="Brendandh" bot="False" patrolled="False" minor="False" type="edit" length_new="4485" length_old="4446" timestamp="1421317382"></edit>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587048" oldid="638351579" summary="Added source and Explanation of how JMB past papers were used to examine present grade inlfation in the British education system." title="Joint Matriculation Board" namespace="0" user="85.3.139.236" bot="False" patrolled="False" minor="False" type="edit" length_new="4990" length_old="4735" timestamp="1421317382"></edit>
<edit wiki="enwiki" server_name="en.wikipedia.org" revid="642587050" oldid="631962647" summary="Added charts section." title="Pacifica (The Presets album)" namespace="0" user="Ss112" bot="False" patrolled="False" minor="False" type="edit" length_new="7697" length_old="6946" timestamp="1421317382"></edit>
exit
Connection closed by foreign host.

As you can see, you only need to connect to port 8822 using TCP and subscribe using simple commands. The output consists of XML nodes that contain information about edits.
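The same session can be driven from a program. Below is a minimal Python sketch of the telnet example. The function names are made up for illustration, the host is the one named above (it may no longer be reachable), and keep-alive handling is deliberately left out, so it's only suitable for short sessions.

```python
import socket

HOST = "huggle-rc.wmflabs.org"  # host named in the text; may be unreachable today
PORT = 8822


def subscribe_and_read(sock: socket.socket, wiki: str, max_events: int):
    """Subscribe to one wiki on an open socket and return up to max_events <edit> lines."""
    reader = sock.makefile("r", encoding="utf-8", newline="\n")
    sock.sendall(f"S {wiki}\n".encode("utf-8"))
    edits = []
    for line in reader:
        line = line.rstrip("\r\n")
        # Skip <ok>, <ping> and anything else that is not an edit.
        if line.startswith("<edit "):
            edits.append(line)
            if len(edits) >= max_events:
                break
    sock.sendall(b"exit\n")
    return edits


def main():
    """Print three edits from English Wikipedia (needs network access)."""
    with socket.create_connection((HOST, PORT), timeout=30) as sock:
        for edit in subscribe_and_read(sock, "en.wikipedia.org", 3):
            print(edit)
```

Calling main() runs the sketch against the live server. subscribe_and_read takes an already connected socket so that it can also be tried against a local test server.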

Commands

Every command is plain text terminated by a newline.

S

Subscribe to a feed. Syntax: S <hostname of wiki>

Example: S en.wikipedia.org

You can use the magic word "all" to subscribe to all wikis.

Response: "<ok></ok>" on success, "<error>reason</error>" on error

D

Remove a subscription. Syntax: D <hostname of wiki>

Example: D en.wikipedia.org

Using the magic word "all" removes the subscription to "all wikis", but if you were subscribed to other wikis as well, those subscriptions will stay.

Response: "<ok></ok>" on success, "<error>reason</error>" on error

clear

Removes all subscriptions

Response: "<ok></ok>" on success, "<error>reason</error>" on error

stat

Display various system information

ping

Check whether the connection is alive.

Response: "<pong></pong>"

exit

Close the connection

Important: whenever you receive the XML node "ping", you are supposed to send the raw text "pong". If you fail to do that, you may be randomly disconnected.
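Since every command is just a line of text, a client can build them with one small helper. The following is a sketch only: build_command is a hypothetical name, and the whitespace check is a client-side precaution, not something the protocol is documented to require.

```python
TAKES_HOSTNAME = {"S", "D"}
# "pong" is included because it is sent the same way, as raw text plus newline.
NO_ARGUMENT = {"clear", "stat", "ping", "exit", "pong"}


def build_command(name: str, argument: str = "") -> bytes:
    """Return the bytes to send for one command, newline included."""
    if name in TAKES_HOSTNAME:
        # Reject empty arguments and embedded whitespace or newlines, which
        # would otherwise smuggle a second command onto the wire.
        if not argument or any(ch.isspace() for ch in argument):
            raise ValueError("S and D need a hostname (or 'all') without whitespace")
        return f"{name} {argument}\n".encode("utf-8")
    if name in NO_ARGUMENT:
        if argument:
            raise ValueError(f"{name} takes no argument")
        return f"{name}\n".encode("utf-8")
    raise ValueError(f"unknown command: {name}")
```

For example, build_command("S", "en.wikipedia.org") gives the bytes for the subscription shown in the telnet session, and build_command("D", "all") drops the "all wikis" subscription.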

Output

At the moment the daemon always responds in XML. Each XML node is on exactly 1 line, terminated by a newline.
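Because one node equals one line, a client only has to split the incoming TCP stream on newlines, keeping in mind that a single recv() may end in the middle of a node. Here is a small sketch of such a buffer; the class name LineBuffer is made up for illustration, and UTF-8 is assumed as the encoding.

```python
from typing import List


class LineBuffer:
    """Collects raw chunks from the socket and hands back complete lines."""

    def __init__(self) -> None:
        self._pending = b""

    def feed(self, chunk: bytes) -> List[str]:
        """Add a chunk and return every line completed by it (may be empty)."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest waits
        # for the next chunk.
        *complete, self._pending = self._pending.split(b"\n")
        return [raw.rstrip(b"\r").decode("utf-8", errors="replace") for raw in complete]
```

Decoding happens only on complete lines, so a multi-byte character that is split across two chunks is not mangled.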

error

Example:

meh
<error>Unknown: meh</error>

Non-critical error message

fatal

Example:

<fatal>Redis server is down</fatal>

Critical error which implies that the XmlRcs daemon has become defunct. This error should be very rare.
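A client will usually want to treat these two nodes differently: carry on after an error, give up (or reconnect later) after a fatal. One possible way to express that in Python is sketched below. The exception and function names are invented for this example, and it assumes that every line the daemon sends is well-formed XML.

```python
import xml.etree.ElementTree as ET


class XmlRcsError(Exception):
    """Non-critical problem, e.g. an unknown command."""


class XmlRcsFatal(Exception):
    """The daemon reported that it can no longer work."""


def check_response(line: str) -> ET.Element:
    """Parse one line; raise on <error> or <fatal>, otherwise return the node."""
    node = ET.fromstring(line)
    if node.tag == "error":
        raise XmlRcsError(node.text or "")
    if node.tag == "fatal":
        raise XmlRcsFatal(node.text or "")
    return node
```

The caller can then catch XmlRcsError, log it and continue, while letting XmlRcsFatal end the session.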

ok

Example:

S this.is.a.test
<ok>S this.is.a.test</ok>

ping

Example:

<ping></ping>

The daemon sends these messages at random intervals to verify that the client is still online. If you fail to reply with "pong", you may get disconnected within a minute. (Note: the reply doesn't need to be "pong"; the last_response time gets reset on any input.)
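In a client's read loop this boils down to one check per received line. A minimal sketch follows; keepalive_reply is a hypothetical helper, and accepting the short form <ping/> is an extra precaution, not something the daemon is known to send.

```python
from typing import Optional


def keepalive_reply(line: str) -> Optional[bytes]:
    """Return the bytes to send back for a server line, or None if no reply is needed."""
    stripped = line.strip()
    if stripped in ("<ping></ping>", "<ping/>"):
        return b"pong\n"
    # <pong></pong> is the answer to our own "ping" command and needs no reply.
    return None
```

Typical use is to call it on every line before any other processing and, if the result is not None, pass it straight to sock.sendall().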

edit

Information about a wiki edit. Example:

<edit wiki="wikidatawiki" server_name="www.wikidata.org" revid="188428371" oldid="188099357" summary="/* wbcreateclaim-create:1| */ Property:P361: Q18770801" title="Q17467648" namespace="0" user="RobotMichiel1972" bot="True" patrolled="True" minor="False" type="edit" length_new="5168" length_old="4758" timestamp="1421402947"></edit> 
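  • wiki: short name of the wiki (enwiki)
  • server_name: FQDN of the server (en.wikipedia.org)
  • revid: revision ID (54635262)
  • oldid: previous revision ID (5635323)
  • summary: edit summary
  • title: name of the page
  • namespace: number of the namespace the page is in (0)
  • user: name of the user
  • bot: whether the edit was made by a bot ("True" or "False")
  • patrolled: whether the edit is patrolled ("True" or "False")
  • minor: whether the edit is marked as minor ("True" or "False")
  • type: type of edit ("edit" is a regular edit, "new" is a new page)
  • length_new: size of the new revision
  • length_old: size of the old revision
  • timestamp: Unix timestamp of the edit, in seconds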
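All attributes arrive as strings, so a client typically converts them to proper types once. The sketch below does that with Python's standard library. The Edit class and parse_edit are names chosen for this example, and treating a missing or empty numeric attribute as 0 is an assumption, since the feed's behaviour for such cases isn't documented here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Edit:
    wiki: str
    server_name: str
    revid: int
    oldid: int
    summary: str
    title: str
    namespace: int
    user: str
    bot: bool
    patrolled: bool
    minor: bool
    type: str
    length_new: int
    length_old: int
    timestamp: datetime


def _num(value) -> int:
    # Assumption: a missing or empty attribute counts as 0.
    return int(value or 0)


def parse_edit(line: str) -> Edit:
    """Turn one <edit> line into a typed Edit object."""
    node = ET.fromstring(line)
    if node.tag != "edit":
        raise ValueError(f"not an <edit> node: {node.tag}")
    a = node.attrib
    return Edit(
        wiki=a.get("wiki", ""),
        server_name=a.get("server_name", ""),
        revid=_num(a.get("revid")),
        oldid=_num(a.get("oldid")),
        summary=a.get("summary", ""),
        title=a.get("title", ""),
        namespace=_num(a.get("namespace")),
        user=a.get("user", ""),
        bot=a.get("bot") == "True",  # flags are the strings "True"/"False"
        patrolled=a.get("patrolled") == "True",
        minor=a.get("minor") == "True",
        type=a.get("type", ""),
        length_new=_num(a.get("length_new")),
        length_old=_num(a.get("length_old")),
        timestamp=datetime.fromtimestamp(_num(a.get("timestamp")), tz=timezone.utc),
    )
```

With typed fields, things like the size change of an edit become simple arithmetic: edit.length_new - edit.length_old.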

Maintainer info

The whole thing lives on the instance xmlrcs in the project huggle. It consists of 3 components, which always need to be started in this order:

  • redis server (started by init.d)
  • xmlrcsd (server daemon for XmlRcs, can be started as user xmlrcs: ./xmlrcsd -d)
  • EventStream to redis daemon (/opt/xmlrcs/es2r.py should be running as user xmlrcs; it's recommended to start it using the auto-restart loop script in /opt/xmlrcs: nohup ./start &)


C# Library

There is a C# library: https://github.com/huggle/XMLRCS/tree/master/clients/c%23/XmlRcs

You can download it from the "releases" page (precompiled .dll).