We recently augmented the team working on our desktop product. At the core of the product is XMPP – the protocol that drives several instant messaging servers and clients, sites like Chesspark and now Google Wave. Since XMPP is not widely known, let alone well understood, every new team member faces a steep learning curve. This post is an attempt to make the protocol easier to understand.

 

What is XMPP

XMPP generally refers to a collection of specifications that define protocols for real time interactions over the public internet. The core set of specifications has been standardized by the IETF. While the first application (and the origin) of the protocols was instant messaging, the extensible nature of XMPP has led to its use in a wide range of applications that need real time communication.

 

Architecture

XMPP has a decentralized client-server architecture (like the WWW) – there are several hundred XMPP deployments, each running anywhere from one to hundreds of servers to which millions of clients connect. Key aspects:

 

Addressing:

Each interacting entity on XMPP needs to have a unique address. This address is called a Jabber Id (JID). A JID consists of three parts:

user-identifier: Unicode string representing the interacting entity. For example: user1

domain-identifier: Unicode string representing the domain of the interacting entity. For example: directi.com

resource-identifier: Unicode string representing a resource used by the entity. For example: pwdesktop

A full JID looks as follows: <user-identifier>@<domain-identifier>/<resource-identifier>. For example: user1@directi.com/pwdesktop. An identifier without the resource is called a bare JID: user1@directi.com. Another way of addressing is to use XMPP URIs: xmpp:user1@directi.com
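
The three parts of a JID can be separated with plain string operations. The following is a minimal sketch, not a complete implementation: the names JID and parse_jid are made up for this example, the handling of a domain-only address is an assumption, and the Unicode normalization that real XMPP libraries apply is skipped.

```python
from typing import NamedTuple, Optional


class JID(NamedTuple):
    user: Optional[str]      # user-identifier (assumed optional, e.g. for a server address)
    domain: str              # domain-identifier
    resource: Optional[str]  # resource-identifier (absent in a bare JID)

    def bare(self) -> str:
        """Return the bare JID, i.e. the address without the resource."""
        return f"{self.user}@{self.domain}" if self.user else self.domain


def parse_jid(text: str) -> JID:
    # Accept the XMPP URI form by dropping the "xmpp:" prefix.
    if text.startswith("xmpp:"):
        text = text[len("xmpp:"):]
    # The resource is everything after the first "/".
    rest, slash, resource = text.partition("/")
    # The user part is everything before the first "@" of what remains.
    user, at, domain = rest.partition("@")
    if not at:
        user, domain = "", rest
    if not domain:
        raise ValueError(f"JID has no domain: {text!r}")
    return JID(user or None, domain, resource if slash else None)


full = parse_jid("user1@directi.com/pwdesktop")
print(full.user, full.domain, full.resource)
print(full.bare())
```

Splitting on "/" before "@" matters because a resource is free-form text and may itself contain an "@".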

 

Communication:

The original purpose of XMPP was instant messaging. This requires the server to be able to push messages out to the client in real time. Over HTTP, this requires the client to poll the server, or a technique like Comet where the server holds HTTP requests open and responds to them when it has something to send. XMPP instead uses a long lived TCP connection. This gives the server an always-on channel to push info to the client. The client also need not wait for the server to respond to its messages: it can send any number of messages without blocking, and the server can respond to them later, which makes the communication asynchronous. This is further helped by the fact that XMPP does not require every packet sent over the wire to be acknowledged. An entity assumes a packet has been delivered unless it receives an error.

 

Protocol:

XMPP uses streaming XML over a long lived TCP connection (though HTTP is also possible through BOSH – see later) for communication. There is one stream from client to server and another stream from server to client. To start communicating, a client would send an XML declaration:

<?xml version="1.0"?>

This is followed by a stream element – this marks the beginning or root of the document:

<stream:stream>

This in turn is followed by various messages sent across as XML elements. These elements are called stanzas. Stanzas can continue to be exchanged between the server and the client endlessly until a closing stream tag is sent, which marks the end of the communication. There are three kinds of stanzas:

1) Message: The <message/> stanza denotes the basic push method for sending stuff from one entity to the other. These need not be acknowledged and provide a quick fire and forget mechanism to send info from one point to the other. A Message has a “to” and a “from” attribute denoting the receiver and the sender, and can include one or more payloads. In IM conversations, the payload is often HTML markup for richly formatted text (defined by XHTML-IM)

2) Presence: The <presence/> stanza denotes a broadcast sent out by an entity to advertise its availability to other entities who have subscribed to receive these updates from the advertising entity. Presence messages can also include a payload. Common uses are to include an availability state like “away,” “busy,” etc. and personal status messages like “Working on blog post 1 of 6.”

3) Information Query: The <iq/> stanza provides a Request-Response mechanism like HTTP verbs GET, PUT and POST. The payload defines the request of the sender which needs to be processed by the receiver. This is the only stanza type where the sender expects a reply – a result or an error. This makes IQ more reliable than a Message, allowing the two entities to carry out a structured interaction. Common examples of IQ usage in IM applications are to fetch the Roster, and add / remove entries in the Roster.

A closing stream tag indicates end of conversation.
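
Because the stream is one long XML document that arrives in pieces, a receiver has to parse it incrementally and hand over each stanza as soon as its closing tag arrives. The sketch below shows one way to do that with the XMLPullParser from Python's standard library. The function name read_stanzas, the sample chunks and the address in them are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_stanzas(chunks):
    """Yield every complete direct child of <stream:stream> as it arrives."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # Depth 1 means we are back directly under the stream root,
                # so a whole stanza has just been closed.
                if depth == 1:
                    yield elem


# Data as it might arrive from the socket, split at arbitrary points.
chunks = [
    '<?xml version="1.0"?>'
    '<stream:stream xmlns="jabber:client" '
    'xmlns:stream="http://etherx.jabber.org/streams">',
    '<message to="user2@domain2.com"><body>hel',
    'lo</body></message><presence/>',
    '<iq type="get" id="r1"/></stream:stream>',
]

for stanza in read_stanzas(chunks):
    print(stanza.tag, stanza.attrib)
```

The closing stream tag brings the depth back to zero, so it is not reported as a stanza.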

 

Decentralized client-server architecture

XMPP servers talk to each other directly. So if a user at domain1.com needs to interact with a user at domain2.com, the servers at domain1.com would interact with the domain2.com servers directly. This is different from email, where communication between servers on different domains can happen through intermediate hops (which can lead to address spoofing and other issues), or HTTP, where servers do not interact with each other at all.

 

 

Creating an XMPP Session

Suppose user1 wants to talk to another user. To accomplish this:

 

1) Client creates a TCP session with the server:

The client used by user1 needs to find the machine that hosts the XMPP service for domain2. It does this with a DNS SRV lookup: the SRV record maps the service to a host name and port (5222 by default), and an A or AAAA lookup then gives the IP address of that host. With the IP address and the port, the client can open a TCP connection.
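
Python's standard library cannot query SRV records, so the sketch below assumes the records have already been fetched with a DNS library and only shows what the client does with them. SrvRecord, srv_query_name and connection_candidates are names made up for this example, and the host names are hypothetical. Ordering records of equal priority by weight is a simplification of the weighted random selection that the SRV specification describes.

```python
from typing import List, NamedTuple, Tuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def srv_query_name(domain: str) -> str:
    # Client-to-server XMPP is advertised under this SRV name.
    return f"_xmpp-client._tcp.{domain}"


def connection_candidates(records: List[SrvRecord], domain: str) -> List[Tuple[str, int]]:
    """Return (host, port) pairs in the order the client should try them."""
    if not records:
        # No SRV record: fall back to the domain itself on the default port.
        return [(domain, 5222)]
    # Lower priority values are tried first; higher weight first within a priority.
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight))
    return [(r.target, r.port) for r in ordered]


records = [
    SrvRecord(20, 0, 5222, "backup.domain2.com"),
    SrvRecord(10, 5, 5222, "xmpp1.domain2.com"),
    SrvRecord(10, 50, 5223, "xmpp2.domain2.com"),
]
print(srv_query_name("domain2.com"))
print(connection_candidates(records, "domain2.com"))
```

Each candidate can then be passed to something like socket.create_connection, which resolves the host name to an IP address and opens the TCP connection.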

 

2) Client and Server start streams in opposite directions:

a) The client sends across the opening XML text declaration tag (optional) followed by an initial stream header:

<?xml version="1.0"?><stream:stream to="domain2.com" version="1.0" xmlns="jabber:client" xmlns:stream="http://etherx.jabber.org/streams">

b) The server sends back a response stream header with a unique stream id:

<?xml version="1.0"?><stream:stream from="domain2.com" id="0123456789" version="1.0" xmlns="jabber:client" xmlns:stream="http://etherx.jabber.org/streams">

 

3) Client and Server negotiate Stream Features:

Right after sending the response stream header, the server sends across a <stream:features> element listing the features it supports. These features are typically about:

a) Whether server supports TLS or not (recommended)

b) Authentication mechanisms supported (see my earlier blog post on authentication mechanisms) – typically SASL PLAIN and DIGEST-MD5 are supported. Ideally one should not use PLAIN without TLS, since in that case the password is sent in the clear on the wire.

c) Stream compression (optional)

At this point, the client negotiates the features it wants to use: if TLS is used, it sends a <starttls/> element and enters into a TLS negotiation, and if SASL is used, it sends an <auth/> element and authenticates via the appropriate SASL mechanism. These negotiation elements are not stanzas.
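
To make the feature list concrete, here is a sketch of how a client might read a <stream:features> element and decide what it is willing to use. The sample XML is illustrative rather than captured from a real server, and summarize_features is a made-up helper. Dropping PLAIN when TLS is not active is one possible policy that follows the advice above.

```python
import xml.etree.ElementTree as ET

NS = {
    "tls": "urn:ietf:params:xml:ns:xmpp-tls",
    "sasl": "urn:ietf:params:xml:ns:xmpp-sasl",
}

# Illustrative feature advertisement.
FEATURES = """
<stream:features xmlns:stream="http://etherx.jabber.org/streams">
  <starttls xmlns="urn:ietf:params:xml:ns:xmpp-tls"/>
  <mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
    <mechanism>DIGEST-MD5</mechanism>
    <mechanism>PLAIN</mechanism>
  </mechanisms>
</stream:features>
"""


def summarize_features(xml_text, tls_active=False):
    """Return (server offers TLS, SASL mechanisms the client may use)."""
    root = ET.fromstring(xml_text)
    offers_tls = root.find("tls:starttls", NS) is not None
    mechanisms = [m.text for m in root.findall("sasl:mechanisms/sasl:mechanism", NS)]
    if not tls_active:
        # Assumed policy: never send the password in the clear.
        mechanisms = [m for m in mechanisms if m != "PLAIN"]
    return offers_tls, mechanisms


print(summarize_features(FEATURES))
print(summarize_features(FEATURES, tls_active=True))
```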

 

4) Post Authentication Stream Negotiation

After authentication, the stream is restarted: the client sends a new stream header and the server responds with a stream header carrying a new stream id. This is done for security purposes. The new stream does not advertise any authentication features (since that is already done), but advertises new features. These typically include:

a) Compression support

b) Resource binding

c) Formally starting an XMPP session

At this point, the client uses <iq/> stanzas to bind a resource and start a session, and once that part is over, the actual task of application specific stanza exchange can start between the client and the server.
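
Resource binding is a good example of the <iq/> request and response pattern. The sketch below builds a bind request and reads the full JID out of the server's reply. The helper names, the request id and the JID in the sample reply are made up for illustration, and error handling is reduced to a single check.

```python
import xml.etree.ElementTree as ET

BIND_NS = "urn:ietf:params:xml:ns:xmpp-bind"


def build_bind_request(resource, request_id):
    """Build an <iq type="set"> asking the server to bind the given resource."""
    iq = ET.Element("iq", {"type": "set", "id": request_id})
    # Writing xmlns as a plain attribute is a shortcut that serializes correctly.
    bind = ET.SubElement(iq, "bind", {"xmlns": BIND_NS})
    ET.SubElement(bind, "resource").text = resource
    return ET.tostring(iq, encoding="unicode")


def bound_jid(response_xml):
    """Extract the full JID from the server's reply to a bind request."""
    root = ET.fromstring(response_xml)
    if root.get("type") != "result":
        raise ValueError("resource binding failed")
    return root.findtext(f"{{{BIND_NS}}}bind/{{{BIND_NS}}}jid")


print(build_bind_request("pwdesktop", "bind_1"))

# Hypothetical reply from the server.
reply = (
    '<iq type="result" id="bind_1">'
    '<bind xmlns="urn:ietf:params:xml:ns:xmpp-bind">'
    '<jid>user1@domain2.com/pwdesktop</jid>'
    '</bind></iq>'
)
print(bound_jid(reply))
```

A real client would also match the id of the reply against the id of the request, since replies can arrive in any order.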

 

BOSH

I earlier mentioned that XMPP can work over HTTP as well. This seems counter-intuitive: XMPP requires push, and HTTP is pull based (client sends a request and server responds). However, it turns out that one can do push over HTTP as well – the technique for using XMPP over HTTP is called Bidirectional-streams Over Synchronous HTTP (BOSH):

 

1) There is a server in front of the XMPP server which handles HTTP clients. This is called a BOSH connection manager (CM).

 

2) Client sends DNS query for TXT records, and discovers that there is an entry for BOSH connection which points to the BOSH server mentioned above.

 

3) Session Creation Request: Client now sends an HTTP POST with an empty <body/> tag with some attributes set. The important ones are:

a) hold – the number of HTTP requests the BOSH server can queue. This is typically set to 1

b) wait – the timeout in seconds before which the server must respond to a pending request

c) rid – a large random number that acts as the initial request id

 

4) Session Creation Response: BOSH server opens a regular XMPP stream with the XMPP server over a TCP connection, receives the server’s XMPP response, wraps it up in a <body/> tag and returns this to the client over HTTP. The body tag contains the following attributes:

a) hold – same as earlier

b) requests – max number of HTTP requests that the client can open with the BOSH server at any time. This is typically set to hold + 1. Since hold is typically 1, requests is typically 2.

c) sid – a large random number that acts as the session id. This is different from the stream id sent by the XMPP server. The client must now include the sid in every subsequent request.
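
The <body/> wrappers used in the session creation steps are small enough to build by hand. The sketch below produces the POST payload for a session creation request and for a later request on the same session. The function names are made up, the to attribute and the rule that rid goes up by one on every request come from the BOSH specification rather than from this post, and sending the payload over HTTP is left out.

```python
import secrets
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

BOSH_NS = "http://jabber.org/protocol/httpbind"


def session_request(domain, hold=1, wait=60, rid=None):
    """Return (rid, payload) for the session creation request."""
    if rid is None:
        # A large random starting value for the request id.
        rid = secrets.randbelow(2**32) + 1
    body = ET.Element("body", {
        "xmlns": BOSH_NS,
        "to": domain,
        "hold": str(hold),
        "wait": str(wait),
        "rid": str(rid),
    })
    return rid, ET.tostring(body, encoding="unicode")


def next_request(sid, rid, payload_xml=""):
    """Wrap zero or more stanzas for an established session; rid goes up by one."""
    rid += 1
    body = f'<body xmlns="{BOSH_NS}" sid={quoteattr(sid)} rid="{rid}">{payload_xml}</body>'
    return rid, body


rid, payload = session_request("domain2.com")
print(payload)
rid, payload = next_request("some-session-id", rid, '<presence xmlns="jabber:client"/>')
print(payload)
```

Calling next_request with no payload gives the empty <body/> that the client uses when it has nothing to say.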

 

5) Hereafter the client and the server negotiate stream features and authenticate pretty much in the same way as with a TCP connection, with the BOSH server sitting in between and wrapping / unwrapping the <stream> and <iq/> stanzas in <body/> tags. The XMPP application is now ready to exchange its specific stanzas.

 

6) The question at this stage is how the server does a push. Recall that hold=1 is the max number of HTTP requests the BOSH server can queue, and requests=2 is the max number of HTTP requests that the client can have open with the BOSH server at any time. Assume that wait is 60 seconds and that the last request from the client (let’s say a <presence/> packet) was sent to the BOSH server almost 60 seconds back. The BOSH server has not responded yet because the XMPP server had no stanzas to send. Now that the 60 second timeout is about to expire, this is what happens:

a) BOSH server returns an HTTP 200 OK response to the last request with an empty <body/> tag. If the client too has nothing to say, it sends a new HTTP request with an empty <body/> tag, to which the server can again respond at the 60 second timeout. This can go on ad infinitum as a keep-alive mechanism.

b) Now assume that the client has something to say. One request is already in play and at most two are allowed. So the client can now send the new stanza in a new HTTP request. The BOSH server immediately responds to the earlier request (which was kept on hold) with an HTTP 200 OK with an empty <body/> tag. It now again has one request outstanding which it can use to either send back the keep-alive, or send back a response from the XMPP server.

c) Assume that the client and the server have been playing the keep alive game. Now the XMPP server sends a stanza (say an authorization request). The BOSH server needs to push this to the client. This is easy since it has a cached HTTP request. Hence push is accomplished over HTTP, without doing constant polling.
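
The three cases above can be replayed with a toy model of the connection manager. This is only a sketch of the hold logic: there is no real HTTP, the wait timer is fired by hand, the client's own payload is ignored, and the class and method names are made up for this example.

```python
from collections import deque


class ConnectionManager:
    """Toy model of how a BOSH connection manager uses held requests."""

    def __init__(self, hold=1):
        self.hold = hold
        self.held = deque()   # ids of requests waiting for a response
        self.outbox = []      # stanzas from the XMPP server not yet delivered
        self.responses = []   # (request id, stanzas) pairs sent to the client

    def _respond(self, stanzas):
        # Always answer the oldest held request first.
        self.responses.append((self.held.popleft(), stanzas))

    def on_client_request(self, rid):
        self.held.append(rid)
        if self.outbox:
            # Data was waiting for a request to carry it back.
            stanzas, self.outbox = self.outbox, []
            self._respond(stanzas)
        # Never hold more than `hold` requests: release the oldest with an empty body.
        while len(self.held) > self.hold:
            self._respond([])

    def on_server_stanza(self, stanza):
        if self.held:
            self._respond([stanza])   # push over the held request
        else:
            self.outbox.append(stanza)

    def on_wait_timeout(self):
        if self.held:
            self._respond([])         # empty <body/> as keep-alive


cm = ConnectionManager(hold=1)
cm.on_client_request(1)
cm.on_wait_timeout()                  # case a: keep-alive
cm.on_client_request(2)
cm.on_client_request(3)               # case b: client speaks, request 2 is released
cm.on_server_stanza("<presence/>")    # case c: push over held request 3
print(cm.responses)
```

In this model a stanza that arrives while no request is held waits in the outbox and goes out with the next request the client sends.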

 

The advantage of using BOSH is that it can work even in flaky networks where a TCP connection would break, forcing the client to once again establish an XMPP session. It also makes it possible to use XMPP in web clients where one cannot open a TCP connection. For example, Facebook’s chat feature uses XMPP over BOSH.

 

Jingle

XMPP uses a client-server model for all communication and is optimized for small snippets of info. So if the amount of data to be exchanged is very large, for example in applications like file transfer, audio-video calls and screen-sharing, an XEP (XMPP Extension Protocol) called Jingle is used. Jingle is large and complex enough to deserve its own series of posts. I will summarize basic facts here:

1) The basic idea behind Jingle (and other multimedia protocols like SIP) is to use two channels:

a) Signaling Channel to set up, manage and tear down application defined sessions

b) Media Channel to transfer the payload either peer to peer or relayed through a mediator over an application defined transport

 

2) In a Jingle negotiation (<jingle/> element inside an <iq/> element), the initiator makes an offer to start a session by declaring one or more app types (say voice, video, etc.) and a transport method (ICE, UDP, etc.). The responder and the initiator then negotiate a set of parameters (for example codecs to be used), and if the negotiation works, data is exchanged. Some parameters can be modified even while the data is being exchanged.

 

3) Jingle supports two transport types:

a) Datagram transports like UDP – can tolerate packet loss – meant for apps like media streaming

b) Streaming transports like TCP – no packet loss tolerated – for example file transfer

 

4) The real power of Jingle comes from using Jingle over ICE. ICE provides a mechanism for two entities to communicate and negotiate all possible ways of connecting between each other – direct or mediated. ICE in turn can use a STUN server to find out the IP address and port of an endpoint from outside the firewall, and a TURN server to relay data in case a direct peer to peer connection is not available.
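
To give a feel for the signaling described in point 2, the sketch below builds the first message of a voice call: an <iq/> that carries a <jingle/> offer with one application type, a list of codecs and an ICE transport. The function name, addresses, session id and codec list are made up for illustration, and a real offer carries more detail, such as the ICE candidates.

```python
import xml.etree.ElementTree as ET


def session_initiate(initiator, responder, sid, codecs):
    """Build a Jingle offer for an audio session over an ICE-UDP transport."""
    iq = ET.Element("iq", {"type": "set", "from": initiator, "to": responder, "id": "jingle1"})
    jingle = ET.SubElement(iq, "jingle", {
        "xmlns": "urn:xmpp:jingle:1",
        "action": "session-initiate",
        "initiator": initiator,
        "sid": sid,
    })
    content = ET.SubElement(jingle, "content", {"creator": "initiator", "name": "voice"})
    # Application type: an RTP audio session with the codecs on offer.
    description = ET.SubElement(content, "description", {
        "xmlns": "urn:xmpp:jingle:apps:rtp:1",
        "media": "audio",
    })
    for payload_id, name in codecs:
        ET.SubElement(description, "payload-type", {"id": str(payload_id), "name": name})
    # Transport method: ICE over UDP (candidates omitted in this sketch).
    ET.SubElement(content, "transport", {"xmlns": "urn:xmpp:jingle:transports:ice-udp:1"})
    return ET.tostring(iq, encoding="unicode")


offer = session_initiate(
    "user1@domain1.com/pwdesktop",   # hypothetical initiator
    "user2@domain2.com/pwdesktop",   # hypothetical responder
    "a73sjjvkla37jfea",              # hypothetical session id
    [(96, "speex"), (0, "PCMU")],
)
print(offer)
```

Only this signaling travels through the XMPP servers. Once both sides agree, the audio itself flows over the negotiated transport.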
