A simple, modern IRC client library in C++. Part #1, the parser

Last updated July 7, 2019

I’ve started a side project: a “smart” chatterbot that should be able to respond to commands given by other users on IRC (Internet Relay Chat).

The first hurdle is to make the bot connect to a network and stay connected. In order to do that, it needs to understand the IRC format. The IRC protocol was first specified in RFC 1459 and was later split into separate specifications: RFC 2810 (architecture), RFC 2811 (channel), RFC 2812 (client) and RFC 2813 (server).

All code in this article is available from the SmartAss GitHub repository with a very permissive license (MIT). The actual IRC code is in its own library. This IRC library is designed like an engine. There is no connection management. The user program is expected to provide the inputs and outputs (from the network, a file or whatever). Most communication with this library happens through events and callbacks.

Prerequisites

To begin with, if you have some knowledge of IRC and the protocol, you can start with the IRC client document (RFC 2812) for making a client.

But before we get there, we must be able to break down each message from the server, and make something meaningful out of it. To do that, we need a parser.

IRC protocol format

The specification for the IRC protocol states that data is to be extracted from a continuous stream of octets, in which messages are separated by carriage return + newline (\r\n).

In other words, from an incoming buffer of characters, grab each line and parse it. Sounds easy, and it is.

When receiving data from a socket, you get some bytes and the length of the bytes. It’s not guaranteed that those bytes will be a full message, or many messages, or any message at all. But for brevity, we’ll assume for now that each read contains only complete messages.

Split by carriage return and newline (0x0D + 0x0A, decimal 13 and 10).

Some libraries contain a string split method, but most of them are designed to split on any one of the characters in a delimiter string. This is the opposite of what we want. We want to split strings by the sequence \r\n, and only \r\n, as the IRC specification tells us. It’s not too difficult to roll our own. Put this in a header somewhere and include it when necessary.
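When that assumption does not hold, the bytes left over from one read have to be kept until the rest of the line arrives. Here is a minimal sketch of such a buffer. The class name LineBuffer and its methods are made up for this illustration and are not part of the library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only, not part of the IRC library.
class LineBuffer
{
public:
    // Append one chunk as it came from the socket and return every complete
    // line, without its trailing "\r\n". An incomplete tail stays buffered.
    std::vector<std::string> feed(const char * data, std::size_t length)
    {
        m_pending.append(data, length);

        std::vector<std::string> lines;
        std::size_t offset = 0;
        std::size_t endpos = 0;

        while ((endpos = m_pending.find("\r\n", offset)) != std::string::npos)
        {
            lines.push_back(m_pending.substr(offset, endpos - offset));
            offset = endpos + 2;
        }

        // Drop what has been handed out, keep the unfinished rest
        m_pending.erase(0, offset);
        return lines;
    }

    // Bytes received so far that do not yet form a complete line
    const std::string & pending() const
    {
        return m_pending;
    }

private:
    std::string m_pending;
};
```

Each call to feed() returns zero or more complete lines, so the caller can pass every line straight to the parser without caring where the socket happened to cut the stream.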

// Not completely tested, use with caution ...
#include <cstddef>
#include <vector>

// Splits `in` on every occurrence of the exact sequence `delimiter`.
// The pieces are appended to `out`; empty pieces are kept.
template<typename STR>
void split(std::vector<STR> & out, const STR & in, const STR & delimiter)
{
    const auto npos = STR::npos;
    const auto delsize = delimiter.size();

    // An empty delimiter matches everywhere and would never advance
    if (delsize == 0)
    {
        out.push_back(in);
        return;
    }

    std::size_t offset = 0;
    std::size_t endpos = 0;

    while ((endpos = in.find(delimiter, offset)) != npos)
    {
        // Piece in front of the delimiter
        out.push_back(in.substr(offset, endpos - offset));

        // Prepare next round
        offset = endpos + delsize;
    }

    // Final piece, or the whole input if nothing was found
    out.push_back(in.substr(offset));
}
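To see why splitting on the exact sequence matters, here is a small comparison. The two functions are simplified stand-ins written for this illustration (their names are made up): one splits on any single character from a set, the other only on the complete delimiter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustration only: splits on ANY single character found in `delimiters`.
std::vector<std::string> splitAnyOf(const std::string & in, const std::string & delimiters)
{
    std::vector<std::string> out;
    std::size_t offset = 0;
    std::size_t endpos = 0;

    while ((endpos = in.find_first_of(delimiters, offset)) != std::string::npos)
    {
        out.push_back(in.substr(offset, endpos - offset));
        offset = endpos + 1;
    }

    out.push_back(in.substr(offset));
    return out;
}

// Illustration only: splits on the exact sequence `delimiter`.
std::vector<std::string> splitExact(const std::string & in, const std::string & delimiter)
{
    std::vector<std::string> out;

    // Guard: an empty delimiter would never advance the offset
    if (delimiter.empty())
    {
        out.push_back(in);
        return out;
    }

    std::size_t offset = 0;
    std::size_t endpos = 0;

    while ((endpos = in.find(delimiter, offset)) != std::string::npos)
    {
        out.push_back(in.substr(offset, endpos - offset));
        offset = endpos + delimiter.size();
    }

    out.push_back(in.substr(offset));
    return out;
}
```

For the input "a\rb\r\nc", splitAnyOf with the set "\r\n" cuts at the lone carriage return as well and produces an extra empty piece between \r and \n. splitExact with the delimiter "\r\n" yields just the two lines "a\rb" and "c".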

Then, for each line which is not empty, call the parseLine method below to create an IrcMessage. An IrcMessage is an object containing the decoded data in an easy-to-use interface.

Parsing of each line

The IRC format is easy (for a human) to understand. Parsing it both elegantly and efficiently is not quite as simple. I’m not saying I’ve done so, but here is my attempt.

IrcMessage IrcParser::parseLine(const std::string & message)
{
    if (message.empty())
    {
        // Garbage in, garbage out
        return IrcMessage();
    }

If there is no data, then just silently ignore it.

    // https://tools.ietf.org/html/rfc1459	-- Original specification
    // https://tools.ietf.org/html/rfc2810	-- Architecture specification
    // https://tools.ietf.org/html/rfc2811	-- Channel specification
    // https://tools.ietf.org/html/rfc2812	-- Client specification
    // https://tools.ietf.org/html/rfc2813	-- Server specification
    //
    // <message>  ::= [':' <prefix> <SPACE> ] <command> <params> <crlf>
    // <prefix>   ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
    // <command>  ::= <letter> { <letter> } | <number> <number> <number>
    // <SPACE>    ::= ' ' { ' ' }
    // <params>   ::= <SPACE> [ ':' <trailing> | <middle> <params> ]
    //
    // <middle>   ::= <Any *non-empty* sequence of octets not including SPACE
    //                or NUL or CR or LF, the first of which may not be ':'>
    // <trailing> ::= <Any, possibly *empty*, sequence of octets not including
    //                  NUL or CR or LF>
    //
    // <crlf>     ::= CR LF

    // Parameters are between command and trail
    auto trailDivider = message.find(" :");
    bool haveTrailDivider = trailDivider != message.npos;

    // Assemble outputs
    std::vector<std::string> parts;
    std::string prefix;
    std::string command;
    ParamType parameters;
    std::string trail;

The format can best be described like this, where everything in square brackets is optional:

[:prefix ]COMMAND[ parameter1 [parameter2]][ :trail]


    // With or without trail
    if (haveTrailDivider)
    {
        // Have trail, split by trail
        std::string uptotrail = message.substr(0, trailDivider);
        trail = message.substr(trailDivider + 2);
        boost::split(parts, uptotrail, boost::is_any_of(" "));
    }
    else
    {
        // No trail, the whole message is split by space
        boost::split(parts, message, boost::is_any_of(" "));
    }

Up to this point, we have found out whether there is a trail and, if so, split it off into the trail variable. The section in front of it has been split by space.

    enum class DecoderState
    {
        PREFIX,
        COMMAND,
        PARAMETER
    } state;

    bool first = true;
    state = DecoderState::PREFIX;

    for (const std::string & part : parts)
    {
        switch (state)
        {
            // Prefix, or command... have to be decided
            case DecoderState::PREFIX:
            case DecoderState::COMMAND:
            {
                // Prefix, aka origin of message
                bool havePrefix = part[0] == ':';

                if (havePrefix && first)
                {
                    // Oh the sanity
                    if (part.size() < 2)
                    {
                        return IrcMessage();
                    }

                    // Have prefix
                    state = DecoderState::COMMAND;
                    prefix = part.substr(1);
                    first = false;
                }
                else
                {
                    // Have command
                    state = DecoderState::PARAMETER;
                    command = part;
                }

                break;
            }
            case DecoderState::PARAMETER:
            {
                parameters.push_back(part);
                break;
            }
        }
    }

Then pick the remaining parts and figure out what they are. When prefix and command have been decided, anything that follows is a parameter to the command.

For example, the following lines are a join and then a message to a #channel.

:nick!user@10.0.0.1 JOIN #channel<<<
:nick!user@10.0.0.1 PRIVMSG #channel :test<<<


    // Construct an IrcMessage
    IrcMessage ircmsg(command, prefix, std::move(parameters), trail);
    m_Handles(ircmsg);

    return ircmsg;
}
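To make the result of the decoding concrete, here is a condensed, self-contained sketch of the same steps without Boost. ParsedLine and parseSketch are made-up names for this illustration. ParsedLine is only a stand-in for IrcMessage, whose definition is not shown in this article. Unlike the code above, the sketch skips empty tokens caused by runs of spaces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for IrcMessage, for illustration only.
struct ParsedLine
{
    std::string prefix;
    std::string command;
    std::vector<std::string> parameters;
    std::string trail;
};

ParsedLine parseSketch(const std::string & message)
{
    ParsedLine result;

    // Split off the trail, if any, at the first " :"
    std::string head = message;
    const std::size_t trailDivider = message.find(" :");

    if (trailDivider != std::string::npos)
    {
        head = message.substr(0, trailDivider);
        result.trail = message.substr(trailDivider + 2);
    }

    // Walk the space separated tokens in front of the trail
    std::size_t offset = 0;
    bool first = true;

    while (offset < head.size())
    {
        std::size_t endpos = head.find(' ', offset);

        if (endpos == std::string::npos)
        {
            endpos = head.size();
        }

        const std::string token = head.substr(offset, endpos - offset);
        offset = endpos + 1;

        if (token.empty())
        {
            continue; // Run of spaces, nothing to decode
        }

        if (first && token[0] == ':')
        {
            if (token.size() < 2)
            {
                return ParsedLine(); // A lone ':' is not a valid prefix
            }

            result.prefix = token.substr(1);
        }
        else if (result.command.empty())
        {
            result.command = token;
        }
        else
        {
            result.parameters.push_back(token);
        }

        first = false;
    }

    return result;
}
```

Given the line ":nick!user@10.0.0.1 PRIVMSG #channel :test" (without the line ending), the sketch produces the prefix nick!user@10.0.0.1, the command PRIVMSG, the single parameter #channel and the trail test.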

After this, maintaining a stable IRC state is much easier, because every message arrives in the same format: the type IrcMessage. The parser is in this file: https://github.com/Studiofreya/smartass/blob/master/irclib/IrcParser.cpp.
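The parser above keeps the prefix as one string. According to the grammar quoted in the code comments, a prefix is either a server name or a nick with an optional user and host part. Should those parts be needed separately, they could be split along these lines. Origin and splitPrefix are hypothetical names for this sketch and are not taken from the library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type for illustration only.
struct Origin
{
    std::string name; // Nick, or the server name if there is no '!' or '@'
    std::string user;
    std::string host;
};

// <prefix> ::= <servername> | <nick> [ '!' <user> ] [ '@' <host> ]
Origin splitPrefix(const std::string & prefix)
{
    Origin origin;
    std::string rest = prefix;

    // Host is everything after '@'
    const std::size_t at = rest.find('@');

    if (at != std::string::npos)
    {
        origin.host = rest.substr(at + 1);
        rest.erase(at);
    }

    // User is everything between '!' and the host
    const std::size_t bang = rest.find('!');

    if (bang != std::string::npos)
    {
        origin.user = rest.substr(bang + 1);
        rest.erase(bang);
    }

    // Whatever remains is the nick or the server name
    origin.name = rest;
    return origin;
}
```

For the prefix nick!user@10.0.0.1 from the example above, this gives the name nick, the user user and the host 10.0.0.1. A bare server name ends up in name with user and host left empty.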

The full code is available at GitHub.