PCRE Alternative Branch GOTCHA

I've just stumbled upon a quick GOTCHA to note with PCRE and PHP (although it likely applies to any language using PCRE) - if you're using the alternative branch syntax to handle multiple patterns (using "|" - the pipe character), you need to order the branches according to the order in which you want them to be tried.

What the hell does that mean? Let's take the case where I got caught out - doing a quick preg_replace() to turn newlines into XHTML <br /> elements. As many of you will know, Windows, UNIX and Macs have different conventions for end-of-line sequences - UNIX (and Mac OS X) uses just the newline character (\n), classic Mac OS uses just the carriage return character (\r) and Windows uses both (\r\n). So, our replace looks something like:

preg_replace('/\n|\r|\r\n/', "<br />\n", $data);


You see, at each position in the string PCRE tries the branches from left to right and takes the first one that matches (as opposed to trying every branch and keeping the longest match, as I had expected). So, what happens to our strings made in Windows? You guessed it - when it reaches the \r, the \n branch fails but the \r branch matches, so the \r is replaced with our BR. It then finds a \n immediately following it - it changes that into a BR too. The \r\n branch never even gets tried. All that hard work, when what we really wanted was for it to just make a single line break from "\r\n".
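Here's a minimal sketch of that behaviour, assuming a made-up sample string in $data with a single Windows line ending (the variable names are just for illustration). Counting the matches with preg_match_all() shows the \r and the \n being picked up separately:

```php
<?php
// Hypothetical sample input: two lines separated by a Windows line ending.
$data = "line one\r\nline two";

// Branches are tried left to right at each position; the first one that matches wins.
$count = preg_match_all('/\n|\r|\r\n/', $data, $matches);

// The \r and the \n are matched separately, so we get two matches, not one.
var_dump($count);                       // int(2)
var_dump($matches[0] === ["\r", "\n"]); // bool(true)

// The same thing happens in the replace: two BRs for one line ending.
$html = preg_replace('/\n|\r|\r\n/', "<br />\n", $data);
var_dump($html === "line one<br />\n<br />\nline two"); // bool(true)
```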

The solution? Just reorder the branches so that the longest option (\r\n) is checked first. Fix the pattern to be "/\r\n|\n|\r/", and things work great.
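A quick sketch of the reordered pattern in action, again with a made-up $data string (this one mixes all three line-ending conventions, purely as an illustration):

```php
<?php
// Hypothetical sample input mixing Windows, UNIX and classic Mac line endings.
$data = "windows\r\nunix\nold mac\rend";

// Longest branch first, so "\r\n" is consumed as a single unit.
$html = preg_replace('/\r\n|\n|\r/', "<br />\n", $data);

// Each line ending, whatever its flavour, becomes exactly one BR.
var_dump($html === "windows<br />\nunix<br />\nold mac<br />\nend"); // bool(true)
```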

Offtopic -1: Two posts in two days? And the RSS feed fixed? What's going on!?!?!