I've done some prototyping as well, no actual splitting yet though.
<?php
/**
* Trim reset, default colour and default charset sequences from
* the beginning and end of the string (or strings, if an array is
* passed, in which case it will return an array instead of a string).
* This does not trim whitespace or any other characters that trim() trims.
*/
function trim_sequences($string)
{
    return preg_replace('/^(?:\^[89L])*(.*?)(?:\^[0-9LGCJETBHSK])*$/', '$1', $string);
}
/**
* Strip redundant or chained sequences.
*
* Example:
* '^H^S^4Stuff^8^9Other stuff'
* becomes
* '^S^4Stuff^8Other stuff'
*/
function optimise($string)
{
    $patterns = array(
        '/(\^[012345679])+/', // colours and colrst
        '/(\^[LGCJETBHSK])+/', // charsets
        // full reset (col and cp) sequence ^8
        '/(?:\^[0-9LGCJETBHSK])*(\^8)(?:\^[89L])*/',
        '/\^8(\^[0-7]\^[GCJETBHSK]|\^[GCJETBHSK]\^[0-7])/',
    );
    // strip redundant / chained sequences
    return preg_replace($patterns, '$1', $string);
}
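/**
* Determine if the given byte is the lead byte of a multibyte
* character in the given codepage, passed as LFS language identifier.
* $char may be a one-byte string or its byte value (0-255).
*/
function isMultiByte($char, $cp)
{
    // (int)"\x81" evaluates to 0, so byte strings have to go through ord()
    $char = is_string($char) ? ord($char) : (int)$char;
    switch ($cp)
    {
        case 'J':
            // assuming 'J' is Shift-JIS (CP932): lead bytes 0x81-0x9F and 0xE0-0xFC
            return (($char >= 0x81 && $char <= 0x9F) || ($char >= 0xE0 && $char <= 0xFC));
        case 'K':
        case 'H':
        case 'S':
            return ($char > 0x80);
        default:
            return false;
    }
}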
?>
trim_sequences passes all the test cases I could think of.
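For reference, here is the kind of check I mean. The inputs are made up for illustration, not captured from LFS, and the function is repeated only so the snippet runs on its own.

```php
<?php
// Copy of trim_sequences() from the post, repeated so this runs standalone.
function trim_sequences($string)
{
    return preg_replace('/^(?:\^[89L])*(.*?)(?:\^[0-9LGCJETBHSK])*$/', '$1', $string);
}

// Hypothetical inputs paired with the result I would expect.
$cases = array(
    '^8^LHello^9'   => 'Hello',        // leading and trailing resets removed
    '^4Red^8'       => '^4Red',        // a leading colour is kept
    '^L^9Plain^H^2' => 'Plain',        // any trailing sequence is dropped
    'No sequences'  => 'No sequences', // untouched
    '^8^9'          => '',             // nothing but resets
);

foreach ($cases as $input => $expected) {
    $actual = trim_sequences($input);
    printf("%-14s -> %-16s %s\n", $input, var_export($actual, true),
        $actual === $expected ? 'ok' : 'FAIL');
}

// Arrays are handled by preg_replace() itself and come back as arrays.
var_export(trim_sequences(array('^8A', 'B^9'))); // array('A', 'B')
```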
optimise also does what it's supposed to do, but it does not detect redundancy across the whole string; it only collapses a chain of adjacent sequences. I didn't know how to achieve the former with regex assertions and didn't want to iterate over the string, because the actual splitter is going to do that anyway. Put simply, it will correct
^L^B^H^HSomething
to
^HSomething
but
^L^B^H^HSome^Hthing
will still have that redundant ^H between 'Some' and 'thing', i.e.
^HSome^Hthing
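Since the splitter has to walk the string anyway, the full-width case could probably be handled there by tracking the active colour and codepage. A rough sketch, untested against LFS: optimise_full is a made-up name, and it assumes that every string starts in the default colour and the ^L codepage, and that sequences with no text after them can simply be dropped.

```php
<?php
/**
 * Hypothetical helper, not one of the prototypes above: drop every
 * sequence that does not change the colour / codepage state.
 */
function optimise_full($string)
{
    // Split into text and sequences, keeping the sequences as tokens.
    $tokens = preg_split('/(\^[0-9LGCJETBHSK])/', $string, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Assumption: strings start in default colour (9) and default codepage (L).
    $emitted = array('col' => '9', 'cp' => 'L');
    $wanted  = $emitted;
    $out     = '';

    foreach ($tokens as $token) {
        if (preg_match('/^\^([0-9LGCJETBHSK])$/D', $token, $m)) {
            // A sequence only updates the state we want; nothing is written yet.
            if ($m[1] === '8') {
                $wanted = array('col' => '9', 'cp' => 'L'); // full reset
            } elseif (is_numeric($m[1])) {
                $wanted['col'] = $m[1];                     // colour or colour reset
            } else {
                $wanted['cp'] = $m[1];                      // codepage
            }
            continue;
        }

        // Text: write only the sequences needed to reach the wanted state.
        $colChange = $wanted['col'] !== $emitted['col'];
        $cpChange  = $wanted['cp'] !== $emitted['cp'];
        if ($colChange && $cpChange && $wanted['col'] === '9' && $wanted['cp'] === 'L') {
            $out .= '^8'; // one full reset instead of ^L^9
        } else {
            if ($cpChange)  { $out .= '^' . $wanted['cp']; }
            if ($colChange) { $out .= '^' . $wanted['col']; }
        }
        $emitted = $wanted;
        $out .= $token;
    }
    return $out;
}

echo optimise_full('^L^B^H^HSome^Hthing'), "\n";        // ^HSomething
echo optimise_full('^H^S^4Stuff^8^9Other stuff'), "\n"; // ^S^4Stuff^8Other stuff
```

Because sequences are only written when text follows them, this also does the trailing half of what trim_sequences does.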
isMultiByte is completely untested, but it should work under the unconfirmed assumption that none of the multibyte codepages LFS uses have characters wider than 2 bytes. If the last byte before your split point tests positive with isMultiByte, split one byte earlier and you should be fine.
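One caveat with testing a single byte: in these codepages a trail byte can fall into the lead-byte range, so a lone check can misfire. A sketch that walks from the start of the text instead is below. Both function names are made up, the lead-byte ranges are assumptions (CP932 for 'J', anything above 0x80 for K/H/S), and it expects a run of text in one codepage with no ^ sequences in it.

```php
<?php
// Hypothetical lead-byte test on a byte value (0-255). Ranges are assumptions.
function is_lead_byte($byte, $cp)
{
    switch ($cp) {
        case 'J':
            return ($byte >= 0x81 && $byte <= 0x9F) || ($byte >= 0xE0 && $byte <= 0xFC);
        case 'K':
        case 'H':
        case 'S':
            return $byte > 0x80;
        default:
            return false;
    }
}

/**
 * Largest offset <= $max at which $text can be cut without splitting
 * a 2-byte character. Assumes characters are at most 2 bytes wide.
 */
function safe_split_offset($text, $max, $cp)
{
    $len = strlen($text);
    if ($max >= $len) {
        return $len; // nothing to cut
    }
    $i = 0;
    while ($i < $max) {
        $step = is_lead_byte(ord($text[$i]), $cp) ? 2 : 1;
        if ($i + $step > $max) {
            break; // the next character would straddle the limit
        }
        $i += $step;
    }
    return $i;
}

// Two 2-byte characters; a limit of 3 bytes falls inside the second one.
$text = "\x82\xA0\x82\xA2";
$cut  = safe_split_offset($text, 3, 'J'); // 2
$head = substr($text, 0, $cut);
$tail = substr($text, $cut);
```

The cost is a linear scan per cut, which should not matter for strings as short as the ones LFS sends.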
As I said initially though, all of these are prototypes and haven't undergone any actual testing with PRISM / LFSWorldSDK, or with raw captured strings from LFS for that matter.