A simple example to get you started
Based on code generated by the superb RegexBuddy software (the price is great value!), here’s a simple example of using the PCRE regular expression library to search a UTF-8 text buffer for strings of Arabic text. The actual regular expression is very simple: ([\x{600}-\x{6FF}]+), which appears in the C source as "([\\x{600}-\\x{6FF}]+)" because each backslash has to be escaped inside a C string literal. It just looks for sequences of Unicode code points from 600 (hex) to 6FF (hex), i.e. the Unicode Arabic block. It is not a particularly efficient function (for example, it should calculate the buffer length once rather than on every call), but it works.
I used code like this in an Arabic text pre-processor I wrote for working with XeTeX: saving Arabic strings to a file (from XeTeX), processing the text and reading it back in via \input{...}. Special effects not directly possible in XeTeX can be achieved by a pre-processing step. Yep, it involves lots of \write18{...} calls. For sure LuaTeX offers many other possibilities, but XeTeX’s font handling (and use of HarfBuzz) is very convenient indeed!
#include <string.h>
#include <pcre.h>

// Supplied by you: whatever you want to do with each Arabic text string.
// (This prototype is an assumption; adapt it to your own function.)
void process_string(const char *match, int wordcount);

// Called with a NUL-terminated buffer containing UTF-8 encoded text
void runpcre(unsigned char *buffer)
{
    const char *subject = (const char *)buffer; // the PCRE API works with const char *
    int wordcount = 0;
    pcre *myregexp;
    const char *error;
    int erroroffset;
    int offsetcount;
    int offsets[(1+1)*3]; // (max_capturing_groups+1)*3
    const char *res;

    myregexp = pcre_compile("([\\x{600}-\\x{6FF}]+)", PCRE_UTF8|PCRE_UCP, &error, &erroroffset, NULL);
    if (myregexp != NULL) {
        offsetcount = pcre_exec(myregexp, NULL, subject, (int)strlen(subject), 0, 0, offsets, (1+1)*3);
        while (offsetcount > 0) {
            // match offset = offsets[0];
            // match length = offsets[1] - offsets[0];
            if (pcre_get_substring(subject, offsets, offsetcount, 0, &res) >= 0) {
                wordcount++;
                // Do something with the match we just stored into res
                process_string(res, wordcount);
                // res was allocated by pcre_get_substring, so release it
                pcre_free_substring(res);
            }
            // Resume the search at the end of the previous match
            offsetcount = pcre_exec(myregexp, NULL, subject, (int)strlen(subject), offsets[1], 0, offsets, (1+1)*3);
        }
        pcre_free(myregexp);
    } else {
        // DOH! Syntax error in the regular expression at erroroffset
        // (error points to a message describing the problem)
    }
}
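If you want to see what that character class actually matches at the byte level, here is a rough sketch that finds the same runs without PCRE. It is not the code from my pre-processor, just an illustration that relies on one fact about UTF-8: every code point from U+0600 to U+06FF is encoded as two bytes, a lead byte in 0xD8..0xDB followed by a continuation byte in 0x80..0xBF. The names is_arabic_pair and count_arabic_runs are made up for this example, and the buffer is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

// Returns 1 if the two bytes at p encode a code point in U+0600..U+06FF.
// U+0600 is D8 80 and U+06FF is DB BF in UTF-8, so the whole block is
// "lead byte D8..DB, then any continuation byte 80..BF".
static int is_arabic_pair(const unsigned char *p)
{
    // p[1] is only read when p[0] is a non-NUL lead byte, so this never
    // reads past the terminating NUL of the buffer.
    return p[0] >= 0xD8 && p[0] <= 0xDB && p[1] >= 0x80 && p[1] <= 0xBF;
}

// Hypothetical helper: counts maximal runs of Arabic-block code points.
// The byte offset and byte length of the first run are stored in
// *first_off and *first_len (left untouched if there is no run).
static int count_arabic_runs(const unsigned char *buf, size_t *first_off, size_t *first_len)
{
    size_t i = 0;
    int runs = 0;

    assert(buf != NULL && first_off != NULL && first_len != NULL);

    while (buf[i] != '\0') {
        if (!is_arabic_pair(buf + i)) {
            i++;            // not the start of an Arabic code point: skip one byte
            continue;
        }
        size_t start = i;
        while (is_arabic_pair(buf + i)) {
            i += 2;         // consume one two-byte code point
        }
        if (runs == 0) {
            *first_off = start;
            *first_len = i - start;
        }
        runs++;
    }
    return runs;
}
```

Calling count_arabic_runs on a buffer holding "abc ", then the UTF-8 bytes of an Arabic word, then " xyz" reports one run starting at byte offset 4, which is the same offset PCRE would put in offsets[0]. The hand-rolled version only handles this single two-byte range; PCRE is still the better tool as soon as the pattern becomes any more interesting.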