Searching for Arabic text in UTF-8 encoding using PCRE

A simple example to get you started

Based on code generated by the superb RegexBuddy software (the price is great value!), here’s a simple example of using the PCRE regular expression library to search a UTF-8 text buffer for strings of Arabic text. The actual regular expression is very simple: ([\x{600}-\x{6FF}]+). It just looks for sequences of Unicode codepoints from U+0600 to U+06FF (the main Arabic block). It’s not a particularly polished function (error handling is minimal), but it works.

I used code like this in an Arabic text pre-processor I wrote for working with XeTeX: saving Arabic strings to a file (from XeTeX), processing the text, and reading it back in via \input{...}. Special effects that aren’t directly possible in XeTeX can be achieved with a pre-processing step like this. Yep, it involves lots of \write18{...} calls. LuaTeX certainly offers many other possibilities, but XeTeX’s font handling (and use of HarfBuzz) is very convenient indeed!
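To give a flavour of the XeTeX side, a minimal sketch of that write-out/process/read-back cycle might look something like the snippet below. The file names and the arabicprep command are made up purely for illustration, and it assumes xelatex is run with shell escape enabled so that \write18{...} actually executes.

\newwrite\arabicout
\immediate\openout\arabicout=arabic-in.txt
% write the Arabic string(s) to a file for the external pre-processor
\immediate\write\arabicout{...Arabic text to pre-process...}
\immediate\closeout\arabicout
% run the (hypothetical) pre-processor, then read its output back in
\immediate\write18{arabicprep arabic-in.txt arabic-out.tex}
\input{arabic-out.tex}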

#include <string.h>
#include <pcre.h>

// process_string is whatever you want to do with each Arabic string found
// (a sample definition is sketched further down)
void process_string(const char *text, int wordcount);

// Called with a NUL-terminated buffer containing UTF-8 encoded text
void runpcre(unsigned char *buffer)
{
	pcre *myregexp;
	const char *error;
	int erroroffset;
	int offsetcount;
	int offsets[(1+1)*3]; // (max_capturing_groups+1)*3
	const char *res;
	int wordcount = 0;
	int buflen = (int) strlen((const char *) buffer); // calculate the buffer length once

	// PCRE_UTF8: pattern and subject are UTF-8; PCRE_UCP: use Unicode properties
	myregexp = pcre_compile("([\\x{600}-\\x{6FF}]+)", PCRE_UTF8|PCRE_UCP, &error, &erroroffset, NULL);
	if (myregexp != NULL) {
		offsetcount = pcre_exec(myregexp, NULL, (const char *) buffer, buflen, 0, 0, offsets, (1+1)*3);
		while (offsetcount > 0) {
			// match offset = offsets[0];
			// match length = offsets[1] - offsets[0];
			if (pcre_get_substring((const char *) buffer, offsets, offsetcount, 0, &res) >= 0) {
				wordcount++;
				// Do something with the match we just stored in res:
				// process_string could be whatever you want to do with the Arabic text string
				process_string(res, wordcount);
				pcre_free_substring(res);
			}
			// Continue the search from the end of the previous match
			offsetcount = pcre_exec(myregexp, NULL, (const char *) buffer, buflen, offsets[1], 0, offsets, (1+1)*3);
		}
		pcre_free(myregexp);
	} else {
		// DOH! Syntax error in the regular expression at erroroffset
	}
}
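For completeness, here’s a rough sketch of how runpcre might be driven: read a whole UTF-8 file into a NUL-terminated buffer and hand it to the function above. The process_string shown here just prints each match with its index; in my pre-processor it transformed the string and wrote it out for XeTeX to \input back in. The file handling is deliberately minimal and purely illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void runpcre(unsigned char *buffer); // defined above

// Placeholder handler: print each Arabic match with its index
void process_string(const char *text, int wordcount)
{
	printf("%d: %s\n", wordcount, text);
}

int main(int argc, char *argv[])
{
	FILE *fp;
	long size;
	unsigned char *buffer;

	if (argc < 2) {
		fprintf(stderr, "usage: %s utf8-file\n", argv[0]);
		return 1;
	}
	fp = fopen(argv[1], "rb");
	if (fp == NULL) return 1;
	fseek(fp, 0, SEEK_END);
	size = ftell(fp);
	rewind(fp);
	buffer = malloc(size + 1);
	if (buffer == NULL) { fclose(fp); return 1; }
	fread(buffer, 1, size, fp);
	buffer[size] = '\0'; // runpcre expects a NUL-terminated buffer
	fclose(fp);

	runpcre(buffer);
	free(buffer);
	return 0;
}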