{"id":2964,"date":"2013-07-29T23:22:09","date_gmt":"2013-07-29T23:22:09","guid":{"rendered":"http:\/\/www.readytext.co.uk\/?p=2964"},"modified":"2013-11-28T07:45:12","modified_gmt":"2013-11-28T07:45:12","slug":"searching-for-arabic-text-in-utf-8-encoding-using-pcre","status":"publish","type":"post","link":"https:\/\/www.readytext.co.uk\/?p=2964","title":{"rendered":"Searching for Arabic text in UTF-8 encoding using PCRE"},"content":{"rendered":"<h1>A simple example to get you started<\/h1>\n<p>Based on code generated by the <em>superb<\/em> <a href=\"http:\/\/www.regexbuddy.com\/\">RegexBuddy<\/a> software (the price is great value!), here&#8217;s a simple example of using the <a href=\"http:\/\/www.pcre.org\/\">PCRE regular expression library<\/a> to search a UTF-8 text buffer for strings of Arabic text. The actual regular expression is very simple: <code>([\\\\x{600}-\\\\x{6FF}]+)<\/code> &ndash; it just looks for sequences of Unicode code points from U+0600 to U+06FF, the basic Arabic block. The function below is not particularly efficient (for example, it should calculate the buffer length only once), but it works. <\/p>\n<p>I used code like this in an Arabic text pre-processor I wrote for working with XeTeX: saving Arabic strings to a file (from XeTeX), processing the text and reading it back in via <code>\\input{...}<\/code>. Special effects not directly possible in XeTeX can be achieved by a pre-processing step. Yep, that involves lots of <code>\\write18{...}<\/code> calls. 
LuaTeX certainly offers many other possibilities, but XeTeX&#8217;s font handling (and its use of <a href=\"http:\/\/www.freedesktop.org\/wiki\/Software\/HarfBuzz\/\">HarfBuzz<\/a>) is very convenient indeed!<\/p>\n<pre class=\"brush: plain; light: false; title: ; toolbar: true; notranslate\" title=\"\">\r\n#include &lt;string.h&gt;\r\n#include &lt;pcre.h&gt;\r\n\r\n\/\/ Called with a NUL-terminated buffer containing UTF-8 encoded text\r\nvoid runpcre(unsigned char * buffer)\r\n{\r\n\r\nconst char *subject = (const char *) buffer; \/\/ PCRE and strlen expect char pointers\r\nint wordcount;\r\npcre *myregexp;\r\nconst char *error;\r\nint erroroffset;\r\nint offsetcount;\r\nint offsets&#x5B;(1+1)*3]; \/\/ (max_capturing_groups+1)*3\r\nconst char *res; \/\/ pcre_get_substring stores a newly allocated copy of the match here\r\nwordcount = 0;\r\n\r\nmyregexp = pcre_compile(&quot;(&#x5B;\\\\x{600}-\\\\x{6FF}]+)&quot;,   PCRE_UTF8|PCRE_UCP  , &amp;error, &amp;erroroffset, NULL);\r\nif (myregexp != NULL) {\r\n\toffsetcount = pcre_exec(myregexp, NULL, subject, (int) strlen(subject), 0, 0, offsets, (1+1)*3);\r\n\twhile (offsetcount &gt; 0) {\r\n\t\t\/\/ match offset = offsets&#x5B;0];\r\n\t\t\/\/ match length = offsets&#x5B;1] - offsets&#x5B;0];\r\n\t\t\/\/ Pass offsets itself (an int *), not its address\r\n\t\tif (pcre_get_substring(subject, offsets, offsetcount, 0, &amp;res) &gt;= 0) {\r\n\t\t\t\r\n\t\t\twordcount++;\r\n\t\t\t\/\/ Do something with the match we just stored into res\r\n\t\t\t\/\/ process_string could be whatever you want to do with the Arabic text string\r\n\t\t\tprocess_string(res, wordcount);   \r\n\t\t\tpcre_free_substring(res); \/\/ release the copy made by pcre_get_substring\r\n\t\t}\r\n\t\t\/\/ Resume the search at the end of the previous match\r\n\t\toffsetcount = pcre_exec(myregexp, NULL, subject, (int) strlen(subject), offsets&#x5B;1], 0, offsets, (1+1)*3);\r\n\t} \r\n\tpcre_free(myregexp); \/\/ release the compiled pattern\r\n} else {\r\n\t\/\/ DOH! Syntax error in the regular expression at erroroffset (error points to the message)\r\n}\r\n\r\n}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A simple example to get you started Based on code generated by the superb RegexBuddy software (the price is great value!), here&#8217;s a simple example of using the PCRE regular expression library to search a UTF-8 text buffer for strings of Arabic text. 
The actual regular expression is very simple: ([\\\\x{600}-\\\\x{6FF}]+) &ndash; it just looks [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12,17,10],"tags":[],"class_list":["post-2964","post","type-post","status-publish","format-standard","hentry","category-arabic","category-unicode-arabic","category-unicode"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2964","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2964"}],"version-history":[{"count":18,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2964\/revisions"}],"predecessor-version":[{"id":3214,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/2964\/revisions\/3214"}],"wp:attachment":[{"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2964"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2964"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.readytext.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2964"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}