Tuesday, 4 May 2010

Oracle and unicode characters. Part 1 Content length

In the application I am working on we came across the need to display non-latin characters. We needed to display both Russian and Chinese characters. The Oracle 10g R2 database is configured for UTF-8. Our application is built in Application Express 3.1. Our charts are generated by outputting SVG from PL/SQL. Also PowerPoint presentations are generated through a Java servlet using Apache POI. This servlet is called from PL/SQL.

The non-latin character should be shown in the Apex application, the SVG charts and the PowerPoint output. The Apex application turned out to be no problem. The other forms of output however have cost me a considerable amount of time to solve. During this quest I stumbled over several problems for which I could not easily find the answers using Google. This surprised me, I expected that more people would have had the same problems already.
But this gives me more reason to writes these posts adressing several problems I encountered.
This postwill deal with the content-length of a stream with multi byte from PL/SQL. The next post will discuss which font to use and how to use it. The last post will be about the XML encoding of UTF-8 characters from PL/SQL.

The database is configured for UTF-8 so there is no problem storing the characters in unicode. The data is delivered in UTF-8 files and these characters are loaded into the database using SQL*Loader without any problem.
Sometimes the data is delivered as an Excel file. Then the file can be opened in Excel and saved as Unicode Text. To edit the file we use the Notepad++ editor that supports UTF-8 very well. Do not forget to set the encoding to UTF8 without BOM before saving.

Application Express supports UTF-8 natively. No modification was needed to show the non-latin characters. However in our application we use SVG for charting and a Java servlet for generating PowerPoint presentations. These two have caused me some headaches and a few sleepless nights.

I will discuss the solution for the SVG chart in this post, the generation of the PowerPoint through a servlet will be the subject of my next post.

The SVG charting is coded in PL/SQL. The SVG is built in a CLOB that is output through the HTP package. In the XML header of the SVG the UTF-8 encoding should be specified:


<?xml version="1.0" encoding="utf-8"?>


The UTF-8 characters are added to the SVG without modification. When we generated the first charts with UTF-8 we got errors of incorrect XML. After a lot of searching it turned out that the wrong content length was specified in the download header. We used the value returned by the function DBMS_LOB.GETLENGTH. This function returns however the number of characters in a CLOB. The content-length in the download header should contain the number of bytes.
When all the characters are ASCII characters these numbers are equal. However all of the non-ASCII characters in UTF-8 use more than one byte per character. When there are mutli byte characters in the CLOB the length of the CLOB will be less than the number of characters so the browser will expect less bytes than the total length of the SVG hence it receives not well formed XML. A solution for this problem is to determine the length of the clob in bytes using the following function that utilises the function LENGTHB to determine the length in bytes of a varchar2.


function get_clob_length
( p_clob clob
) return number is
l_no_of_pieces number := null;
l_bufsize number := 2000;
l_string varchar2(10000) := null;
l_start number := 1;
l_length number := null;
l_amount number := null;
l_return number := 0;
begin
l_length := dbms_lob.getlength(p_clob);
l_no_of_pieces := trunc(l_length/l_bufsize) + sign(mod(l_length,l_bufsize));
for i in 1..l_no_of_pieces loop
l_amount := least(l_bufsize,l_length-l_start+1);
l_string := dbms_lob.substr(p_clob,l_amount,l_start);
l_return := l_return + lengthb(l_string);
l_start := l_start + l_bufsize;
end loop;
return(l_return);
end;

Using this function to determine the right value of content-length solved the problem of the not well formed XML. The SVG charts showed, with Russian characters! However, the Chinese characters showed as small squares. Read the next post to find out about the mystery of the missing Chinese characters.