com.ibm.icu.text

Class UnicodeCompressor

public final class UnicodeCompressor extends Object implements SCSU

A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

SCSU works with dynamically positioned windows, each covering 128 consecutive Unicode characters. During compression, a character that falls inside a window is encoded in the compressed stream as a single byte in the range 0x80 - 0xFF. SCSU provides transparency for the characters (bytes) between U+0000 and U+00FF. The result approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
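The window arithmetic can be sketched in a few lines. The following is a minimal illustration only, not the ICU implementation: the class and method names are hypothetical, the window base is passed in by the caller, and tag bytes, window switching, and Unicode mode are all ignored.

```java
// Illustrative sketch only; names are hypothetical and not part of ICU4J.
class ScsuWindowSketch {

    /** Number of consecutive characters covered by one window. */
    static final int WINDOW_SIZE = 128;

    /** Returns true if c lies inside the window that starts at windowBase. */
    static boolean inWindow(int windowBase, char c) {
        return c >= windowBase && c < windowBase + WINDOW_SIZE;
    }

    /**
     * Returns the single byte value (0x80..0xFF) that stands for c
     * relative to the window that starts at windowBase.
     */
    static int windowByte(int windowBase, char c) {
        if (!inWindow(windowBase, c)) {
            throw new IllegalArgumentException("character is outside the window");
        }
        return 0x80 + (c - windowBase);
    }
}
```

For example, assuming a window positioned at U+0400, the character U+0410 lies 0x10 positions into the window, so windowByte(0x0400, (char) 0x0410) yields 0x90.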

USAGE

The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:

  String s = ... ; // get string from somewhere
  byte [] compressed = UnicodeCompressor.compress(s);
 

The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:

  // Compress an array "chars" of length "len" to the OutputStream "out",
  // using a buffer of 512 bytes.
  // Requires: import com.ibm.icu.text.UnicodeCompressor;

  final int         BUFSIZE              = 512; // local constant; must be at least 4
  UnicodeCompressor myCompressor         = new UnicodeCompressor();
  byte []           byteBuffer           = new byte [ BUFSIZE ];
  int               bytesWritten         = 0;
  int []            unicharsRead         = new int [1];
  int               totalCharsCompressed = 0;
  int               totalBytesWritten    = 0;

  do {
    // compress as many of the remaining characters as fit in byteBuffer
    bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                         len, unicharsRead,
                                         byteBuffer, 0, BUFSIZE);

    // do something with the current set of bytes
    // (OutputStream.write may throw IOException)
    out.write(byteBuffer, 0, bytesWritten);

    // update the no. of characters compressed
    totalCharsCompressed += unicharsRead[0];

    // update the no. of bytes written
    totalBytesWritten += bytesWritten;

  } while(totalCharsCompressed < len);

  myCompressor.reset(); // reuse compressor
 

Author: Stephen F. Booth

UNKNOWN: ICU 2.4

Constructor Summary
UnicodeCompressor()
Create a UnicodeCompressor.
Method Summary
static byte[] compress(String buffer)
Compress a string into a byte array.
static byte[] compress(char[] buffer, int start, int limit)
Compress a Unicode character array into a byte array.
int compress(char[] charBuffer, int charBufferStart, int charBufferLimit, int[] charsRead, byte[] byteBuffer, int byteBufferStart, int byteBufferLimit)
Compress a Unicode character array into a byte array.
void reset()
Reset the compressor to its initial state.

Constructor Detail

UnicodeCompressor

public UnicodeCompressor()
Create a UnicodeCompressor. Sets all windows to their default values.

See Also: UnicodeCompressor

UNKNOWN: ICU 2.4

Method Detail

compress

public static byte[] compress(String buffer)
Compress a string into a byte array.

Parameters: buffer - The string to compress.

Returns: A byte array containing the compressed characters.

See Also: compress(char[], int, int)

UNKNOWN: ICU 2.4

compress

public static byte[] compress(char[] buffer, int start, int limit)
Compress a Unicode character array into a byte array.

Parameters:
buffer - The character buffer to compress.
start - The start of the character run to compress.
limit - The limit of the character run to compress.

Returns: A byte array containing the compressed characters.

See Also: compress(String)

UNKNOWN: ICU 2.4

compress

public int compress(char[] charBuffer, int charBufferStart, int charBufferLimit, int[] charsRead, byte[] byteBuffer, int byteBufferStart, int byteBufferLimit)
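Compress a Unicode character array into a byte array. This method consumes only as much input as can be completely written to the output buffer, so a caller that needs the whole run compressed should check charsRead and call again for the remaining characters.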

Parameters:
charBuffer - The character buffer to compress.
charBufferStart - The start of the character run to compress.
charBufferLimit - The limit of the character run to compress.
charsRead - A one-element array. If not null, on return its first element holds the number of characters read from charBuffer.
byteBuffer - A buffer to receive the compressed data. This buffer must be at least four bytes in size.
byteBufferStart - The starting offset at which to write compressed data.
byteBufferLimit - The limiting offset for writing compressed data.

Returns: The number of bytes written to byteBuffer.
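
The consume-only-what-fits contract means the caller drives the loop. The sketch below illustrates that pattern without depending on ICU4J. Everything in it is an assumption made for illustration: ChunkCompressor is a hypothetical interface that mirrors the shape of the method above, and TWO_BYTE_STAND_IN is a trivial stand-in encoder (two bytes per char, not SCSU) that stops as soon as the next character would not fit completely.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; none of these names are part of ICU4J.
class ChunkedCompressSketch {

    /** Hypothetical interface with the same shape as the instance compress method. */
    interface ChunkCompressor {
        int compress(char[] charBuffer, int charBufferStart, int charBufferLimit,
                     int[] charsRead,
                     byte[] byteBuffer, int byteBufferStart, int byteBufferLimit);
    }

    /**
     * Stand-in encoder, NOT SCSU: writes each char as two big-endian bytes
     * and consumes a char only if both of its bytes fit in the output range.
     */
    static final ChunkCompressor TWO_BYTE_STAND_IN =
        (chars, start, limit, charsRead, bytes, byteStart, byteLimit) -> {
            int in = start;
            int out = byteStart;
            while (in < limit && out + 2 <= byteLimit) {
                bytes[out++] = (byte) (chars[in] >>> 8);
                bytes[out++] = (byte) chars[in];
                in++;
            }
            if (charsRead != null) {
                charsRead[0] = in - start; // characters actually consumed
            }
            return out - byteStart;        // bytes actually written
        };

    /** Calls the compressor repeatedly until all input is consumed. */
    static byte[] drain(ChunkCompressor compressor, char[] chars, int bufSize) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] byteBuffer = new byte[bufSize];
        int[] charsRead = new int[1];
        int consumed = 0;
        while (consumed < chars.length) {
            int written = compressor.compress(chars, consumed, chars.length,
                                              charsRead,
                                              byteBuffer, 0, byteBuffer.length);
            if (charsRead[0] == 0) {
                // Guard against an endless loop when nothing fits.
                throw new IllegalStateException("no progress; output buffer too small");
            }
            sink.write(byteBuffer, 0, written);
            consumed += charsRead[0];
        }
        return sink.toByteArray();
    }
}
```

With a five-byte buffer the stand-in can take at most two characters per call, so draining a three-character input needs two passes. A real UnicodeCompressor would be driven the same way, with a byte buffer of at least four bytes.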

UNKNOWN: ICU 2.4

reset

public void reset()
Reset the compressor to its initial state.

UNKNOWN: ICU 2.4

Copyright (c) 2006 IBM Corporation and others.