com.ibm.icu.text
Class DictionaryBasedBreakIterator
public class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator_Old
A subclass of RuleBasedBreakIterator_Old that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
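The second phase described above, dividing a run of letters by comparing it against a list of known words, can be sketched as follows. This is only an illustration under stated assumptions: the class name DictionarySegmentSketch, the use of a plain Set of strings as the dictionary, and the longest-match-with-backtracking strategy are all hypothetical and are not taken from the ICU implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the ICU implementation.
final class DictionarySegmentSketch {

    // Returns the break offsets (relative to the start of the run) that split
    // the run into dictionary words, or an empty list if no full split exists.
    static List<Integer> segment(String run, Set<String> dictionary) {
        List<Integer> breaks = new ArrayList<>();
        return split(run, 0, dictionary, breaks) ? breaks : new ArrayList<>();
    }

    // Tries the longest candidate word first and backtracks on dead ends.
    private static boolean split(String run, int start,
                                 Set<String> dictionary, List<Integer> breaks) {
        if (start == run.length()) {
            return true; // the whole run has been covered by known words
        }
        for (int end = run.length(); end > start; end--) {
            if (dictionary.contains(run.substring(start, end))) {
                breaks.add(end);
                if (split(run, end, dictionary, breaks)) {
                    return true;
                }
                breaks.remove(breaks.size() - 1); // undo and try a shorter word
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(
                Arrays.asList("the", "them", "theme", "me", "park"));
        System.out.println(segment("themepark", words)); // prints [5, 9]
    }
}
```

English words stand in for Thai here purely for readability; the point is that the break at offset 5 cannot be found from character classes alone and has to come from the word list.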
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old,
but adds one more special substitution name: _dictionary_. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in _dictionary_, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
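The "two or more characters in a row" condition can be illustrated with a small scan that reports the ranges an iterator would need to revisit. This is a sketch under assumptions: the class DictionaryRunSketch and its methods are hypothetical, and membership in the dictionary character set is approximated by the Unicode Thai block (U+0E00 to U+0E7F) rather than by any real rule description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the ICU implementation.
final class DictionaryRunSketch {

    // Assumption: the dictionary character set is the Unicode Thai block.
    static boolean isDictionaryChar(char c) {
        return c >= '\u0E00' && c <= '\u0E7F';
    }

    // Returns {start, end} pairs (end exclusive) for every run of two or
    // more consecutive dictionary characters in the text.
    static List<int[]> dictionaryRuns(String text) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isDictionaryChar(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && isDictionaryChar(text.charAt(i))) {
                i++;
            }
            if (i - start >= 2) {
                runs.add(new int[] {start, i}); // candidate for dictionary lookup
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Three Thai letters form one run; the single letter later does not.
        String text = "ab \u0E01\u0E02\u0E03 c \u0E01 d";
        for (int[] run : dictionaryRuns(text)) {
            System.out.println(Arrays.toString(run)); // prints [3, 6]
        }
    }
}
```

Only the ranges returned here would be handed to the dictionary; everything else keeps the boundaries found by the state table.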
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary
file. It uses Class.getResource() to locate the dictionary file. The
dictionary file is in a serialized binary format. We have a very primitive (and
slow) BuildDictionaryFile utility for creating dictionary files, but aren't
currently making it public. Contact us for help.
UNKNOWN: ICU 2.0
Nested Class Summary
protected class | DictionaryBasedBreakIterator.Builder
The Builder class for DictionaryBasedBreakIterator inherits almost all of
its functionality from the Builder class for RuleBasedBreakIterator_Old, but
extends it with extra logic to handle the DICTIONARY_VAR token.
Method Summary
int | first()
Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
int | following(int offset)
Sets the current iteration position to the first boundary position after
the specified position.
protected int | handleNext()
This is the implementation function for next().
int | last()
Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
protected int | lookupCategory(char c)
Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder | makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int | preceding(int offset)
Sets the current iteration position to the last boundary position
before the specified position.
int | previous()
Advances the iterator one step backwards.
void | setText(CharacterIterator newText)
void | writeTablesToFile(FileOutputStream file, boolean littleEndian)
public DictionaryBasedBreakIterator(String description, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.
Parameters: description Same as the description parameter on RuleBasedBreakIterator_Old,
except for the special meaning of DICTIONARY_VAR. This parameter is just
passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream The stream containing the dictionary data.
UNKNOWN: ICU 2.0
public int first()
Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
Returns: The offset of the beginning of the text.
UNKNOWN: ICU 2.0
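A typical forward walk starts at first() and advances with next() until it reports DONE. The sketch below uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that it is an acceptable substitute for illustration because it exposes the same first() and next() methods; it does not use a dictionary, and the class name ForwardWalkSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration with the JDK's BreakIterator standing in for the ICU class.
final class ForwardWalkSketch {

    // Collects every boundary offset from the start of the text to the end.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() resets the position to the starting offset and returns it.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world")); // prints [0, 5, 6, 11]
    }
}
```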
public int following(int offset)
Sets the current iteration position to the first boundary position after
the specified position.
Parameters: offset The position to begin searching forward from
Returns: The position of the first boundary after "offset"
UNKNOWN: ICU 2.0
protected int handleNext()
This is the implementation function for next().
public int last()
Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
Returns: The text's past-the-end offset.
UNKNOWN: ICU 2.0
protected int lookupCategory(char c)
Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
This is the same as RuleBasedBreakIterator_Old.Builder, except for the extra code
to handle the DICTIONARY_VAR tag.
public int preceding(int offset)
Sets the current iteration position to the last boundary position
before the specified position.
Parameters: offset The position to begin searching from
Returns: The position of the last boundary before "offset"
UNKNOWN: ICU 2.0
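following() and preceding() are often combined to find the segment that contains a given offset. The sketch below again assumes the JDK's java.text.BreakIterator as a stand-in with the same method names; the helper segmentAt and the class SegmentAtSketch are hypothetical, and the helper is only meant for offsets in the range 0 <= offset < text.length().

```java
import java.text.BreakIterator;
import java.util.Locale;

// Illustration with the JDK's BreakIterator standing in for the ICU class.
final class SegmentAtSketch {

    // Returns the segment containing the character at the given offset.
    static String segmentAt(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int end = it.following(offset);   // first boundary after offset
        if (end == BreakIterator.DONE) {
            return "";                    // no boundary lies after offset
        }
        int start = it.preceding(end);    // last boundary before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(segmentAt("Hello world", 7)); // prints world
        System.out.println(segmentAt("Hello world", 0)); // prints Hello
    }
}
```

Both calls also move the iterator's current position, so a caller that is in the middle of a next() loop should expect the loop position to change.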
public int previous()
Advances the iterator one step backwards.
Returns: The last boundary position before the current iteration
position
UNKNOWN: ICU 2.0
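Backward iteration mirrors the forward case: start at last() and step with previous() until DONE is returned. As a hedged illustration, the sketch below uses the JDK's java.text.BreakIterator, which is assumed here to be a reasonable stand-in because it has the same last() and previous() methods; the class name BackwardWalkSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration with the JDK's BreakIterator standing in for the ICU class.
final class BackwardWalkSketch {

    // Collects every boundary offset from the end of the text to the start.
    static List<Integer> boundariesBackward(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // last() moves to the past-the-end offset and returns it.
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("Hello world")); // prints [11, 6, 5, 0]
    }
}
```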
public void setText(CharacterIterator newText)
public void writeTablesToFile(FileOutputStream file, boolean littleEndian)
Copyright (c) 2006 IBM Corporation and others.