com.ibm.icu.text

Class CharsetMatch

public class CharsetMatch extends Object implements Comparable

This class represents a charset that has been identified by a CharsetDetector as a possible encoding for a set of input data. From an instance of this class, you can ask for the confidence level of the charset identification, or for a Java Reader or String that gives access to the original byte data in Unicode form.

Instances of this class are created only by CharsetDetectors.

Note: this class has a natural ordering that is inconsistent with equals. The natural ordering is based on the match confidence value.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

Field Summary
static int BOM
Bit flag indicating the match is based on the presence of a BOM.
static int DECLARED_ENCODING
Bit flag indicating the match is based on the declared encoding.
static int ENCODING_SCHEME
Bit flag indicating the match is based on the encoding scheme.
static int LANG_STATISTICS
Bit flag indicating the match is based on language statistics.
Method Summary
int compareTo(Object o)
Compare to other CharsetMatch objects.
int getConfidence()
Get an indication of the confidence in the charset detected.
String getLanguage()
Get the ISO code for the language of the detected charset.
int getMatchType()
Return flags indicating what it was about the input data that caused this charset to be considered as a possible match.
String getName()
Get the name of the detected charset.
Reader getReader()
Create a java.io.Reader for reading the Unicode character data corresponding to the original byte data supplied to the Charset detect operation.
String getString()
Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation.
String getString(int maxLength)
Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation.

Field Detail

BOM

public static final int BOM
Bit flag indicating the match is based on the presence of a BOM.

See Also: CharsetMatch

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

DECLARED_ENCODING

public static final int DECLARED_ENCODING
Bit flag indicating the match is based on the declared encoding.

See Also: CharsetMatch

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

ENCODING_SCHEME

public static final int ENCODING_SCHEME
Bit flag indicating the match is based on the encoding scheme.

See Also: CharsetMatch

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

LANG_STATISTICS

public static final int LANG_STATISTICS
Bit flag indicating the match is based on language statistics.

See Also: CharsetMatch

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

Method Detail

compareTo

public int compareTo(Object o)
Compare to other CharsetMatch objects. Comparison is based on the match confidence value, which allows CharsetDetector.detectAll() to order its results.

Parameters: o the CharsetMatch object to compare against.

Returns: a negative integer, zero, or a positive integer as the confidence level of this CharsetMatch is less than, equal to, or greater than that of the argument.

Throws: ClassCastException if the argument is not a CharsetMatch.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.
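The confidence-based ordering described for compareTo can be sketched without the ICU library. The class MatchSketch below is a hypothetical stand-in, not the real CharsetMatch: it keeps only a name and a confidence value, and it uses a generic Comparable rather than the compareTo(Object) signature documented above. It also shows why the ordering is inconsistent with equals: two different charsets with the same confidence compare as equal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for CharsetMatch: only a name and a confidence value.
final class MatchSketch implements Comparable<MatchSketch> {
    final String name;
    final int confidence;

    MatchSketch(String name, int confidence) {
        this.name = name;
        this.confidence = confidence;
    }

    // Orders by confidence only; the charset name plays no part.
    @Override
    public int compareTo(MatchSketch other) {
        return Integer.compare(confidence, other.confidence);
    }

    // Returns a copy of the candidates sorted from highest to lowest confidence.
    static List<MatchSketch> bestFirst(List<MatchSketch> candidates) {
        List<MatchSketch> sorted = new ArrayList<>(candidates);
        sorted.sort(Collections.reverseOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<MatchSketch> candidates = new ArrayList<>();
        candidates.add(new MatchSketch("ISO-8859-1", 40));
        candidates.add(new MatchSketch("UTF-8", 80));
        candidates.add(new MatchSketch("windows-1252", 40));
        for (MatchSketch m : bestFirst(candidates)) {
            System.out.println(m.name + " " + m.confidence);
        }
    }
}
```

Because ties compare as zero, such objects should not be used as keys in a sorted set or map, which would treat equally confident matches as duplicates.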

getConfidence

public int getConfidence()
Get an indication of the confidence in the charset detected. Confidence values range from 0 to 100, with larger numbers indicating a better match of the input data to the characteristics of the charset.

Returns: the confidence in the charset match

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

getLanguage

public String getLanguage()
Get the ISO code for the language of the detected charset.

Returns: The ISO code for the language or null if the language cannot be determined.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

getMatchType

public int getMatchType()
Return flags indicating what it was about the input data that caused this charset to be considered as a possible match. The result is a bitfield containing zero or more of the flags ENCODING_SCHEME, BOM, DECLARED_ENCODING, and LANG_STATISTICS. A result of zero means no information is available.

Note: currently, this method always returns zero.

Returns: the type of match found for this charset.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.
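A bitfield such as the one returned by getMatchType is normally examined with a bitwise AND against each flag. The sketch below illustrates the idea only. The numeric values of the four constants are not given on this page, so the values here are placeholders, and the class name MatchTypeFlagsSketch is hypothetical. Real code would reference CharsetMatch.BOM and the other constants directly.

```java
import java.util.StringJoiner;

final class MatchTypeFlagsSketch {
    // Placeholder values for illustration only; each occupies a distinct bit.
    static final int ENCODING_SCHEME = 1;
    static final int BOM = 2;
    static final int DECLARED_ENCODING = 4;
    static final int LANG_STATISTICS = 8;

    // True if the given flag bit is set in the match type.
    static boolean hasFlag(int matchType, int flag) {
        return (matchType & flag) != 0;
    }

    // Lists the names of the flags that are set, or reports that none are.
    static String describe(int matchType) {
        if (matchType == 0) {
            return "no information available";
        }
        StringJoiner names = new StringJoiner(", ");
        if (hasFlag(matchType, ENCODING_SCHEME)) names.add("ENCODING_SCHEME");
        if (hasFlag(matchType, BOM)) names.add("BOM");
        if (hasFlag(matchType, DECLARED_ENCODING)) names.add("DECLARED_ENCODING");
        if (hasFlag(matchType, LANG_STATISTICS)) names.add("LANG_STATISTICS");
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(BOM | LANG_STATISTICS));
    }
}
```

Since the method is documented as currently always returning zero, callers should treat the zero case as the normal path rather than as an error.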

getName

public String getName()
Get the name of the detected charset. The name will be one that can be used with other APIs on the platform that accept charset names. It is the "Canonical name" as defined by the class java.nio.charset.Charset; for charsets that are registered with the IANA charset registry, this is the MIME-preferred registered name.

Returns: The name of the charset.

See Also: java.nio.charset.Charset java.io.InputStreamReader

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.
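A charset name of this kind can be handed to standard JDK APIs. The following is a minimal sketch, assuming the name has already been obtained (for example from getName()); the class CharsetNameSketch and its decode helper are hypothetical. It checks Charset.isSupported first, because a given JVM may not ship a converter for every charset.

```java
import java.nio.charset.Charset;

final class CharsetNameSketch {
    // Decodes the bytes using the charset with the given name.
    // Throws IllegalArgumentException if this JVM has no such charset.
    static String decode(byte[] data, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName);
        }
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9
        System.out.println(decode(data, "UTF-8").length());      // one character
        System.out.println(decode(data, "ISO-8859-1").length()); // two characters
    }
}
```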

getReader

public Reader getReader()
Create a java.io.Reader for reading the Unicode character data corresponding to the original byte data supplied to the Charset detect operation.

CAUTION: if the source of the byte data was an InputStream, a Reader can be created for only one matching charset using this method. If more than one charset needs to be tried, the caller will need to reset the InputStream and create InputStreamReaders itself, based on the charset name.

Returns: the Reader for the Unicode character data.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.
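The workaround in the caution above, resetting the stream and building an InputStreamReader per candidate charset, might look like the sketch below. It uses only JDK classes; the class RetryDecodeSketch, the readAll helper, and the readLimit parameter are hypothetical names. It assumes the stream supports mark/reset and that the data fits within readLimit bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

final class RetryDecodeSketch {
    // Decodes the whole stream with the named charset, then rewinds the stream
    // to where it started so that another charset can be tried afterwards.
    static String readAll(InputStream in, String charsetName, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(readLimit);
        // The reader is deliberately not closed: closing it would close the stream.
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[256];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        in.reset();
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {(byte) 0xC3, (byte) 0xA9});
        // The same stream is decoded twice, once per candidate charset name.
        System.out.println(readAll(in, "UTF-8", 16).length());
        System.out.println(readAll(in, "ISO-8859-1", 16).length());
    }
}
```

A stream that does not support mark/reset can be wrapped in a java.io.BufferedInputStream first, provided the mark limit is at least as large as the amount of data that will be read.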

getString

public String getString()
Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation.

Returns: a String created from the converted input data.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.

getString

public String getString(int maxLength)
Create a Java String from Unicode character data corresponding to the original byte data supplied to the Charset detect operation. The length of the returned string is limited to the specified size; the string will be truncated to this length if necessary. A limit value of zero or less is ignored, and treated as no limit.

Parameters: maxLength The maximum length of the String to be created when the source of the data is an input stream, or -1 for unlimited length.

Returns: a String created from the converted input data.

UNKNOWN: ICU 3.4 This API might change or be removed in a future release.
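The limit semantics described for getString(int maxLength), where a value of zero or less means no limit, can be illustrated with a plain java.io.Reader. This is a sketch of the documented behavior only, not the library's implementation; the class LimitedStringSketch and the readLimited helper are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class LimitedStringSketch {
    // Reads at most maxLength characters from the reader.
    // A maxLength of zero or less is treated as "no limit".
    static String readLimited(Reader reader, int maxLength) throws IOException {
        int remaining = maxLength <= 0 ? Integer.MAX_VALUE : maxLength;
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        while (remaining > 0) {
            int count = reader.read(buffer, 0, Math.min(buffer.length, remaining));
            if (count == -1) {
                break; // end of input reached before the limit
            }
            text.append(buffer, 0, count);
            remaining -= count;
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLimited(new StringReader("hello world"), 5));
        System.out.println(readLimited(new StringReader("hello world"), -1));
    }
}
```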

Copyright (c) 2006 IBM Corporation and others.