public class FieldCleaner
This class provides utility methods for cleaning text fields in a set of
DataRecords. It can be used to correct spelling errors or other
discrepancies among the values in those records. The methods use the
Levenshtein distance (edit distance) algorithm to measure how far apart two
words are. This class operates on DataType.TEXT and DataType.ENUM fields
within a set of DataRecords.
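The Levenshtein distance mentioned above counts the single-character
insertions, deletions, and substitutions needed to turn one word into
another. The following is a minimal sketch of the standard dynamic-programming
formulation; the class name LevenshteinSketch is hypothetical and is not part
of FieldCleaner, whose internal implementation is not shown in this
documentation.

```java
// Illustrative sketch only: LevenshteinSketch is a hypothetical name,
// not part of the FieldCleaner API.
final class LevenshteinSketch {

    /**
     * Returns the number of single-character insertions, deletions, or
     * substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        // prev holds the distances for the previous row of the DP table,
        // curr the row currently being filled in.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, "color" and "colour" are at distance 1, so a misspelling of
that kind would fall well inside the default maximum distance of 3.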
public void cleanField(java.util.Collection<DataRecord> records,
Clean the values in a particular field within a collection of
DataRecords. This method scans the given Text or Enum field and
identifies every distinct value within it. Any value that occurs fewer
times than the minimum count is replaced or removed. If the guess flag is
set, this method attempts to find a suitable replacement for the value
and also normalizes the case of the values. Otherwise the value is
removed.
records - the collection of records to clean
fieldName - the field name
fieldType - the field type. If null, the type is determined from
the set of records
guess - whether to guess new values for those values which occur
less than the minimum count
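The counting step described for cleanField can be sketched as follows. This
is an illustration under stated assumptions, not the class's actual code: the
field values are modeled as a plain collection of strings rather than
DataRecords, the minimum count is passed as a parameter (the documentation
does not say how it is configured), and the names RareValueSketch,
countValues, and rareValues are hypothetical.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical stand-ins.
final class RareValueSketch {

    /** Counts how many times each non-null value occurs. */
    static Map<String, Integer> countValues(Collection<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Returns the values that occur fewer than minCount times. These are
     * the candidates that would be replaced (guess set) or removed.
     */
    static Set<String> rareValues(Collection<String> values, int minCount) {
        Set<String> rare = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : countValues(values).entrySet()) {
            if (entry.getValue() < minCount) {
                rare.add(entry.getKey());
            }
        }
        return rare;
    }
}
```

The values that are not rare form the natural pool of "good" values against
which a replacement could then be guessed.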
public java.lang.String guessValue(java.lang.String value,
Find the best match from a set of strings for a particular value. The
guess will be within the maximum word distance. If there is no word
within the maximum distance, this method returns null.
value - The string value to match
goodValues - the set of values to match against
the best match, or null if no match is within the maximum word distance
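A best-match search of this kind can be sketched as below. The sketch makes
several assumptions that the documentation does not confirm: the distance
limit is inclusive, ties go to the first candidate encountered, and the limit
is passed as a parameter instead of being read from setMaxDistance. The class
GuessValueSketch and its methods are hypothetical and are not the FieldCleaner
implementation.

```java
import java.util.Collection;

// Illustrative sketch only: GuessValueSketch is a hypothetical name.
final class GuessValueSketch {

    /**
     * Returns the candidate closest to value, or null when no candidate
     * is within maxDistance (inclusive limit is an assumption).
     */
    static String guess(String value, Collection<String> goodValues, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : goodValues) {
            int d = distance(value, candidate);
            // Strict '<' keeps the first candidate on ties (an assumption).
            if (d <= maxDistance && d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    /** Standard two-row dynamic-programming Levenshtein distance. */
    private static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, guessing "aple" against the candidates "apple" and "orange" with
a limit of 3 yields "apple", while a value far from every candidate yields
null.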
Replace all occurrences of a particular value within a particular field.
This method checks each record's value for the given field and replaces
every occurrence of the old value with the new value. If the fieldType is
null, the method determines the field type automatically.
oldValue - the old value to replace
newValue - the new value
records - the collection of DataRecords
fieldName - the field where the value occurs
fieldType - the type of field
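The replacement pass can be pictured with the sketch below. Because the
DataRecord API is not shown here, each record is modeled as a
Map<String, String> from field name to value, which is purely an assumption.
The class ReplaceValueSketch is hypothetical, and the returned count exists
only to make the effect observable in this example.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: records are modeled as maps, not DataRecords.
final class ReplaceValueSketch {

    /**
     * Replaces oldValue with newValue in the named field of every record
     * and returns the number of records that were changed.
     */
    static int replaceAll(Collection<Map<String, String>> records,
                          String fieldName, String oldValue, String newValue) {
        int replaced = 0;
        for (Map<String, String> record : records) {
            // Only touch records that actually carry the field.
            if (record.containsKey(fieldName)
                    && Objects.equals(record.get(fieldName), oldValue)) {
                record.put(fieldName, newValue);
                replaced++;
            }
        }
        return replaced;
    }
}
```

Records that lack the field, or that hold the old value in a different field,
are left unchanged.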
public void setMaxDistance(int distance)
Set the maximum distance for word replacement. An erroneous word can be
replaced by a valid word only if the valid word is within this distance
of the erroneous word. The default maximum distance is 3.