public class FieldCleaner
extends java.lang.Object
Constructor and Description |
---|
FieldCleaner() |
Modifier and Type | Method and Description |
---|---|
void | cleanField(java.util.Collection<DataRecord> records, java.lang.String fieldName, DataType fieldType, boolean guess): Clean the values in a particular field within a collection of DataRecords. |
java.lang.String | guessValue(java.lang.String value, java.util.HashSet<java.lang.String> goodValues): Find the best match from a set of strings for a particular value. |
void | replaceValue(java.lang.String oldValue, java.lang.String newValue, java.util.Collection<DataRecord> records, java.lang.String fieldName, DataType fieldType): Replace all occurrences of a particular value within a particular field. |
void | setMaxDistance(int distance): Set the maximum distance for word replacement. |
void | setMinCount(int count): Set the minimum count for a field value. |

public void cleanField(java.util.Collection<DataRecord> records, java.lang.String fieldName, DataType fieldType, boolean guess)

Clean the values in a particular field within a collection of DataRecords. If the guess flag is set, this method attempts to find a good value to replace each value that occurs fewer times than the minimum count. Otherwise the value is removed. The case of the values is also normalized if the guess flag is set.

Parameters:
records - the collection of records to clean
fieldName - the field name
fieldType - the field type. If null, the type is determined from the set of records
guess - whether to guess new values for those values which occur fewer times than the minimum count

public java.lang.String guessValue(java.lang.String value, java.util.HashSet<java.lang.String> goodValues)

Find the best match from a set of strings for a particular value.

Parameters:
value - the string value to match
goodValues - the set of values to match against

public void replaceValue(java.lang.String oldValue, java.lang.String newValue, java.util.Collection<DataRecord> records, java.lang.String fieldName, DataType fieldType)

Replace all occurrences of a particular value within a particular field.

Parameters:
oldValue - the old value to replace
newValue - the new value
records - the collection of DataRecords
fieldName - the field where the value occurs
fieldType - the type of field

public void setMaxDistance(int distance)

Set the maximum distance for word replacement.

Parameters:
distance - the new maximum distance

public void setMinCount(int count)

Set the minimum count for a field value.

Parameters:
count - the minimum count
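
The following is a minimal, self-contained sketch of the kind of cleaning that cleanField, guessValue, setMinCount and setMaxDistance describe. It is not the FieldCleaner implementation. It works on plain strings rather than DataRecord objects, and it assumes that "distance" means Levenshtein edit distance and that matching ignores case, neither of which this page confirms. The class RareValueCleanerSketch and its methods editDistance, guess and clean are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RareValueCleanerSketch {

    // Levenshtein edit distance. ASSUMPTION: the metric FieldCleaner uses is not documented here.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest good value within maxDistance (ignoring case), or null if none qualifies.
    static String guess(String value, Set<String> goodValues, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : goodValues) {
            int d = editDistance(value.toLowerCase(), candidate.toLowerCase());
            if (d <= maxDistance && d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    // Values seen fewer than minCount times are replaced by a guess, or by null when
    // guessing is disabled or no candidate is close enough (null stands in for "removed").
    static List<String> clean(List<String> values, int minCount, int maxDistance, boolean guess) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        Set<String> good = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                good.add(e.getKey());
            }
        }
        List<String> cleaned = new ArrayList<>();
        for (String v : values) {
            if (good.contains(v)) {
                cleaned.add(v);
            } else {
                cleaned.add(guess ? guess(v, good, maxDistance) : null);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList(
                "red", "red", "red", "blue", "blue", "rde", "Blue", "chartreuse");
        // prints [red, red, red, blue, blue, red, blue, null]
        System.out.println(clean(colors, 2, 2, true));
        // prints [red, red, red, blue, blue, null, null, null]
        System.out.println(clean(colors, 2, 2, false));
    }
}
```

In this sketch the minimum count decides which values are trusted, and the maximum distance limits how far a rare value may be from a trusted one before it is dropped instead of replaced. When two trusted values are equally close, the choice depends on set iteration order, so a real implementation would need an explicit tie-breaking rule.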