public abstract class EntityExtractor extends java.lang.Object implements RecordReasoner
Abstract class for entity extractors. This class provides basic utilities for extracting entities, including typed entities, from readers, files, or strings. It also parses DocData and DataRecord elements for entities of interest.
Each of the extract methods returns a Map whose keys are the extracted entities. If the entity extractor supports typing of the entities, the values of the map identify the type of each extracted entity.
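As a rough illustration of the map contract described above, the sketch below uses a hypothetical stand-in class, ToyExtractor, rather than the real EntityExtractor. The real methods return Map<String,EntityTag>; because the EntityTag API is not shown here, plain strings stand in for the tags. The capitalized-word rule and the "PROPER" type name are assumptions made purely for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an entity extractor; not the library's class.
class ToyExtractor {

    // Toy rule (assumption): any capitalized word is an entity of type "PROPER".
    static void extractEntities(String text, Map<String, String> tokenMap) {
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                tokenMap.put(word, "PROPER");
            }
        }
    }

    // Entities are the keys; the type of each entity is the value.
    static Map<String, String> extract(String text) {
        Map<String, String> tokenMap = new LinkedHashMap<>();
        extractEntities(text, tokenMap);
        return tokenMap;
    }

    // Reader variant: load the same kind of map through successive line reads.
    static Map<String, String> extract(BufferedReader reader) throws IOException {
        Map<String, String> tokenMap = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            extractEntities(line, tokenMap);
        }
        return tokenMap;
    }

    public static void main(String[] args) throws IOException {
        // Iterate over the result: key = entity, value = its type.
        Map<String, String> entities = extract("Alice met Bob in Paris.");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        BufferedReader reader = new BufferedReader(new StringReader("Carol\nvisited Rome"));
        System.out.println(extract(reader).keySet());
    }
}
```

An extractor that does not support typing could store a single placeholder value for every key; callers that only need the entities can then work with keySet() alone.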
Constructor and Description |
---|
EntityExtractor() |
Modifier and Type | Method and Description |
---|---|
void | addTagFieldMapping(EntityType tag, java.lang.String field): Associate a tag generated by the entity extractor with a particular field in a DataRecord. |
void | addTextField(java.lang.String field): Add a field to the set of text fields queried for entity extraction. |
boolean | augmentRecord(DataRecord rec): Augment the data record with information from the reasoner. |
java.util.Map<java.lang.String,EntityTag> | extract(java.io.BufferedReader reader): Extract the mapping of entities from a reader. |
java.util.Map<java.lang.String,EntityTag> | extract(DocData data, boolean storeTokens): Extract the mapping of entities from a DocData object. |
java.util.Map<java.lang.String,EntityTag> | extract(java.io.File file): Extract the mapping of entities from a file. |
java.util.Map<java.lang.String,EntityTag> | extract(java.lang.String text): Extract the mapping of entities from a string of text. |
protected abstract void | extractEntities(java.lang.String text, java.util.Map<java.lang.String,EntityTag> tokenMap): Extract the entities, loading the tokenMap provided. |
abstract java.util.Map<java.lang.String,java.lang.String> | extractEntityRawTypes(java.lang.String text): Retrieve the entities and the raw types for this entity extractor. |
java.lang.String | getDefaultTagField(): Get the default tag field used to store all extracted entities in the data record. |
java.lang.String | getModel(): Get the model string. |
boolean | isDebug(): Check whether the debug flag is set for this extractor. |
void | setDebug(boolean flag): Set the debug flag. |
void | setDefaultTagField(java.lang.String field): Set the default tag field used to store the extracted entities in the data record. |
void | setModel(java.lang.String model): Set the model for this entity extractor. |

public void addTagFieldMapping(EntityType tag, java.lang.String field)
Associate a tag generated by the entity extractor with a particular field in a DataRecord. When the augmentRecord method is called, any entities matched to the tag are added to the specified field. If the field is null, the entities are added to the default tag field. If the default tag field is also null, the entities are mapped to an internal field in the DataRecord.
Parameters:
tag - a tag generated by the entity extractor
field - a field within a DataRecord which will contain the extracted entities

public void addTextField(java.lang.String field)
Add a field to the set of text fields queried for entity extraction. When the augmentRecord method is called, the text in all text fields is used in the entity extraction process.
Parameters:
field - the text field name in the DataRecord

public boolean augmentRecord(DataRecord rec)
Augment the data record with information from the reasoner. This method extracts entities from the text fields registered through the addTextField method. It then adds the extracted entities to the data record as text values. If the tagFieldMap has been set, it uses the mapping to decide which field each token is added to. Otherwise, if the tagField has been set, this method places all tokens in the tagField. Otherwise, it stores the entities as internal values.
Specified by:
augmentRecord in interface RecordReasoner
Parameters:
rec - the data record to augment

public java.util.Map<java.lang.String,EntityTag> extract(java.io.BufferedReader reader) throws java.io.IOException
Extract the mapping of entities from a reader. The text is processed using the extract(text, tokenMap) method. The map is loaded through successive reads of the stream until the stream is empty.
Parameters:
reader - the reader for reading the textual data
Throws:
java.io.IOException - if there is a problem reading from the reader

public java.util.Map<java.lang.String,EntityTag> extract(DocData data, boolean storeTokens)
Extract the mapping of entities from a DocData object. The text is processed using the extract(String text) method. If the storeTokens flag is set, all extracted entities are added to the set of keys for the DocData object.
Parameters:
data - the DocData object
storeTokens - whether the method should store the entities as keys for the DocData object

public java.util.Map<java.lang.String,EntityTag> extract(java.io.File file) throws java.io.IOException
Extract the mapping of entities from a file. The file is read using the extract(BufferedReader reader) method. This method does not distinguish any special format for the text file, treating all terms as equal.
Parameters:
file - the file containing the text
Throws:
java.io.IOException - if there is a problem reading from the file

public java.util.Map<java.lang.String,EntityTag> extract(java.lang.String text)
Extract the mapping of entities from a string of text.
Parameters:
text - the body of text

protected abstract void extractEntities(java.lang.String text, java.util.Map<java.lang.String,EntityTag> tokenMap)
Extract the entities, loading the tokenMap provided.
Parameters:
text - the text to extract entities from
tokenMap - the map in which to store the tokens

public abstract java.util.Map<java.lang.String,java.lang.String> extractEntityRawTypes(java.lang.String text)
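Retrieve the entities and the raw types for this entity extractor.
Parameters:
text - the text to parse for entities

public java.lang.String getDefaultTagField()
Get the default tag field used to store all extracted entities in the data record.

public java.lang.String getModel()
Get the model string.

public boolean isDebug()
Check whether the debug flag is set for this extractor.

public void setDebug(boolean flag)
Set the debug flag.
Parameters:
flag - the debug flag

public void setDefaultTagField(java.lang.String field)
Set the default tag field used to store the extracted entities in the data record.
Parameters:
field - the tag field

public void setModel(java.lang.String model)
Set the model for this entity extractor.
Parameters:
model - the model for the entity extractor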
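
The field-resolution order described for addTagFieldMapping and augmentRecord (the tag mapping first, then the default tag field, then an internal field) can be sketched roughly as follows. This is an illustration only, not the library's implementation: FieldResolver, resolveField, and the INTERNAL_FIELD name are hypothetical, plain strings stand in for EntityType (whose API is not shown here), and treating an unmapped tag the same way as a tag mapped to null is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fallback order; not the library's implementation.
class FieldResolver {
    // Placeholder name for the internal field (assumption).
    static final String INTERNAL_FIELD = "_internal";

    private final Map<String, String> tagFieldMap = new HashMap<>();
    private String defaultTagField; // stays null until explicitly set

    void addTagFieldMapping(String tag, String field) {
        tagFieldMap.put(tag, field); // field may be null
    }

    void setDefaultTagField(String field) {
        defaultTagField = field;
    }

    // Resolution order: 1) mapped field, 2) default tag field, 3) internal field.
    // Treating an unmapped tag like a null mapping is an assumption.
    String resolveField(String tag) {
        String field = tagFieldMap.get(tag);
        if (field != null) {
            return field;
        }
        if (defaultTagField != null) {
            return defaultTagField;
        }
        return INTERNAL_FIELD;
    }

    public static void main(String[] args) {
        FieldResolver resolver = new FieldResolver();
        resolver.addTagFieldMapping("PERSON", "people");
        resolver.addTagFieldMapping("LOCATION", null);
        System.out.println(resolver.resolveField("PERSON"));
        System.out.println(resolver.resolveField("LOCATION"));
        resolver.setDefaultTagField("entities");
        System.out.println(resolver.resolveField("LOCATION"));
    }
}
```

Keeping the fallback in a single method means every entity goes through the same three-step decision, regardless of which extract variant produced it.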