Prajna provides interfaces and support classes for entity extraction and verification. Many different entity extraction engines are available as Java packages, web services, or commercial applications. Therefore, Prajna does not attempt to implement its own entity extraction models or algorithms. Instead, it provides a framework that lets an application use any other entity extraction engine.
Different entity extraction engines may separate any discovered entities into different types. Sometimes these types vary widely, making it difficult to compare one entity extraction result to another. To address this issue, the Prajna entity extraction components reduce the different entity types to a set of basic entity types, which are defined in the EntityType enumeration:
Most of these should be fairly obvious. UNKNOWN may be used when an entity extraction engine believes it has discovered an entity but cannot determine its type. REJECTED indicates that the string should NOT be used as an entity. QUANTITY indicates a numeric quantity, including currency or measurement. DATE_TIME indicates a particular day or date, such as "Thursday" or "July 4, 1776". A particular entity extraction engine does not need to identify all of these types.
EntityTags are more complex objects, and represent both the entity type and other information about the particular entity. They will also include the original label that an extraction engine identified for the entity, and any alternate ways of expressing the same entity (also known as co-referencing).
The EntityExtractor abstract class provides a convenient framework for integrating with an entity extraction engine or utility. The engine may be a Java library, an external web service, or a commercial application with some form of API. Prajna can interact with any of these types of systems through a suitable EntityExtractor implementation.
The EntityExtractor abstract class defines three methods which must be implemented. The first, extractEntities(String text, Map<String, EntityTag> tokenMap), should parse the text to identify the entities and place them in the token map. The second method, extractEntitiesRawTypes(String text), extracts the entities in an implementation-dependent way. It returns a map with the entity strings mapped to their raw types, as defined by the particular entity extractor. The difference between the two is that extractEntities() produces entity tags with normalized entity types, whereas extractEntitiesRawTypes() does not normalize, so the types returned in its map may be any value. One common way to implement extractEntities() is to call extractEntitiesRawTypes() and then map the raw entity types to the normalized types.
To understand the difference, consider a text which includes both "Barack Obama" and "President Obama". A particular entity extractor might identify Barack Obama with the label "President". This would be the raw type for the term. Since presidents are persons, the EntityTag for Barack Obama would have an EntityType of PERSON, with President set as one of the raw types. Assuming the engine identified President Obama as the same entity, that term would be included as an alternate label for the entity. All of this information would be contained in the EntityTag.
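The relationship between the two extraction methods can be illustrated with a minimal, self-contained sketch. It does not use the real Prajna classes: SketchEntityType, SketchEntityTag, and ToyExtractor are hypothetical stand-ins, the field names are assumptions, and the "engine" is a hard-coded string match. Only the method names, the PERSON and UNKNOWN types, and the Obama example come from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Prajna's EntityType (only two values shown).
enum SketchEntityType { PERSON, UNKNOWN }

// Hypothetical stand-in for Prajna's EntityTag; field names are assumptions.
class SketchEntityTag {
    final String label;
    final SketchEntityType type;
    final List<String> rawTypes = new ArrayList<>();
    final List<String> alternateLabels = new ArrayList<>();

    SketchEntityTag(String label, SketchEntityType type) {
        this.label = label;
        this.type = type;
    }
}

// Toy extractor: not a real engine, just enough to show the two methods.
class ToyExtractor {
    // Assumed mapping from engine-specific raw types to normalized types.
    private static final Map<String, SketchEntityType> RAW_TO_NORMALIZED = new HashMap<>();
    static {
        RAW_TO_NORMALIZED.put("President", SketchEntityType.PERSON);
    }

    // Returns entity strings mapped to raw, engine-specific type labels.
    Map<String, String> extractEntitiesRawTypes(String text) {
        Map<String, String> raw = new LinkedHashMap<>();
        if (text.contains("Barack Obama")) {
            raw.put("Barack Obama", "President");
        }
        return raw;
    }

    // Calls the raw extraction, then normalizes each raw type.
    void extractEntities(String text, Map<String, SketchEntityTag> tokenMap) {
        for (Map.Entry<String, String> entry : extractEntitiesRawTypes(text).entrySet()) {
            String label = entry.getKey();
            String rawType = entry.getValue();
            SketchEntityType type =
                    RAW_TO_NORMALIZED.getOrDefault(rawType, SketchEntityType.UNKNOWN);
            SketchEntityTag tag = new SketchEntityTag(label, type);
            tag.rawTypes.add(rawType);
            // Toy co-reference: a real engine would decide this itself.
            if (label.equals("Barack Obama") && text.contains("President Obama")) {
                tag.alternateLabels.add("President Obama");
            }
            tokenMap.put(label, tag);
        }
    }
}
```

Raw types that the mapping does not know fall back to UNKNOWN in this sketch. An actual implementation might instead reject such entities or log them for review.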
The final method for an EntityExtractor is setModel(String model). The model string passed into this method can be a path to lexical models, a license key, or a URL for a web service. The actual interpretation of the model string is implementation-dependent, and should be described in the extractor's documentation.
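As one possible interpretation of the model string, the sketch below treats it as the URL of an extraction web service and rejects anything else. WebServiceExtractorSketch and getEndpoint() are hypothetical names used for illustration only, and the class does not extend the real EntityExtractor. An implementation that expects a file path or a license key would validate the string differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical extractor fragment showing one reading of the model string.
class WebServiceExtractorSketch {
    private URI endpoint;

    /**
     * Interprets the model string as the http(s) URL of a web service.
     * This choice is an assumption of the sketch, not a Prajna rule.
     */
    void setModel(String model) {
        URI uri;
        try {
            uri = new URI(model);
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException("Model is not a valid URL: " + model, ex);
        }
        String scheme = uri.getScheme();
        boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        if (!isHttp) {
            throw new IllegalArgumentException("Model must be an http(s) URL: " + model);
        }
        endpoint = uri;
    }

    // Returns the endpoint set by setModel, or null if none was set.
    URI getEndpoint() {
        return endpoint;
    }
}
```

Failing fast in setModel keeps a misconfigured model string from surfacing later as an obscure extraction error.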
The entity verifiers provide an interface to utilities that can confirm the correctness of a particular entity and its type. For instance, while an entity extractor may guess that "George Washington" is a person, the entity verifier would positively validate the extractor's discovery. A verifier would typically consult an external information source, such as a custom database, a list of terms, or an authoritative website, in an attempt to validate the entity extractor's assertion.
In addition, a verifier should provide a canonical normalized form for the entity. For instance, "President Obama" should normalize to "Barack Obama". This normalization should be consistent, so that even if terms are discovered in a different order in different documents, the normalized form will be the same (e.g., don't use a "first discovered" approach).
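A minimal sketch of order-independent normalization follows, assuming the verifier is backed by a fixed alias table. LookupVerifierSketch, addAlias, and verify are hypothetical names and do not reflect the actual Prajna verifier interface. A real verifier would more likely consult a database or an authoritative website.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical verifier backed by a fixed table of known aliases.
class LookupVerifierSketch {
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    // Registers an alias (case-insensitive) for a canonical entity name.
    void addAlias(String alias, String canonical) {
        canonicalByAlias.put(alias.toLowerCase(Locale.ROOT), canonical);
    }

    // Returns the canonical form if the term is known, otherwise empty.
    Optional<String> verify(String term) {
        return Optional.ofNullable(canonicalByAlias.get(term.toLowerCase(Locale.ROOT)));
    }
}
```

Because the canonical form comes from the table and not from whichever term was seen first, the result is the same regardless of the order in which documents are processed.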