Information comes from many sources and in many formats. Prajna
provides readers for a number of different data sources. Prajna also
supports applications which format data in a variety of ways.
Characteristics of Data Sources
The behavior and characteristics of all applications depend on
the characteristics of the underlying data. A source of data can be
characterized in several ways.
- Size: Is the body of data a small dataset which can be
easily retained in memory? Is it a moderate size, where representing a
significant fraction of the data provides a representative sampling? Or
is it a tremendous data repository with millions or billions of records
which require more sophisticated filtering and navigation?
- Mutability: How frequently does the data change? Is the
body of data static, changing only occasionally? Does it
receive frequent updates that an application needs to check for
periodically? Or is the data a continuous stream which requires continuous
monitoring?
- Atomic Objects: What are the atomic objects represented
by the data? What does the data represent? A particular body of data
may represent multiple objects depending on the context, but a
particular application will need to differentiate these objects.
- Object Structure: Are the objects structured, such as
from a relational database? Or are they totally unstructured, such as
text documents? If the object is structured, what are the fields and
their data types?
- Implied Knowledge: Are all of the fields of an object
important to understanding it? What fields are useful for
comprehension, and what fields are present simply for developer
convenience? Are the auxiliary data elements, such as file location or
timestamp on a particular data object, important? Do the objects
include implicit knowledge? For instance, what are the units of measure
for any measurements?
- Data Structures: Does the data imply or define any data
structures or relationships between the individual records, such as a
graph, tree, or grid?
- Format: What format is the data stored in? Is it stored
in an SQL database? A collection of XML data files? Something else? How
does the data need to be accessed?
- Data Fusion: Is there only a single data source? Or are
there multiple data sources which are referenced? If there is more
than one data source, how do the records from each data source relate
to one another? Do the records need to be fused? Are there common
identifiers between the data sources?
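
The Data Fusion questions above can be made concrete with a small
sketch. The example below is not part of Prajna and does not use any
Prajna class; RecordFusion and fuseById are hypothetical names chosen
for illustration. It assumes that each record is a simple map of field
names to string values and that both data sources share a common
identifier field. Records from the first source are kept, and fields
from a matching record in the second source are added to them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Prajna: fuses records from two data
// sources that share a common identifier field.
class RecordFusion {

    /**
     * Returns one fused record per primary record. Fields from a secondary
     * record with the same identifier are added unless the primary record
     * already defines them.
     */
    static List<Map<String, String>> fuseById(
            List<Map<String, String>> primary,
            List<Map<String, String>> secondary,
            String idField) {
        // Index the secondary source by identifier for fast lookup.
        Map<String, Map<String, String>> byId = new HashMap<>();
        for (Map<String, String> record : secondary) {
            String id = record.get(idField);
            if (id != null) {
                byId.putIfAbsent(id, record); // keep the first record seen per id
            }
        }

        List<Map<String, String>> fused = new ArrayList<>();
        for (Map<String, String> record : primary) {
            Map<String, String> merged = new LinkedHashMap<>(record);
            Map<String, String> match = byId.get(record.get(idField));
            if (match != null) {
                // Primary values win when both sources define a field.
                match.forEach(merged::putIfAbsent);
            }
            fused.add(merged);
        }
        return fused;
    }

    public static void main(String[] args) {
        List<Map<String, String>> people = List.of(
                Map.of("id", "p1", "name", "Ada"),
                Map.of("id", "p2", "name", "Grace"));
        List<Map<String, String>> offices = List.of(
                Map.of("id", "p1", "office", "London"));
        // The record for p1 gains an "office" field; p2 is left unchanged.
        System.out.println(fuseById(people, offices, "id"));
    }
}
```

Letting the first source win on conflicting fields is only one possible
policy. An application whose sources disagree may need to keep both
values or to prefer the more recently updated record.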
Prajna provides a number of different utilities for accessing
data from a variety of sources. Depending on the characteristics of the
data source, a developer may wish to use a DataAccessor to extract
various data structures, or a DocCorpus for accessing documents in an
unstructured corpus. A developer can also use a FormatReader to read
data stored in a particular format.