Any accessor that extends AbstractDataAccessor may use an XML configuration file to initialize its settings. The configuration follows a schema that defines the initialization settings, how the various source fields should be interpreted, the data templates for mapping source data into data objects, and how the graph, grid, tree, and dataset structures should be composed.
There are several example configuration files in the config directory. The elements within the files are described below:
<dataConfig>
  <init-param>
    <param-name>queryUrl</param-name>
    <param-value>http://mysolr.server.com/solr3/</param-value>
  </init-param>
The first elements of the file are initialization parameters. These follow the same style and syntax as the initialization parameters of a web.xml file. The name-value pairs are passed as a hashmap of Strings to the Accessor's setInitParameters() method.
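As a rough illustration of the name-value collection described above, the following Java sketch reads init-param elements into a map of Strings with the JDK's DOM parser. The class InitParamSketch and its readInitParams method are hypothetical names for this example only; how an accessor parses the file internally is not covered here.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the library.
class InitParamSketch {

    /** Collects each init-param element into a name-to-value map. */
    static Map<String, String> readInitParams(String xml) {
        Map<String, String> params = new HashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("init-param");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element param = (Element) nodes.item(i);
                params.put(textOf(param, "param-name"), textOf(param, "param-value"));
            }
        } catch (Exception e) {
            // Malformed XML or a missing child element ends up here.
            throw new IllegalArgumentException("Could not read configuration", e);
        }
        return params;
    }

    /** Returns the trimmed text of the first child element with the given tag. */
    private static String textOf(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<dataConfig><init-param>"
                + "<param-name>queryUrl</param-name>"
                + "<param-value>http://mysolr.server.com/solr3/</param-value>"
                + "</init-param></dataConfig>";
        System.out.println(readInitParams(xml));
    }
}
```

A map built this way has the same shape as the one the text says is handed to setInitParameters().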
<dataTemplate name="employee" nameKey="Full Name">
The dataTemplate tag defines a particular data template. Each data template describes one way that the data source might represent data. The nameKey identifies which source field is used as the unique identifier for each record of data.
  <fieldDesc fieldName="Supervisor" fieldType="text" />
  <fieldDesc fieldName="Department" fieldType="enum" sourceField="Dept.">
    <values>
      <value>Accounting</value>
      <value>Software</value>
      <value>Marketing</value>
      <value>Operations</value>
      <value>Corporate</value>
    </values>
  </fieldDesc>
  <fieldDesc fieldName="Birth Place" fieldType="location" />
  <fieldDesc fieldName="Id" fieldType="text" sourceField="SSN" />
  <fieldDesc fieldName="Projects" fieldType="text" sourceField="Project1,Project2,Project3" multiValue="true"/>
  <fieldDesc fieldName="Birth Date" fieldType="time" sourceField="Birth_Date">
    <values>
      <format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</format>
    </values>
  </fieldDesc>
  <fieldDesc fieldName="When Employed" fieldType="time" sourceStartField="Start_Date" sourceStopField="End_Date">
    <values>
      <format>yyyy-MM-dd</format>
    </values>
  </fieldDesc>
  <fieldDesc fieldName="Salary" fieldType="measure">
    <values>
      <type>currency</type>
    </values>
  </fieldDesc>
  <fieldDesc fieldName="Phone Extension" fieldType="int" sourceField="Extension">
    <transform>-x</transform>
  </fieldDesc>
</dataTemplate>
The field descriptors define how the various fields are mapped into internal fields. The fieldName determines how the field will be referred to. For SemanticAccessors, this is the field name within the DataRecord. The sourceField determines which field within the source data is read for the field. If the sourceField is omitted, the accessor uses the fieldName as the sourceField. The sourceField can reference one or more source fields, separated by commas. The multiValue attribute indicates whether the field can include more than one value; if omitted, the field is a single-value field.
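The fallback and comma-separated rules for sourceField can be sketched in Java as follows. This is a minimal illustration under stated assumptions: SourceFieldSketch, sourceFields, and collectValues are hypothetical names, a source record is modeled as a plain Map, and missing source fields are simply skipped, which the text does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library.
class SourceFieldSketch {

    /** Resolves the source field names; fieldName is used when sourceField is omitted. */
    static List<String> sourceFields(String fieldName, String sourceField) {
        String spec = (sourceField == null || sourceField.isEmpty()) ? fieldName : sourceField;
        List<String> names = new ArrayList<>();
        for (String name : spec.split(",")) {
            names.add(name.trim());
        }
        return names;
    }

    /** Gathers the values found under each source field of one record (assumption: absent fields are skipped). */
    static List<String> collectValues(Map<String, String> record, String fieldName, String sourceField) {
        List<String> values = new ArrayList<>();
        for (String name : sourceFields(fieldName, sourceField)) {
            String value = record.get(name);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("Project1", "Apollo");
        record.put("Project3", "Gemini");
        // Mirrors the multi-value Projects descriptor from the employee template.
        System.out.println(collectValues(record, "Projects", "Project1,Project2,Project3"));
        // With no sourceField, the field name itself is looked up.
        System.out.println(sourceFields("Supervisor", null));
    }
}
```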
The fieldType determines what type of data the field represents. Valid values are enum, int, location, measure, text, and time. Each of these field descriptors may include an optional default tag, which defines the default value to be used if the field is not set. The field descriptor may also include a values tag, which is used by different field descriptors in different ways. These are listed below:
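enum: The values tag lists the valid values for the field, each in its own value tag, as in the Department field above.
measure: The values tag includes a type tag that names the measurement type, as in the Salary field above. prajna.data.typeUnit is used for the unit of measure. The values tag may also specify a unit tag, which specifies the unit of measure. If not specified, the default unit of measure for the specified measurement type is used.
time: The values tag includes a format tag, as in the Birth Date field above. The time is parsed with a java.text.SimpleDateFormat formatter, and the format string should be understandable by that class. The values tag may also include a durationUnits tag, which specifies the units of a time span that is specified by duration. Valid units are millis, seconds, minutes, hours, and days.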
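To show what "understandable by java.text.SimpleDateFormat" means in practice, here is a small Java sketch that parses values with the two format strings used in the employee template. TimeFormatSketch and parseMillis are hypothetical names. The UTC time zone is an assumption made so the result is reproducible; the quoted 'Z' in the pattern is matched as a literal character and does not set a zone by itself.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not part of the library.
class TimeFormatSketch {

    /** Parses text with the given SimpleDateFormat pattern and returns epoch milliseconds. */
    static long parseMillis(String pattern, String text) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.ROOT);
        // Assumption: source times are in UTC. The configuration text does not say.
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        formatter.setLenient(false); // reject out-of-range values such as month 13
        try {
            return formatter.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Value does not match format " + pattern, e);
        }
    }

    public static void main(String[] args) {
        // Format used by the Birth Date descriptor.
        System.out.println(parseMillis("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "1985-03-14T08:30:00.000Z"));
        // Format used by the When Employed descriptor.
        System.out.println(parseMillis("yyyy-MM-dd", "2010-06-01"));
    }
}
```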
The descriptor may also include a transform tag, which names a transform to apply to the values from the source field. In the example above, the Phone Extension field, which reads from the Extension source field, has the transform -x, which removes any 'x' from the value. This would convert "x234" to "234", which could then be parsed as an integer.
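The effect of the -x transform on the extension value can be sketched in Java as shown here. Only this one transform is described in the text, so the sketch assumes that a leading '-' means "remove the text that follows it"; other transform forms are not modeled. TransformSketch, applyRemoval, and parseExtension are hypothetical names.

```java
// Hypothetical helper for illustration; not part of the library.
class TransformSketch {

    /**
     * Applies a removal transform such as "-x" to a source value.
     * Assumption: a leading '-' means "remove every occurrence of the rest".
     */
    static String applyRemoval(String transform, String value) {
        if (transform == null || !transform.startsWith("-")) {
            return value; // other transform forms are not modeled in this sketch
        }
        String toRemove = transform.substring(1);
        return value.replace(toRemove, "");
    }

    /** Cleans a raw extension such as "x234" and parses the result as an int field. */
    static int parseExtension(String raw) {
        return Integer.parseInt(applyRemoval("-x", raw).trim());
    }

    public static void main(String[] args) {
        System.out.println(applyRemoval("-x", "x234"));
        System.out.println(parseExtension("x234"));
    }
}
```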
The time field descriptor may refer to a sourceStartField and either a sourceStopField or a durationField instead of a sourceField. This is used when the time is a span of time, rather than a single instant. These attributes specify which source fields are used to construct the span of time.
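A span built from a start time plus a duration can be sketched in Java as follows, using the durationUnits names listed earlier (millis, seconds, minutes, hours, days). TimeSpanSketch and its methods are hypothetical names, and representing the span as a two-element array of epoch milliseconds is an assumption made for brevity.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the library.
class TimeSpanSketch {

    /** Converts a duration expressed in one of the durationUnits names to milliseconds. */
    static long toMillis(long duration, String units) {
        switch (units) {
            case "millis":  return duration;
            case "seconds": return TimeUnit.SECONDS.toMillis(duration);
            case "minutes": return TimeUnit.MINUTES.toMillis(duration);
            case "hours":   return TimeUnit.HOURS.toMillis(duration);
            case "days":    return TimeUnit.DAYS.toMillis(duration);
            default:
                throw new IllegalArgumentException("Unknown durationUnits: " + units);
        }
    }

    /** Builds {start, stop} from a start time and a duration (the sourceStartField plus durationField case). */
    static long[] spanFromDuration(long startMillis, long duration, String units) {
        return new long[] { startMillis, startMillis + toMillis(duration, units) };
    }

    /** Builds {start, stop} from two times (the sourceStartField plus sourceStopField case). */
    static long[] spanFromStop(long startMillis, long stopMillis) {
        if (stopMillis < startMillis) {
            throw new IllegalArgumentException("Stop time precedes start time");
        }
        return new long[] { startMillis, stopMillis };
    }

    public static void main(String[] args) {
        long[] call = spanFromDuration(0L, 90, "seconds");
        System.out.println(call[0] + " to " + call[1]);
    }
}
```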
Note that the type for unstructured text is text, not string. Several field types, notably the enum field and the location field, may also specify strings. Identifying the unstructured text field as text helps to define the expected use of the field.
<dataTemplate name="phoneCall" nameKey="callId">
  <fieldDesc fieldName="Call Time" fieldType="time" sourceStartField="Start" durationField="Call_Time">
    <values>
      <format>yyyy-MM-dd'T'HH:mm:ss.SSS'Z'</format>
      <durationUnits>seconds</durationUnits>
    </values>
  </fieldDesc>
  <fieldDesc fieldName="Sender" fieldType="text" />
  <fieldDesc fieldName="Receiver" fieldType="text" />
</dataTemplate>
This is a second dataTemplate in the same file, which identifies a series of phone calls. An accessor may have any number of templates.
Below the templates are the definitions for the various data structures. Each of these definitions specifies the templates used to construct the structure. These template references may refer to multiple templates, separated by a space. Certain structures also include particular field references. For instance, the tree structure would need to identify which field of a node refers to its parent.
<dataset name="workers" itemClass="employee" />
This defines a dataset called workers, which uses the employee data template.
<graph name="calls" nodeClass="employee" edgeClass="phoneCall" origField="Sender" destField="Receiver" directed="true" />
This defines a directed graph of phone calls. The employee template is used for the nodes, and the phoneCall template is used for the edges.
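To make the roles of origField, destField, and directed concrete, the Java sketch below builds an adjacency map from phone call records. It is an illustration only: GraphSketch, call, and adjacency are hypothetical names, records are modeled as plain Maps, and the library's own graph classes are not used or described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library.
class GraphSketch {

    /** Builds a minimal phoneCall record with only the two endpoint fields. */
    static Map<String, String> call(String sender, String receiver) {
        Map<String, String> record = new HashMap<>();
        record.put("Sender", sender);
        record.put("Receiver", receiver);
        return record;
    }

    /** Maps each origin node to the nodes it connects to; undirected graphs also add the reverse link. */
    static Map<String, List<String>> adjacency(List<Map<String, String>> edges,
            String origField, String destField, boolean directed) {
        Map<String, List<String>> neighbors = new LinkedHashMap<>();
        for (Map<String, String> edge : edges) {
            String orig = edge.get(origField);
            String dest = edge.get(destField);
            neighbors.computeIfAbsent(orig, k -> new ArrayList<>()).add(dest);
            if (!directed) {
                neighbors.computeIfAbsent(dest, k -> new ArrayList<>()).add(orig);
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        List<Map<String, String>> calls = new ArrayList<>();
        calls.add(call("Ann", "Bob"));
        calls.add(call("Ann", "Cy"));
        // Same field names as the calls graph definition.
        System.out.println(adjacency(calls, "Sender", "Receiver", true));
    }
}
```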
<tree name="orgChart" treeNodeClass="employee" parentField="Supervisor" />
</dataConfig>
This last segment defines a tree representing the company organizational chart. The names of the various structures are used by the various getStructure calls, so calling getGraph("calls") would return the graph of phone calls.
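The way a parentField links records into a tree can be sketched in Java as follows. The sketch assumes that the Supervisor value of a record matches the Full Name (the nameKey) of the parent record, and that a record with no Supervisor is a root; neither detail is stated in the text. TreeSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library.
class TreeSketch {

    /** Builds a minimal employee record; supervisor may be null for the top of the chart. */
    static Map<String, String> employee(String fullName, String supervisor) {
        Map<String, String> record = new HashMap<>();
        record.put("Full Name", fullName);
        if (supervisor != null) {
            record.put("Supervisor", supervisor);
        }
        return record;
    }

    /** Groups record names under the name found in their parent field. */
    static Map<String, List<String>> childrenByParent(List<Map<String, String>> records,
            String nameKey, String parentField) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            String parent = record.get(parentField);
            if (parent != null && !parent.isEmpty()) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(record.get(nameKey));
            }
        }
        return children;
    }

    /** Lists records with no parent value (assumption: these are the tree roots). */
    static List<String> roots(List<Map<String, String>> records, String nameKey, String parentField) {
        List<String> result = new ArrayList<>();
        for (Map<String, String> record : records) {
            String parent = record.get(parentField);
            if (parent == null || parent.isEmpty()) {
                result.add(record.get(nameKey));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> staff = new ArrayList<>();
        staff.add(employee("Dana Lee", null));
        staff.add(employee("Sam Ortiz", "Dana Lee"));
        staff.add(employee("Kim Park", "Dana Lee"));
        System.out.println(roots(staff, "Full Name", "Supervisor"));
        System.out.println(childrenByParent(staff, "Full Name", "Supervisor"));
    }
}
```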
The configuration file may include ontology and reasoning tags, following the last structure definition, before the closing tag. These tags are described in the section on Ontologies.