The Document Corpus

Prajna represents collections of unstructured documents with the DocCorpus class. A DocCorpus is a collection of DocData objects, which can represent various types of unstructured documents. The DocCorpus also provides several utilities, such as computing similarity graphs.

The DocCorpus class supports two types of graph for a document corpus. The first is the similarity graph, an undirected graph with weighted edges. The documents form the nodes of the graph, and each edge is weighted by the similarity score of the two documents it connects.
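To make the structure of a similarity graph concrete, the following is a minimal, self-contained sketch. It does not use Prajna's API: the SimilarityGraphSketch class, its Edge type, and the buildGraph method are hypothetical names, and Jaccard similarity over term sets is an assumed measure chosen only for illustration. The measure that DocCorpus actually applies is not specified here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; these types are NOT part of Prajna's DocCorpus API.
class SimilarityGraphSketch {

    // One undirected, weighted edge between two documents, identified by name.
    static final class Edge {
        final String docA;
        final String docB;
        final double weight;

        Edge(String docA, String docB, double weight) {
            this.docA = docA;
            this.docB = docB;
            this.weight = weight;
        }
    }

    // Jaccard similarity of two term sets: |intersection| / |union|.
    // This is an assumed similarity measure, used here only as an example.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int unionSize = a.size() + b.size() - common.size();
        return (double) common.size() / unionSize;
    }

    // Documents are the nodes. Each unordered pair of documents whose score
    // exceeds the threshold is stored once, which makes the graph undirected.
    static List<Edge> buildGraph(Map<String, Set<String>> docs, double threshold) {
        List<String> names = new ArrayList<>(docs.keySet());
        List<Edge> edges = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                double score = jaccard(docs.get(names.get(i)), docs.get(names.get(j)));
                if (score > threshold) {
                    edges.add(new Edge(names.get(i), names.get(j), score));
                }
            }
        }
        return edges;
    }
}
```

The threshold parameter is also an assumption of this sketch; it keeps document pairs with no meaningful similarity from producing edges.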

The second type of graph is the key graph, which uses the keys of the documents as its nodes. Each edge is an AccumEdge that connects a pair of keys and is composed of the documents which share both keys.
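The idea of an edge that accumulates documents can be sketched as follows. This is an illustration under stated assumptions, not Prajna's implementation: KeyGraphSketch and buildKeyGraph are hypothetical names, a plain map from key pair to document set stands in for the AccumEdge class, and keys are assumed not to contain the "|" character used to label a pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; it does not use Prajna's DocCorpus or AccumEdge classes.
class KeyGraphSketch {

    // Maps each unordered key pair, labeled "k1|k2" with k1 before k2 in sort
    // order, to the set of documents that contain both keys.
    // Assumes keys never contain the '|' separator.
    static Map<String, Set<String>> buildKeyGraph(Map<String, Set<String>> docKeys) {
        Map<String, Set<String>> edges = new TreeMap<>();
        for (Map.Entry<String, Set<String>> doc : docKeys.entrySet()) {
            List<String> keys = new ArrayList<>(doc.getValue());
            Collections.sort(keys); // sorted so each pair has one canonical label
            for (int i = 0; i < keys.size(); i++) {
                for (int j = i + 1; j < keys.size(); j++) {
                    String pair = keys.get(i) + "|" + keys.get(j);
                    // Accumulate this document onto the edge for the key pair.
                    edges.computeIfAbsent(pair, p -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return edges;
    }
}
```

In this sketch the number of documents accumulated on an edge could serve as its weight, although whether the key graph weights its edges that way is not stated here.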

The DocCorpus class will be extended to offer additional capabilities in the future.