The Semantics.Datacenter Jobs Framework provides a powerful mechanism for processing and loading very large data sets into an RDF graph. For very large data sets it can be impractical to load all of the data into memory at once for processing. The solution the Jobs Framework provides is based on the following approach:
- Breaking the input data files into a larger set of smaller files that can processed independently, as tasks, by the worker nodes in the Semantics.Datacenter cluster.
- Each task running on the worker nodes then combine a subset of these data files into a new set of files that provide the required date or serve as input into another processing step (repeat step #1).
In the case of a typical data load job, it would look like:
- Run multiple tasks to split the input data files into a smaller set of data files. These files may or may not be RDF data files.
- Run multiple tasks to process the data files from step #1. Each task would operate on a subset of the files produced in step #1. A task would produce RDF statements and write each statement to a file that is partitioned typically by subject. These files provide an efficient binary representation of the RDF statements and are referred to as shard files.
- Run multiple tasks that each produce an in-memory graph partition file that is suitable for handling queries. Each task would gather the shard files that contain the RDF data for its respective partition an produce a partition file.
- (Optional) Run multiple tasks to compile each partition file into a read-only partition file. These files are much smaller than a writable partition file and therefore will consume less memory once loaded into a Semantics.Datacenter graph. This step can be skipped if a writable graph is desired.
- Run multiple tasks, one for each partition, to load the partition files into a Semantics.Datacenter hosted graph.
For more information on Semantics.Datacenter Partitioned In-Memory Graphs read this posting.
For more information on setting up a Semantics.Datacenter cluster read this posting.
Workflow Job
Semantics.Datacenter allows you to create jobs based on the Windows Workflow Foundation. Semantics.Datacenter includes a set of pre-build Workflow Activities that can be used to load your data. You can also create your own Workflow Activities using C# if you require some custom processing that is part of the data load or create your own Job type that is not based on Windows Workflows. In this article I will focus on the Workflow Job using our standard Activities.
Job Workspace
A data load job requires a set of directories that can be used by the tasks for processing data. Before creating your Workflow you should figure out this structure. The table below describes the directory structure I will use for my job.
| Directory | Contents |
| c:\job\rdf | The input RDF files. |
| c:\job\rdf\split | The split set of RDF files. |
| c:\job\parts\shards | The shard files resulting from parsing the RDF files. |
| c:\job\parts | The partition files generated from the shard files. |
| c:\job\parts\ro | The read-only partition files compiled from the writable partition files. |
Creating the Workflow Job
The Workflow Job can be created using the Model Manager. Connect to the Semantics.Datacenter cluster coordinator server and select the “Create…” option from the Jobs context menu in the tree control. This will display the Workflow Job designer where Activities can be added and configured to a Workflow Job. To add an Activity simply select one from the list and drag it into the workflow. Select the Activity in the workflow to display its property page where the activity is configured.
The following sections will describe the Activities I added to my Workflow Job in the order that they will be executed.
SplitFilesActivity
The SplitFilesActivity will split all the RDF files in a directory into a set of smaller files. The table below lists the settings for this activity.
| Property | Value | Description |
| Name | SplitFiles | The name of the activity instance. |
| InputDirectory | c:\job\rdf | The directory containing the input RDF files |
| OutputDirectory | c:\job\rdf\split | The output directory containing the split RDF files. |
| NumLines | 50000 | The maximum number of statements allowed in an output file. |
| ParseType | NTriples | The format of the input RDF files. |
| OverwriteDirectory | True | Deletes the contents of the output directory before processing the input. |
ParseFileActivity
The ParseFilesActivity will parse all the RDF files in a directory into a set of shard files that contain a binary representation of the RDF statements. The table below lists the settings for this activity.
| Property | Value | Description |
| Name | ParseFiles | The name of the activity instance. |
| InputDirectory | SplitFiles.OutputDirectory | The directory containing the input RDF files |
| OutputDirectory | c:\job\parts\shards | The output directory containing the shard files. |
| NumLines | 50000 | The maximum number of statements allowed in an output file. |
| ParseType | NTriples | The format of the input RDF files. |
| OverwriteDirectory | True | Deletes the contents of the output directory before processing the input. |
| BatchSize | 10 | The maximum number of input files to be processed by a single task. |
| PartitionCount | 16 | The number of partition files that can be created from the output shard files. |
| IgnoreParseErrors | False | If set to True any parse errors will be ignored and processing will continue. |
LoadPartitionsActivity
The LoadPartitionsActivity will load all of the shard files into a set of graph partition files. The table below lists the settings for this activity.
| Property | Value | Description |
| Name | CreatePartitions | The name of the activity instance. |
| InputDirectory | ParseFiles.OutputDirectory | The directory containing the input shard files |
| OutputDirectory | c:\job\parts | The output directory containing the partition files. |
| Arity | 3 | The arity of the graph (3=triples, 4=quads) |
| IncludeCaseInsensitiveIndex | False | If True, a case-insensitive index is created |
| IncludeFullTextIndex | False | If True, a full-text index is created |
| PartitionCount | ParseFiles.PartitionCount | The number of partition files that will be created from the shard files. |
CompilePartitionsActivity
The CompilePartitionsActivity will compile a set of writable partition files into a set of read-only partition files.The table below lists the settings for this activity.
| Property | Value | Description |
| Name | CreateReadOnly | The name of the activity instance. |
| InputDirectory | CreatePartitions.OutputDirectory | The directory containing the writable partition files |
| OutputDirectory | c:\job\parts\ro | The output directory containing the read-only partition files. |
LoadGraphPartitionsActivity
The LoadGraphPartitionsActivity will load a set of partition files into a Semantics.Datacenter hosted graph. The graph must exist otherwise the CreateGraphActivity should be used prior to this activity to create it. The table below lists the settings for this activity.
| Property | Value | Description |
| Name | LoadGraph | The name of the activity instance. |
| InputDirectory | CreateReadOnly.OutputDirectory | The directory containing the partition files |
| ConnectionString | net.tcp://host:7055/DataService | The connection string to the Semantics.Datacenter server that is hosting the graph to be loaded. |
| GraphUri | http://example.org/data | The URI of the graph. |
Save and Run
After creating the Workflow Job you should save it to a file so it can be used again, especially if your job fails for some reason. This is done by clicking on the “Save…” button and choosing a file name. Then click on “OK” button on the Workflow Job designer and to submit the job to the job queue where its tasks will be processed by one or more worker nodes in the Semantics.Datacenter cluster.