Thursday, September 30, 2010

Loading large data sets via the Jobs Framework

The Semantics.Datacenter Jobs Framework provides a powerful mechanism for processing and loading very large data sets into an RDF graph. For very large data sets it can be impractical to load all of the data into memory at once for processing. The solution the Jobs Framework provides is based on the following approach:

  1. Breaking the input data files into a larger set of smaller files that can processed independently, as tasks, by the worker nodes in the Semantics.Datacenter cluster.
  2. Each task running on the worker nodes then combine a subset of these data files into a new set of files that provide the required date or serve as input into another processing step (repeat step #1).

In the case of a typical data load job, it would look like:

  1. Run multiple tasks to split the input data files into a smaller set of data files. These files may or may not be RDF data files.
  2. Run multiple tasks to process the data files from step #1. Each task would operate on a subset of the files produced in step #1. A task would produce RDF statements and write each statement to a file that is partitioned typically by subject. These files provide an efficient binary representation of the RDF statements and are referred to as shard files.
  3. Run multiple tasks that each produce an in-memory graph partition file that is suitable for handling queries. Each task would gather the shard files that contain the RDF data for its respective partition an produce a partition file.
  4. (Optional) Run multiple tasks to compile each partition file into a read-only partition file. These files are much smaller than a writable partition file and therefore will consume less memory once loaded into a Semantics.Datacenter graph. This step can be skipped if a writable graph is desired.
  5. Run multiple tasks, one for each partition, to load the partition files into a Semantics.Datacenter hosted graph.

For more information on Semantics.Datacenter Partitioned In-Memory Graphs read this posting.
For more information on setting up a Semantics.Datacenter cluster read this posting.

Workflow Job

Semantics.Datacenter allows you to create jobs based on the Windows Workflow Foundation. Semantics.Datacenter includes a set of pre-build Workflow Activities that can be used to load your data. You can also create your own Workflow Activities using C# if you require some custom processing that is part of the data load or create your own Job type that is not based on Windows Workflows. In this article I will focus on the Workflow Job using our standard Activities.

Job Workspace

A data load job requires a set of directories that can be used by the tasks for processing data. Before creating your Workflow you should figure out this structure. The table below describes the directory structure I will use for my job.

Directory Contents
c:\job\rdf The input RDF files.
c:\job\rdf\split The split set of RDF files.
c:\job\parts\shards The shard files resulting from parsing the RDF files.
c:\job\parts The partition files generated from the shard files.
c:\job\parts\ro The read-only partition files compiled from the writable partition files.
Creating the Workflow Job

The Workflow Job can be created using the Model Manager. Connect to the Semantics.Datacenter cluster coordinator server and select the “Create…” option from the Jobs context menu in the tree control. This will display the Workflow Job designer where Activities can be added and configured to a Workflow Job. To add an Activity simply select one from the list and drag it into the workflow. Select the Activity in the workflow to display its property page where the activity is configured.

image

The following sections will describe the Activities I added to my Workflow Job in the order that they will be executed.

SplitFilesActivity

The SplitFilesActivity will split all the RDF files in a directory into a set of smaller files. The table below lists the settings for this activity.

Property Value Description
Name SplitFiles The name of the activity instance.
InputDirectory c:\job\rdf The directory containing the input RDF files
OutputDirectory c:\job\rdf\split The output directory containing the split RDF files.
NumLines 50000 The maximum number of statements allowed in an output file.
ParseType NTriples The format of the input RDF files.
OverwriteDirectory True Deletes the contents of the output directory before processing the input.
ParseFileActivity

The ParseFilesActivity will parse all the RDF files in a directory into a set of shard files that contain a binary representation of the RDF statements. The table below lists the settings for this activity.

Property Value Description
Name ParseFiles The name of the activity instance.
InputDirectory SplitFiles.OutputDirectory The directory containing the input RDF files
OutputDirectory c:\job\parts\shards The output directory containing the shard files.
NumLines 50000 The maximum number of statements allowed in an output file.
ParseType NTriples The format of the input RDF files.
OverwriteDirectory True Deletes the contents of the output directory before processing the input.
BatchSize 10 The maximum number of input files to be processed by a single task.
PartitionCount 16 The number of partition files that can be created from the output shard files.
IgnoreParseErrors False If set to True any parse errors will be ignored and processing will continue.
LoadPartitionsActivity

The LoadPartitionsActivity will load all of the shard files into a set of graph partition files. The table below lists the settings for this activity.

Property Value Description
Name CreatePartitions The name of the activity instance.
InputDirectory ParseFiles.OutputDirectory The directory containing the input shard files
OutputDirectory c:\job\parts The output directory containing the partition files.
Arity 3 The arity of the graph (3=triples, 4=quads)
IncludeCaseInsensitiveIndex False If True, a case-insensitive index is created
IncludeFullTextIndex False If True, a full-text index is created
PartitionCount ParseFiles.PartitionCount The number of partition files that will be created from the shard files.
CompilePartitionsActivity

The CompilePartitionsActivity will compile a set of writable partition files into a set of read-only partition files.The table below lists the settings for this activity.

Property Value Description
Name CreateReadOnly The name of the activity instance.
InputDirectory CreatePartitions.OutputDirectory The directory containing the writable partition files
OutputDirectory c:\job\parts\ro The output directory containing the read-only partition files.
LoadGraphPartitionsActivity

The LoadGraphPartitionsActivity will load a set of partition files into a Semantics.Datacenter hosted graph. The graph must exist otherwise the CreateGraphActivity should be used prior to this activity to create it. The table below lists the settings for this activity.

Property Value Description
Name LoadGraph The name of the activity instance.
InputDirectory CreateReadOnly.OutputDirectory The directory containing the partition files
ConnectionString net.tcp://host:7055/DataService The connection string to the Semantics.Datacenter server that is hosting the graph to be loaded.
GraphUri http://example.org/data The URI of the graph.
Save and Run

After creating the Workflow Job you should save it to a file so it can be used again, especially if your job fails for some reason. This is done by clicking on the “Save…” button and choosing a file name. Then click on “OK” button on the Workflow Job designer and to submit the job to the job queue where its tasks will be processed by one or more worker nodes in the Semantics.Datacenter cluster.

0 comments:

Post a Comment