Monday, October 11, 2010

Querying via a ClientModel

The .NET API available in the Semantics Platform allows an application to query one or more sources of data in a federated fashion using SPARQL. The data sources can be local in-memory graphs or remote graphs hosted on another server. The ClientModel class provides the primary interface for executing SPARQL queries.

In this posting I will go through a simple example of how to use a ClientModel to query the contents of two RDF files.

I will start by constructing two in-memory GraphDataSource objects. Each data source will be filled with the contents of a local RDF file.

GraphDataSource dataGraph = new GraphDataSource();

// Pass the reader to Read so the contents of the file are parsed into the graph.
using (StreamReader r = new StreamReader("c:\\data.nt"))
    dataGraph.Read<NTriplesReader>(r);

GraphDataSource ontologyGraph = new GraphDataSource();

using (StreamReader r = new StreamReader("c:\\ontology.nt"))
    ontologyGraph.Read<NTriplesReader>(r);

Next I construct the ClientModel object and add both graph objects to its DataSources collection. Each data source added to a ClientModel must have a unique URI to identify it.

ClientModel model = new ClientModel();

model.DataSources["http://example.org/data"] = dataGraph;
model.DataSources["http://example.org/ontology"] = ontologyGraph;

As a convenience, I will define some namespaces on the ClientModel to make my SPARQL queries easier to write.

model.ParserOptions.Namespaces["rdf"] = NS.Rdf;
model.ParserOptions.Namespaces["rdfs"] = NS.Rdfs;
model.ParserOptions.Namespaces["x"] = "http://example.org/";

To execute the SPARQL query I simply call the Query method with a valid SPARQL query string. Note that because I defined the namespaces on the ClientModel object, I do not need to include them in the query using PREFIX statements.

Table results = model.Query(@"
    select ?label 
    from x:data 
    from x:ontology
    where {
        ?s rdf:type ?type; rdfs:label ?label.
        ?type rdfs:subClassOf x:Thing.
    }");
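For comparison, here is a sketch of how the same query could be written as a standalone string with explicit PREFIX declarations. The rdf and rdfs URIs are the standard W3C namespaces, which I am assuming is what NS.Rdf and NS.Rdfs resolve to, and the variable name standaloneQuery is only for illustration. Note that x:data and x:ontology expand to the same URIs that were used as keys in the DataSources collection.

```csharp
using System;

// The same query with its prefixes spelled out.
// Assumption: NS.Rdf and NS.Rdfs correspond to the standard W3C URIs below.
string standaloneQuery = @"
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX x: <http://example.org/>

    select ?label
    from x:data
    from x:ontology
    where {
        ?s rdf:type ?type; rdfs:label ?label.
        ?type rdfs:subClassOf x:Thing.
    }";

Console.WriteLine(standaloneQuery);
```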

Finally I write the results of the query to the console window by iterating over the rows in the result table.

foreach (TableRow row in results.GetRows())
    Console.WriteLine(row[0].Value);

Thursday, September 30, 2010

Loading large data sets via the Jobs Framework

The Semantics.Datacenter Jobs Framework provides a powerful mechanism for processing and loading very large data sets into an RDF graph. For very large data sets it can be impractical to load all of the data into memory at once for processing. The solution the Jobs Framework provides is based on the following approach:

  1. Breaking the input data files into a larger set of smaller files that can be processed independently, as tasks, by the worker nodes in the Semantics.Datacenter cluster.
  2. Each task running on the worker nodes then combines a subset of these data files into a new set of files that either provide the required data or serve as input to another processing step (repeat step #1).
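To make the first step concrete, here is a minimal sketch of splitting a line-based input (such as N-Triples, where each line holds one statement) into chunks with a maximum number of lines. It is not the Jobs Framework implementation; the SplitIntoChunks helper and the sample data are hypothetical, and each chunk simply stands in for one of the smaller output files.

```csharp
using System;
using System.Collections.Generic;

// Groups input lines into chunks of at most maxLines lines each.
// Each chunk stands in for one of the smaller output files.
static List<List<string>> SplitIntoChunks(IEnumerable<string> lines, int maxLines)
{
    var chunks = new List<List<string>>();
    var current = new List<string>();

    foreach (string line in lines)
    {
        current.Add(line);

        // Start a new chunk once the current one is full.
        if (current.Count == maxLines)
        {
            chunks.Add(current);
            current = new List<string>();
        }
    }

    // Keep the final, partially filled chunk.
    if (current.Count > 0)
        chunks.Add(current);

    return chunks;
}

// Example: 5 statements split into chunks of at most 2 statements each.
var sampleLines = new List<string> { "s1", "s2", "s3", "s4", "s5" };
List<List<string>> result = SplitIntoChunks(sampleLines, 2);

Console.WriteLine(result.Count); // prints 3
```

Because each chunk is independent of the others, the chunks can be handed to different tasks without any coordination between them.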

In the case of a typical data load job, it would look like:

  1. Run multiple tasks to split the input data files into a larger set of smaller data files. These files may or may not be RDF data files.
  2. Run multiple tasks to process the data files from step #1. Each task would operate on a subset of the files produced in step #1. A task would produce RDF statements and write each statement to a file that is typically partitioned by subject. These files provide an efficient binary representation of the RDF statements and are referred to as shard files.
  3. Run multiple tasks that each produce an in-memory graph partition file that is suitable for handling queries. Each task would gather the shard files that contain the RDF data for its respective partition and produce a partition file.
  4. (Optional) Run multiple tasks to compile each partition file into a read-only partition file. These files are much smaller than a writable partition file and therefore will consume less memory once loaded into a Semantics.Datacenter graph. This step can be skipped if a writable graph is desired.
  5. Run multiple tasks, one for each partition, to load the partition files into a Semantics.Datacenter hosted graph.
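The idea of partitioning statements by subject in step 2 can be sketched as follows. This is only an illustration, since I have not described how Semantics.Datacenter actually assigns subjects to partitions: the StableHash and PartitionFor helpers, the FNV-1a style hash, and the sample statements are all assumptions. The key property is that every statement with the same subject lands in the same partition, no matter which task processes it.

```csharp
using System;
using System.Collections.Generic;

// Deterministic FNV-1a style hash over the UTF-16 code units of a string.
// string.GetHashCode() is avoided because it can differ between processes,
// and every task must agree on which partition a subject belongs to.
static uint StableHash(string text)
{
    uint hash = 2166136261;
    foreach (char c in text)
    {
        hash ^= c;
        hash *= 16777619;
    }
    return hash;
}

// Maps a subject to a partition index in the range [0, partitionCount).
static int PartitionFor(string subject, int partitionCount)
{
    return (int)(StableHash(subject) % (uint)partitionCount);
}

// Hypothetical N-Triples statements; the subject is the first term on each line.
string[] statements =
{
    "<http://example.org/a> <http://example.org/p> \"1\" .",
    "<http://example.org/b> <http://example.org/p> \"2\" .",
    "<http://example.org/a> <http://example.org/q> \"3\" .",
};

const int totalPartitions = 16;
var shards = new Dictionary<int, List<string>>();

foreach (string statement in statements)
{
    // Naive subject extraction: everything up to the first space.
    string subject = statement.Substring(0, statement.IndexOf(' '));
    int partition = PartitionFor(subject, totalPartitions);

    if (!shards.TryGetValue(partition, out var shard))
    {
        shard = new List<string>();
        shards[partition] = shard;
    }
    shard.Add(statement);
}

foreach (KeyValuePair<int, List<string>> entry in shards)
    Console.WriteLine($"partition {entry.Key}: {entry.Value.Count} statement(s)");
```

Grouping by subject means the task that later builds a given partition file only needs to gather the shards for its own partition index.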

For more information on Semantics.Datacenter Partitioned In-Memory Graphs read this posting.
For more information on setting up a Semantics.Datacenter cluster read this posting.

Workflow Job

Semantics.Datacenter allows you to create jobs based on the Windows Workflow Foundation. Semantics.Datacenter includes a set of pre-built Workflow Activities that can be used to load your data. If the data load requires custom processing, you can also create your own Workflow Activities using C#, or create your own Job type that is not based on Windows Workflows. In this article I will focus on the Workflow Job using our standard Activities.

Job Workspace

A data load job requires a set of directories that can be used by the tasks for processing data. Before creating your Workflow you should figure out this structure. The table below describes the directory structure I will use for my job.

Directory             Contents
c:\job\rdf            The input RDF files.
c:\job\rdf\split      The split set of RDF files.
c:\job\parts\shards   The shard files resulting from parsing the RDF files.
c:\job\parts          The partition files generated from the shard files.
c:\job\parts\ro       The read-only partition files compiled from the writable partition files.

Creating the Workflow Job

The Workflow Job can be created using the Model Manager. Connect to the Semantics.Datacenter cluster coordinator server and select the “Create…” option from the Jobs context menu in the tree control. This will display the Workflow Job designer, where Activities can be added to a Workflow Job and configured. To add an Activity, simply select one from the list and drag it into the workflow. Select the Activity in the workflow to display its property page, where the Activity is configured.

image

The following sections will describe the Activities I added to my Workflow Job in the order that they will be executed.

SplitFilesActivity

The SplitFilesActivity will split all the RDF files in a directory into a set of smaller files. The table below lists the settings for this activity.

Property            Value              Description
Name                SplitFiles         The name of the activity instance.
InputDirectory      c:\job\rdf         The directory containing the input RDF files.
OutputDirectory     c:\job\rdf\split   The output directory containing the split RDF files.
NumLines            50000              The maximum number of statements allowed in an output file.
ParseType           NTriples           The format of the input RDF files.
OverwriteDirectory  True               Deletes the contents of the output directory before processing the input.

ParseFilesActivity

The ParseFilesActivity will parse all the RDF files in a directory into a set of shard files that contain a binary representation of the RDF statements. The table below lists the settings for this activity.

Property            Value                       Description
Name                ParseFiles                  The name of the activity instance.
InputDirectory      SplitFiles.OutputDirectory  The directory containing the input RDF files.
OutputDirectory     c:\job\parts\shards         The output directory containing the shard files.
NumLines            50000                       The maximum number of statements allowed in an output file.
ParseType           NTriples                    The format of the input RDF files.
OverwriteDirectory  True                        Deletes the contents of the output directory before processing the input.
BatchSize           10                          The maximum number of input files to be processed by a single task.
PartitionCount      16                          The number of partition files that can be created from the output shard files.
IgnoreParseErrors   False                       If set to True, any parse errors will be ignored and processing will continue.

LoadPartitionsActivity

The LoadPartitionsActivity will load all of the shard files into a set of graph partition files. The table below lists the settings for this activity.

Property                     Value                       Description
Name                         CreatePartitions            The name of the activity instance.
InputDirectory               ParseFiles.OutputDirectory  The directory containing the input shard files.
OutputDirectory              c:\job\parts                The output directory containing the partition files.
Arity                        3                           The arity of the graph (3=triples, 4=quads).
IncludeCaseInsensitiveIndex  False                       If True, a case-insensitive index is created.
IncludeFullTextIndex         False                       If True, a full-text index is created.
PartitionCount               ParseFiles.PartitionCount   The number of partition files that will be created from the shard files.

CompilePartitionsActivity

The CompilePartitionsActivity will compile a set of writable partition files into a set of read-only partition files. The table below lists the settings for this activity.

Property         Value                             Description
Name             CreateReadOnly                    The name of the activity instance.
InputDirectory   CreatePartitions.OutputDirectory  The directory containing the writable partition files.
OutputDirectory  c:\job\parts\ro                   The output directory containing the read-only partition files.

LoadGraphPartitionsActivity

The LoadGraphPartitionsActivity will load a set of partition files into a Semantics.Datacenter hosted graph. The graph must already exist; otherwise, the CreateGraphActivity should be used prior to this activity to create it. The table below lists the settings for this activity.

Property          Value                            Description
Name              LoadGraph                        The name of the activity instance.
InputDirectory    CreateReadOnly.OutputDirectory   The directory containing the partition files.
ConnectionString  net.tcp://host:7055/DataService  The connection string to the Semantics.Datacenter server that is hosting the graph to be loaded.
GraphUri          http://example.org/data          The URI of the graph.

Save and Run

After creating the Workflow Job you should save it to a file so it can be used again, especially if your job fails for some reason. This is done by clicking the “Save…” button and choosing a file name. Then click the “OK” button in the Workflow Job designer to submit the job to the job queue, where its tasks will be processed by one or more worker nodes in the Semantics.Datacenter cluster.

Wednesday, September 29, 2010

Semantics Platform v2.0 Maintenance Release Available.

(September 29, 2010) A maintenance release of the Semantics Platform v2.0.1.4 is now available for download from the Intellidimension web site at:

http://www.intellidimension.com/downloads/

This release includes the following updates:

  1. Fixed issue with range constraints in read-only in-memory graphs.
  2. Fixes to SPARQL query compiler.
  3. Query engine optimizations.
  4. Resolved issue with compilation of free-text expressions that include Unicode characters.
  5. Enhancements to built-in workflow tasks in the Semantics.Datacenter Jobs Framework.
  6. Improvements to job workflow editor in Model Manager.
  7. Other minor bug fixes.

Monday, September 27, 2010

Partitioned In-Memory Graphs

The Semantics Platform supports partitioned in-memory graphs for storing and querying RDF data. These graphs use an extremely compact indexing format that minimizes the system memory required. As the name implies, the data in the graph can be partitioned to allow parallel read and write access to the data. This parallel access can greatly improve query execution times. Partitioned in-memory graphs manage all their data within the memory space of a single computer process. This is in contrast to a distributed graph, which can manage RDF data in multiple processes on multiple machines (I will discuss these in another post).

A writable partitioned in-memory graph can be compiled into a read-only form that requires even less memory to hold the same amount of data.

In the case of Semantics.Datacenter, partitioned in-memory graphs are backed by files, with support for journaling, to provide persistent storage of RDF data. Each partition in the graph is stored in a separate file. This enables parallel read/write access to the data in the graph and vastly improves data load times.

Creating a Partitioned In-Memory Graph

Using the Model Manager you can create a partitioned in-memory graph in a local client model or on a running instance of a Semantics.Datacenter server. The image below shows the screen for creating a graph on Semantics.Datacenter. The graph can be configured to store either triples or quads and also supports full-text indexing.

image

Loading a Partitioned In-Memory Graph

In a previous posting I discussed how to load data into a partitioned in-memory graph hosted in a local client model using the Model Manager (see Working with large RDF files in the Model Manager). However, in the case of a graph hosted by Semantics.Datacenter, this is accomplished much more efficiently by using the Semantics.Datacenter Jobs Framework. The Jobs Framework allows you to partition the data load process into a set of tasks that can be run in parallel, which often requires much less memory. I will discuss this process in another post.

Thursday, September 23, 2010

An introduction to Semantics.Datacenter clusters

Semantics.Datacenter is based on Intellidimension’s proprietary distributed graph store called IMDB. Semantics.Datacenter is deployed in a clustered configuration. A Semantics.Datacenter cluster is a set of network addressable services (“servers”) based on WCF that can be running on one or more computers. A cluster must have one coordinator server and one or more worker servers. A cluster also requires a shared directory that is accessible by all servers in the cluster for both reading and writing files. If the cluster is deployed on multiple computers, the shared directory must be on a networked storage device. The coordinator and worker nodes can also be deployed to a single computer using a local file system directory.

image

The coordinator server manages all the activities for the cluster and is generally not used for application data storage or processing. The coordinator server manages the following:

  • The Job Queue
  • Server failover
  • Replication Coordination

The worker servers in the cluster are responsible for providing the bulk of the services for an application such as:

  • Hosting of Graphs
  • Execution of Tasks (a unit of work in a Job)

Deploying a cluster involves deciding on the hardware resources that you wish to use. This includes the physical computers as well as any network storage that will be required. Each server in the cluster will also require its own network address and the cluster coordinator network address must be reachable by all the worker servers in the cluster.

Step 1: Install Semantics.Datacenter
The first step is to install the product on each computer that will be used in the cluster.

Step 2: Install a Coordinator Server
Launch the IMDB Setup utility on the computer that will be hosting the coordinator server. From the main screen select the option to “Add a new server” followed by the option to install “A coordinator node in a new cluster”. Follow the steps on the remaining screens, which will prompt you for a network address and the location of the shared directory for the cluster.

image

Step 3: Install the Worker Server(s)
For each worker server on each computer, repeat Step 2, except on the second screen select the option to install “A worker node in an existing cluster”. In order to complete the installation you will be required to provide the network address of the coordinator server.

Next Steps
Now that the cluster is up and running you will need to configure it for your application. This configuration will vary significantly from application to application and often requires a large amount of testing and tuning to get right. I will address several of these topics in future postings, including:

  • Graph Partitioning
  • Distributed Graphs
  • Data Replication
  • SQL Server Replication
  • Data Load Jobs

Wednesday, September 15, 2010

Working with large RDF files in the Model Manager

Often it is necessary to load and query a large RDF file to understand the data you are working with. The Model Manager is a good tool for this but has some limitations that I will describe in this posting.

In-Memory Graphs are intended to be used for short-term storage of RDF data while performing some application processing. They are not intended to be used as a persistent store of RDF data. In the case of the Model Manager, any data imported into an In-Memory graph is stored in the project file. A Model Manager project file is primarily for storing the configuration of a model. It is not a database and is poorly suited to storing large amounts of RDF data.

We have two database products that are very good at storing large amounts of RDF data. These products are called Semantics.Server and Semantics.Datacenter.

The links below provide information about both of these products.

http://www.intellidimension.com/products/semantics-server/

http://www.intellidimension.com/products/semantics-datacenter/

We have two instructional videos that show how to set up, load and query RDF data using each (see links below).

http://www.intellidimension.com/developers/videos/v2.aspx

http://www.intellidimension.com/developers/videos/v3.aspx

If you do not want to use one of our database server products, there is a more efficient way to work with large RDF data files using In-Memory graphs in the Model Manager. You can "Import" an RDF file into an In-Memory graph and then save it as an image file. This is done by creating an image file or read-only image file from the menu on the In-Memory graph. These files have a very compact storage format and can be loaded much more quickly than parsing an RDF file via the import menu. If you do not intend to modify any of the data, I suggest you use a read-only image file.

Now here is the IMPORTANT step. Drop the graph before saving your project so its contents are not saved to the project file. We are working on a better solution for this that will be available in a future release.

The next time you open your project file in the Model Manager, just create a new In-Memory graph and load the image file using the menu on that graph. This will load much faster than an RDF file because the file does not need to be parsed and can be read directly into memory. Of course, you will still need enough memory on your local machine to hold the entire contents of the image file, but this will be less than the memory that is needed to parse the RDF file or load the data from the Model Manager project file.

Tuesday, September 14, 2010

Registering Custom SPARQL Functions

The Semantics Platform allows custom SPARQL functions to be registered with a client model. Once a function has been registered it can be used in a SPARQL query using the URI it was registered with. A custom SPARQL function can be implemented as a static method on a .NET class that has the proper delegate type. The Semantics Platform enables all the custom SPARQL functions in an assembly to be loaded provided they are declared with the proper attributes.

The example below shows a custom SPARQL function implemented in C# that supports the delegate type (ClientFunctionCall). The function calculates the area of a circle based on a single argument that represents the radius. The function checks that the proper number of arguments is provided. It calculates the area and returns it as the result argument. If the function is called with the result argument already set to a value, then that value must be compared to the calculated area to see if the expression evaluates to true. The function should return false if it fails to generate a value or the comparison fails.

[ClientFunctionAttribute(
    "http://example.org/fn/areaOfCircle", 
    "[bf]b", 
    Deterministic=true)]
public static bool AreaOfCircleFunction(
    FunctionCallParams funcParams)
{
    TableRow funcArgValues = funcParams.ArgRow;

    // Make sure we have exactly 2 arguments in 
    // our arguments collection.
    //
    // index 0: The result value
    // index 1: The radius
    if (funcArgValues.Count == 2 && 
        funcArgValues[1] != null && 
        funcArgValues[1] is RdfLiteral && 
        ((RdfLiteral)funcArgValues[1]).IsNumeric)
    {
        // A = PI * R^2
        double area = Math.PI * Math.Pow(
            (double)(RdfLiteral)funcArgValues[1], 2);

        RdfLiteral resultValue = area;

        // If a result value was provided (index 0) 
        // then test for equality.
        if (RdfValue.IsNull(funcArgValues[0]) || 
            funcArgValues[0] == resultValue)
        {
            funcArgValues[0] = resultValue;
            return true;
        }
    }

    return false; //evaluated as false
}

Notice that the method is declared with an attribute called ClientFunctionAttribute. This attribute requires a URI to identify the SPARQL function and a binding pattern. A binding pattern is a form of regular expression where ‘b’ represents a bound argument and ‘f’ represents an unbound (free) argument. In the example above the binding pattern ‘[bf]b’ indicates the result argument can be bound or free and the first argument must be bound. The attribute also sets the property Deterministic to true to indicate it is a deterministic function.
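To illustrate how a binding pattern such as ‘[bf]b’ can be read, the sketch below treats it as an ordinary .NET regular expression matched against one character per argument, with the result argument first. This is purely an assumption for illustration, not a description of how the Semantics Platform evaluates binding patterns internally, and the IsCallAllowed helper is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption for illustration: the pattern is matched as an ordinary regular
// expression against one character per argument, result argument first
// ('b' = bound, 'f' = free).
static bool IsCallAllowed(string bindingPattern, string argumentBindings)
{
    // Anchor the pattern so the whole argument list must match.
    return Regex.IsMatch(argumentBindings, "^(?:" + bindingPattern + ")$");
}

const string pattern = "[bf]b";

Console.WriteLine(IsCallAllowed(pattern, "fb")); // prints True: free result, bound radius
Console.WriteLine(IsCallAllowed(pattern, "bb")); // prints True: bound result, bound radius
Console.WriteLine(IsCallAllowed(pattern, "ff")); // prints False: the radius must be bound
```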

One way to register the custom SPARQL function is using the Model Manager. Open a model, new or existing, right-click on the Functions folder to display the menu, and select ‘Load Assembly…’.

image

This will display the dialog box shown above. Simply browse and select the assembly file that contains the custom SPARQL function(s). The dialog will display all the methods found in the assembly that are declared with the proper attribute and delegate type. You may select specific SPARQL functions to load or just load them all.

Once loaded they may be used in a SPARQL query using the Model Manager query window as shown below. The function is called using the URI it was registered with.

image

The custom functions assembly can also be loaded via the Semantics Platform .NET API using the helper class ClientFunctionLoader. This class will create the required class for registering a SPARQL function with an instance of a ClientModel. The C# code below provides an example of its use.

ClientModel model = new ClientModel();

ClientFunctionLoader loader = new ClientFunctionLoader(
    "c:\\SPARQL\\CustomFunctions.dll");

foreach (RdfUri uri in loader.FunctionUris)
{
    model.RegisterFunction(uri, loader.CreateFunction(uri));
}

This is the easiest way to register a custom SPARQL function.