1 of 34

Open Content

Introduction

Open Content is the content repository of the Creation and Presentation universe.

This book is intended for anyone managing or integrating with Open Content. If you're new to Open Content, we recommend starting with the overview. If you're a developer, feel free to jump straight to the API reference.

We urge you to reach out to us at support@infomaker.se if you have any questions. Certain sections are still incomplete, and in other sections we have yet to define well documented best practices.

Overview

Open Content: Make your content available for both your users, developers and readers

What is Open Content?

If you are thinking about creating your own headless CMS, Open Content will fit right in, and will solve several tedious parts of your journey ahead – storage, API’s, scalability, authentication and indexing just to name a few.

Open Content is a handy toolbox: we use it in our own solutions, for example as content backend for our Digital Writer and Newsroom apps as well as powering the Naviga web presentation layer. We also use it in our XLibris archive solution. Our customers uses it to power in-house built presentation solutions.

Together with the Naviga Creation and Presentation tools, Open Content delivers a standardised easy-to-maintain setup. You can also use Open Content as a content agnostic storage and search engine for digital content.

Using Amazon S3 as the main storage means in theory unlimited capacity. XML metadata files are used to describe the uploaded content. Properties are defined and extracted from the metadata using XPATH 2.0 expressions.

Indexing is done using Solr, an open source enterprise search platform built on Apache Lucene™, making your content accessible for any purpose.

Different content types (for example articles, images, lists, graphics) are separated and has their own specific properties setup. Relations between content items can be easily created, minimising the amount of requests needed to fetch the content.

We offer a standard OC setup for both content production as well as presentation, built on best practices. The standard setups are used with the Naviga Creation and Presentation Platforms.

What is it not?

It’s not a video- or streaming platform. If you want to store and edit streamed content, we recommend to use a specialised platform serving that purpose, like Flowplayer, Youplay or Youtube. It may be convenient to have access to such content within Open Content and then it’s just to add those objects and a subset of metadata to the Open Content as well, with a link to the original source.

Key features

The developer friendly availability platform A well documented and flexible platform that makes all content available, all the time.

The backend for your headless CMS A headless CMS without a content repository is like an electric car without batteries. Instead of building batteries build your chassis.

Built for Amazon AWS Run Open Content in AWS, then we can handle upgrades and changes with zero downtime with unlimited storage and backup possibilities.

Integrated to Naviga Content solutions Works out-of-the-box with solutions such as Newspilot, Digital Writer, Dashboard and Naviga web.

API’s for everything Use our user interfaces for admin and search, or use the OC REST API:s. Regardless of approach, it’s all open for integration.

Reliable backend Spend less time on server issues and let us manage the hosting. Open Content supports a range of different setups, from a small single-node setup to large, clustered, high availability setups.

Scalable to suit your needs Open Content support both SolR, frontend API and indexing scaling. Open Content scales depending on your needs (and wallet).

Proven solution Used daily by thousands of Creation users, as well as powering hundreds of apps and sites all over the world.

Content Types Open Content configuration makes it possible to group content into Content Type (typically : Article, Image, Page, Concept, Job, Planning items, Lists, Packages). We have a standardised konfiguration for all tools in the Creation suite.

Meta data standards

OC Concepts is an entire metadata universe – all stored and made available in Open Content

OC Concepts is a metadata structure, built around the IPTC NewsMLG2 standard. One of the most important parts of that is of course how to use it. For the editor, the developer as well as the end user. All concepts are stored and made available in Open Content.

In our view, metadata like categories and tags are not just text strings. Instead, each metadata is an object – each with a unique id, name and its own set of metadata and links.

Like an author. It could be just a name, But when you think of it as an object with a unique id, first name, last name, email, phone, description, avatar image, high res image and links things get really powerful.

These can be shown in your frontend if you want to, for example when showing articles for a specific category on a search page could then show the long description or image for that category.

Examples of Concepts:

Author
Category
Persons
Organisations
Topics
Places (poI:s or geo areas)
Story
Functional tags

The concepts are administered using our Dashboard application, your journalists use Digital Writer to choose the right concepts, and Everyware and the App Platform will show and let the user follow selected topics or geo areas.

Architecture

The architecture describes the upcoming 3.0 version of Open Content.

For info about older versions, please look at the release documentation at https://wiki.infomaker.se/display/OCS/Open+Content

Scaling Open Content doesn’t mean the same thing for all our customers. Some need a massive index, others have need for a smaller index but lots of traffic or API calls. Regardless of your needs, we are confident we will solve them.

The Open Content stack consists of several parts, all running in the Amazon cloud.

Load Balancer. The OC stack uses the standard Amazon application load balancers.
OC API. Is the REST API for queries, read and write, as well as the OC Admin API. Runs in ECS and scales horizontally.
S3 is the storage where all content items are stored.
RDS is the database where we store a selection of meta data.
SolrCloud is the Solr cluster that executes the queries, manages the indexes etc. It's deployed in a EKS cluster, from 1 Solr node and up. We always recommend at least 2 Solr nodes for redundancy.
Binlog is created by the RDS, and contains all modifications to the OC content.
Kafka is a streaming platform where we persist all changes to the content item. It also powers the Indexer services. We use the Amazon managed Kafka service.
The Indexer is the part that extracts the metadata to index and perform the index updates in Solr. The updates are then committed to the index by Solr. The indexer is running in ECS containers and scales horizontally.
The Notifier is used to create event-driven workflows.

We always recommend a multi-AZ setup for all parts of the stack. That means the Open Content stack is running on multiple datacenters in parallel, enabling high availability.

For Open Content pre 3.0, you'll need to use the master-satellite mechanism (see below) to reach multi-AZ redundancy.

When using Open Content as a creation backend, we always use a Satellite Open Content for the presentation layers. The production and presentation is totally separated each of them can be configured and scaled in the appropriate way.

We recommend to use the Naviga standard configuration for Creation and Presentation. They are both versioned and maintained by Naviga, and are updated when needed to be in sync with the Naviga Creation and Presentation tools.

Master - satellite In complex environments setting up multiple Open Content Satellites might be a suitable way to scale. All content is stored in an Open Content Master setup, and predefined replication rules make sure the correct content is available in each Satellite. This does not require additional storage, they are setup as read-only OC’s, reading the content from the same S3 bucket, saving both time and money. As content can differ each Satellite maintains its own index.

Open Content API

REST API for content There is a Swagger REST API available for adding, modifying and deleting content as well as performing various kind queries. The query syntax is following the standard Solr syntax, but is also adding a set of extra comfort functions , like related content.

REST API for admin There is also REST API available for all kind of administration issues, like index, properties, extraction and storage management.

Event log API The events for the last 30 days are recorded and stored in the event log, accessible using the event log api.

Read more about the Open Content Rest API and you could also even try it yourself.

Onboarding We offer onboarding, on location or remote, for Open Content developers to get the most out of the available tools and solutions.

Headless and end-to-end CMS workflows

The Naviga content could act as a standard end-to-end solution. You use our standard authoring setup in combo with our solutions for presentation on the web and in mobile apps. In that case, we are managing everything from setup, configuration, hosting, support etc. You are still able to interact with the backend, but we recommend to use our more high-level API:s for content creation (like ingestion of content) instead of using the more low level OC API.

You can also use Naviga content solutions as a headless CMS, and build your own presentation layer. In that case, we recommend to use our content distribution API to power your own presentation solution. You may also use the more low level OC REST API to power your presentation layer. The distribution API also offers a cache solution. If you use the OC REST API, you need to add your own cache mechanism between OC and your presentation engine. It's possible to just scale up the read capacity of Open Content, but that will be a quite expensive solution in most cases.

Both solutions uses a separate Open Content for production, and one to power presentation layers. When a content item, like an article, is ready to be published (useable) it's copied to the public content repo by the Replicator service.

Use and explore

How does it work

Any digital material can be stored in Open Content using the Open Content REST-API. Open Content configuration makes it possible to group content into Content Type (typically : Article, Image, Page, Concept, Planning, Lists, Packages). Content from different systems can normalised into the same Content Type.

A Content object (item) consists of a primary file and a metadata file describing the the primary file.

Normally xml-metadata files are used to describe the content uploaded and properties are extracted from the meta data files using XPATH 2.0 expressions.

Open Content is configured with a browser based UI or by YAML-files.

Typical use cases for Open Content are:

Long-time archive with the XLibris search client
Back-end for web and mobile publishing using Naviga web
Content repo for the Content Creation Suite

Open Content Docker

How to run Open Content in Docker on my own computer.

The Docker images for Open Content are primarily for development purposes, not production. So if you are a developer looking for how to start Open Content locally for integration testing or trying things out, then this is for you.

# create a directory where to work, in home oc-lab 
cd 
mkdir oc-lab
cd oc-lab

# Download the zip file from S3 
curl -s https://s3-eu-west-1.amazonaws.com/open-content-artifacts/opencontent-docker-configs.zip  \  
--output opencontent-docker-configs.zip

# Unzip 
unzip opencontent-docker-configs.zip

# Go to directory
cd opencontent-docker-lab

Start Open Content

docker-compose -f docker-compose-lab.yml up --detach

Wait until all containers are downloaded and started. Now there is an empty Open Content without configuration or content.

Logging

docker-compose -f docker-compose-lab.yml logs -f wildfly

Open Content UI

Configuration is done using the admin UI or the admin API. The UI can be found here http://localhost/admin.

Below is the menu for the Open Content admin UI.

Configure Open Content

The first thing that has to be configured is storage. This can either be done in the UI at http://localhost/admin or with this curl command:

curl -u admin:admin -d name=OpenContent -d path=/tmp http://localhost:8080/opencontent/admin/storage

Open Content configuration in this setup is done using a local copy of our Bitbucket repository for configuration. Use the Open Content admin UI to inspect the detailed settings for the different configuration options.

Go to the opencontent-configuration directory where the configure.sh script is

cd ../opencontent-configs/scripts

Configure Open Content for public use

./configure.sh \
http://admin:admin@localhost:8080/opencontent \
public

Configure Open Content for public and app use

./configure.sh \
http://admin:admin@localhost:8080/opencontent \
public-app

Configure Open Content for editorial use

./configure.sh \
http://admin:admin@localhost:8080/opencontent \
editorial

Activation of the configuration

curl -u admin:admin \
-X POST "http://localhost:8080/opencontent/admin/configuration/activate" \
-H "accept: */*" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "reason=configured from script&name=$(whoami)"

OC REST API

Link to the Swagger documentation

Open Content Swagger REST-API documentation

The link above assumes that you are running Open Content locally. The api docs can be found at http://localhost:8080/opencontent/apidocs/

Please note that you need to understand the NewsML document format used in the ution API is not yet available for 3rd parties, but will be during 2020.

Event logs

An overview of the eventlog and contentlog endpoints

The event log tells you what has happened after a last known event. Depending on your use-case you can either process the eventlog from the beginning (it keeps a history of one month), or start at the last event. It's useful to process all retained events if you want to prepopulate a cache, but if you just need it for invalidation of a cache that you start cold and build ad-hoc it makes more sense to start with the last event.

A request to the eventlog looks like this: GEThttps://oc.tryout.infomaker.io:8443/opencontent/eventlog If called without any query parameters you get events from the start of the log:

{
  "events": [
    {
      "id": 406362,
      "uuid": "f41c8f07-5992-5161-8ccf-c2347ee1c59c",
      "eventType": "ADD",
      "created": "2019-12-09T12:23:44.000Z",
      "content": {
        "uuid": "f41c8f07-5992-5161-8ccf-c2347ee1c59c",
        "version": 1,
        "created": "2019-12-09T12:23:44.000Z",
        "source": null,
        "contentType": "Image",
        "batch": false
      }
    },
    {
      "id": 406363,
      "uuid": "fc7710c3-1c9d-4df0-9a5f-c524774ef7de",
      "eventType": "ADD",
      "created": "2019-12-10T07:13:39.000Z",
      "content": {
        "uuid": "fc7710c3-1c9d-4df0-9a5f-c524774ef7de",
        "version": 1,
        "created": "2019-12-10T07:13:39.000Z",
        "source": null,
        "contentType": "Article",
        "batch": false
      }
    },
    ...
  ]
}

If you pass in a negative value, like so GET https://oc.tryout.infomaker.io:8443/opencontent/eventlog?event=-2, you get the last -N events in the log.

The id attribute in the events can be used to paginate though the eventlog. So if we have processed events up until 406374 we would ask the eventlog for all events after it, like so GET https://oc.tryout.infomaker.io:8443/opencontent/eventlog?event=406374:

{
  "events": [
    {
      "id": 406375,
      "uuid": "b73de3be-8b94-4c0f-9f6e-b058d077805f",
      "eventType": "ADD",
      "created": "2019-12-19T12:09:35.000Z",
      "content": {
        "uuid": "b73de3be-8b94-4c0f-9f6e-b058d077805f",
        "version": 1,
        "created": "2019-12-19T12:09:35.000Z",
        "source": null,
        "contentType": "Article",
        "batch": false
      }
    },
    {
      "id": 406376,
      "uuid": "f9f87e70-a0d7-4bc8-b2d4-5fab82760839",
      "eventType": "UPDATE",
      "created": "2019-12-19T12:11:13.000Z",
      "content": {
        "uuid": "f9f87e70-a0d7-4bc8-b2d4-5fab82760839",
        "version": 6,
        "created": "2019-12-19T12:00:34.000Z",
        "source": null,
        "contentType": "Article",
        "batch": false
      }
    },
    ...
  ]
}

To fetch the updated object the normal objects endpoint is used GET https://oc.tryout.infomaker.io:8443/opencontent/objects/f9f87e70-a0d7-4bc8-b2d4-5fab82760839?version=6

<?xml version="1.0" encoding="UTF-8"?>
<newsItem conformance="power" guid="f9f87e70-a0d7-4bc8-b2d4-5fab82760839" standard="NewsML-G2" standardversion="2.20" version="1"
  xmlns="http://iptc.org/std/nar/2006-10-01/">
  <catalogRef href="http://www.iptc.org/std/catalog/catalog.IPTC-G2-Standards_27.xml"/>
  <catalogRef href="http://infomaker.se/spec/catalog/catalog.infomaker.g2.1_0.xml"/>
  <itemMeta>
    <itemClass qcode="ninat:text"/>
    <provider literal="InfomakerConfig"/>
    <versionCreated>2019-12-19T12:11:13Z</versionCreated>
    <firstCreated>2019-12-19T12:00:34Z</firstCreated>
    <pubStatus qcode="imext:draft"/>
    <title>kkkk</title>
    <itemMetaExtProperty type="imext:type" value="x-im/article"/>
    <itemMetaExtProperty type="imext:haspublishedversion" value="false"/>
    <links
      xmlns="http://www.infomaker.se/newsml/1.0">
      <link rel="creator" title="Tryout Tryout" type="x-imid/user" uri="imid://user/sub/b911d79b-42c9-48cd-85cf-be5b1824a1fc"/>
      <link rel="subject" title="Accident and emergency incident" type="x-im/category" uuid="5e5e0695-3a21-47e9-87d3-f6bfa5791e46">
...

To get a log that covers all content in OC you must use the contentlog instead (https://oc.tryout.infomaker.io:8443/opencontent/contentlog). It comes with some other trade-offs though as it only contains the last event for every object and it doesn't publish any delete events. That means that it is useful for bootstrapping f.ex. a cache with a full data set, but not very useful for invalidation.

The contentlog event id is not consistent when migrating to a new version/install of OC either (it gets "compacted"), so depending on it as state for long-running tasks is not recommended.

Replication

Replication is an Open Content Service responsible for copying items between different Open Content instances

Open Content Replicator is a module that allows the Open Content to replicate content to an Open Content Satellite. The OC Satellite works as a "read only" and can store anything from all of the information in the Master to a part of it.

The Replicator can be run automatically in near real time and/or be triggered manually in batches.

For example, a satellite can consist of content with a specific setup of meta data (products, categories etc.) and another satellite can have a different content

Open Content Replicator is in this environment:

http://localhost:8180/replication

The replicator can be used to replicate objects from one Open Content to another Open Content.

Different types of replication exists:

Full replication; replicate objects using a query
Incremental replication; replicate object on incremental re-indexing event. Uses RabbitMQ Indexer needs to be configured for this.
- used for replication between editorial and public Open Content.
Batch replication; replicate objects on batch re-indexing event. Uses RabbitMQ Indexer needs to be configured for this.
- almost never used
Partial replication; updates target Open Content for filter changes
- not used
Event-log replication; polls event-log or content-log

Open Content Notifier

A module to create an event driven workflow. It enables you to listen to changes for specific queries and get notified when the answer for that query is changed.

Notifier can be used to release cache, or send notifications to Live Content Cloud. Possible to use Notifier for any http POST receiving server. Register the url towards which the notification details should be sent.

Documentation about the Open Content Notifier can be found here:

https://naviga-hub.atlassian.net/wiki/spaces/NP/pages/6558221758/Open+Content+Notifier+English

New JSON API:s - CCA and Duppy

In parallel, we are working on a modern, easy-to-consume JSON based version of the document format. That format will successively replace the NewsML XML format. More info about the new Naviga JSON Document format is found https://app.gitbook.com/@infomaker/s/document-format-v2/.

To build future-proof editorial integrations we are also developing the Content Creation API (CCA) to simplify and support the new Naviga JSON format. We strongly recommend to use CCA for integration with the editorial content repo. The CCA is not yet available for 3rd parties, but will be during 2020.

To build future-proof presentation integrations we are also developing the Content Distribution API ("Duppy") to simplify and support the new Naviga JSON format. We strongly recommend to use the Distribution API for integration with presentation solutions. The distribution API will be available during 2020.

Terminology: Other backend services

We often connect Open Content via the event log or the Notifier to create even-driven workflows. Here are some of the parts we are using internally.

Please note that those services are not included in the Open Content delivery, nor are they currently available for 3rd parties. We are explaining them here to feed ideas and explain how our infra structure works.

Live Content Cloud

We use our Live Content Cloud service for real time information to users of our Dashboard and Web and App Platform.

Everyone expects nothing less than information in near real time. Live Content Cloud is used when you want to push data to subscribers. Used in our App Platform for live updates of already downloaded content, or personalized push notifications based on OC Concepts.

Query Streamer

Query Streamer is a clouded “subscription service” for tools and presentation clients. You are able to set up a stream, a question like “sport content” and then subscribe to changes to the subscribed content stream and get notified in near real time. When a new item matches, Query Streamer notifies the subscriber/s.

QS uses Elastic Search Perculation in a cluster config as an Amazon Service. Subscriptions are persisted in QS.

Infocaster

Infocaster is the part that distributes the output from the Query Streamer (or other sources) to end subscribers. Written in node, It runs stateless at AWS, as scalable Docker instances with a load balancer. A message is sent as push notifications (SNS) or event via a SQS que.

Lab: upload

Upload of a content newsitem to Open Content

This section will show how an upload to Open Content is performed

To be able to do the exercises you may to prepare your system. You need an Open Content Server to perform upload request towards.
You need a bash terminal for execution of the scripts
- The prepare Windows section explains how to enable a bash terminal for Windows
Need to have following installed;
- aws cli
- imagemagick
- unzip
- jq

The examples will use show how to upload all objects types referenced from an Article.

Concepts
Images
Article

The article has relations to 6 different concepts and 3 images. Certain conventions must be known before uploading these to an Open Content.

UI and REST-API documentation for Open Content

Search client (XLibris) http://localhost/client
Admin client, http://localhost/admin
Rest-Api Swagger documentation, http://localhost:8080/opencontent/apidocs/

Use the links above to verify and study what happens in Open Content during the exercises below!

Study the scripts and try to understand what they do, modify files, script and execute them. If not expected behaviour, start a discussion.

Curl for object upload

Upload can be done by using the curl command

Upload is a multipart request towards Open Content /opencontent/objectupload endpoint.

Curl

Below is how a curl request for upload can look like:

/usr/bin/curl \
http://<server>:8080/opencontent/objectupload \
-u admin:admin \
-F id=<uuid> \
-F batch=<boolean> \
-F file=<file> \
-F file-mimetype=<mimetype> \ 
-F <file>=@<file> \
-F metadata=<metadata> \
-F metadata-mimetype=npexchange/article \
-F <metadata>=@<metadata> \
-F metadata2=<metadata2> \
-F metadata2-mimetype=npexchange/article \
-F <metadata2>=@<metadata2> \
-F preview=<preview>.jpg \
-F preview-mimetype=image/jpg \ 
-F <preview>.jpg=@<preview>.jpg \
-F thumb=<thumb>.jpg \
-F thumb-mimetype=image/jpg \ 
-F <thumb>.jpg=@<thumb>.jpg \
-F source=<source>

(*) Is used by the Open Content Replicator when performing replications towards a read only Open Content server

Environment

Update of the settings file

The exercises can be downloaded from

# In terminal do 
cd ~/oc-lab
mkdir lab-newsitem
cd lab-newsitem
curl -s https://s3-eu-west-1.amazonaws.com/open-content-artifacts/lab-newsitem.zip --output lab-newsitem.zip

# this will download the lab-newsitem.zip, unzip it 
unzip lab-newsitem.zip

Structure of lab-newsitem dir

lab-newsitem/
├── 0-config
│   ├── configure.sh
│   └── lab-newsml-config.yml
├── 1-concept
│   ├── 29889da3-e930-4846-a12b-096508e1054d
│   ├── 8c7437ce-a7ca-414d-8bfc-7bf2d1054fc3
│   ├── 9197a3ea-9624-404a-aef5-4d80eaadc99f
│   ├── b7399f0c-fb3d-4a4f-b849-9935a77d9512
│   ├── db09e859-43d4-42f8-a6ca-c810b653ec6a
│   ├── fb5911fa-b97f-436e-83f7-de7f7a203ea9
│   ├── upload-concepts.sh
│   └── uuids
├── 2-upload-image
│   ├── image-template.xml
│   ├── one.jpg
│   ├── one.jpg.uuid
│   ├── three.jpg
│   ├── three.jpg.uuid
│   ├── two.jpg
│   ├── two.jpg.uuid
│   └── upload-image.sh
├── 3-upload-article
│   ├── article.xml
│   └── upload-article.sh
├── 4-search
│   └── readme.md
├── 5-delete
│   ├── delete-mine.sh
│   └── delete.sh
├── 6-event-sourcing
│   └── listen.sh
├── build.sh
├── lab-newsitem.zip
├── readme.md
└── settings

Settings file

The settings file holds information about the host, user and password for the Open Content to be used with the script in exercises directories. Update the settings file to the Open Content you will use.

export OpenContentIp=127.0.0.1
export pemfile=
OC_USER=
OC_PWD=

Lab 0: Configuration

Upload and activate the configuration for an Open Content using Newsitem

This lab configures Open Content using editorial standard config

The script ./configure.sh in ~/oc-lab/opencontent-configs will configure Open Content:

cd ~/oc-lab/opencontent-configs/scripts

./configure.sh \
http://admin:admin@localhost:8080/opencontent \
editorial

Verify the configuration in Open Content admin UI (http://localhost/admin)

Activate the config either using the admin UI or curl below:

curl -u admin:admin \
-X POST "http://localhost:8080/opencontent/admin/configuration/activate" \
-H "accept: */*" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "reason=configured from script&name=$(whoami)"

Examine the + Configuration menu and more in Open Content admin UI

History
Compare (remove something)
Import/Export
Undo configuration

Lab 1: Concept upload

This exercise will upload 6 concepts to Open Content. These Concept are referenced from the article which will be uploaded in lab 3

The script ./upload-concepts.sh will upload 6 concepts to Open Content using a curl multipart POST request.

For more details how this is done take a look at the script:

Lab 2: Image upload

Image upload to Open Content calculating the correct filename for proper use when Open Content is used as the content storage for Writer Articles.

This exercise shows :

Calculate the filename when image is used by writer
- Using openssl
Create preview and thumb to be used by Open Content
Create xml metadata file
Upload of Image with preview, thumb and metadata file

For more details see the ./upload-image.sh file.

When uploading images for Digital Writer the image file needs to be upload to an internal S3 bucket and also be copied to an external S3 bucket with a calculated filename.

Lab 3: Article Upload

The exercise will upload the Article holding reference to previous uploaded objects.

This exercise shows :

Upload of an article with correct filename, script will insert the uuid and filename for the 3 images

Lab 4: Search

Learn how to perform search towards Open Content /search/ endpoint

A small tool for Open Content API test can be used for easier learning of Open Content search API. The UI can be accessed from the url http://localhost:8800 when Open Content docker-compose is started. T

This tool purpose is to make it easier to design search questions towards Open Content. It is not used in production environments.

The configuration in these exercises uses nested properties and assumes that you have the content from lab 1-3 uploaded.

To perform a simple search without any arguments :

curl -s -u admin:admin "http://localhost:8080/opencontent/search?" | jq .

Get only the uuid property for each hit

curl -s -u admin:admin "http://localhost:8080/opencontent/search?\
properties=uuid" | jq .

Get only Articles and the uuid

curl -s -u admin:admin "http://localhost:8080/opencontent/search?\
q=contenttype:Article&\
properties=uuid"

Get Articles and the concept names

curl --globoff -s -u admin:admin "http://localhost:8080/opencontent/search?\
q=contenttype:Article&\
properties=uuid,ConceptRelations[ConceptName]" | jq .

Get Articles and only Weekend Concept

curl -s --globoff -u admin:admin "http://localhost:8080/opencontent/search?\
q=contenttype:Article&\
properties=uuid,ConceptRelations[ConceptName]&\
filters=ConceptRelations(q=ConceptName:Weekend)" | jq .

Lab 5: Delete objects

Remove all objects uploaded so that it can be done again

This exercise shows how to delete objects in Open Content.

./delete.sh [uuid] will delete object with the specified uuid
./delete-mine.sh will delete objects with source set to lab-$(whoami)
- the script in this exercise sets the source to lab-$(whoami)

Lab 6: Event sourcing using event log

Show how to use the Open Content event log

In this exercise you will start a script which will poll the Open Content event log every 5th second. If any events are found the script will print information of the event.

The script will persist the last event to a file (lastevent) which holds the last event id processed by the script. When started next time and lastevent file exists the script will start processing events with id larger than the lastevent processed . This means that even if the listener is off. It will continue from last event next time the listener is started. This way no event is missed.

This is how to get the /eventlog/ endpoint response:

Response

Get the last event id use event=-1

In folder 6-event-sourcingis an example of a bash script which will poll the event log every 5 seconds and print information about what is happening. To try this :

Now the script polls the event log every 5 second and print out the events.

Releases

Release documentation and admin guides

All info about every Open Content version, release notes, upgrade info, admin guides etc.

Open Content 2.3

Wildfly is deployed as a container in Amazon ECS

In 2.3 the Wildfly service is available as a container image. When installing Open Content in Amazon AWS the Wildfly Service is deployed on Amazon ECS. The Wildfly instances belong to a private subnet and will not get public IPs. Access to the API has to be done through the public load balancer.

XLibris and OC Admin are also available as container images. They are using the latest versions of PHP and Apache httpd. When using Amazon ECS XLibris and OC Admin are deployed on their own EC2 instances without public IPs. The public loadbalancer is the only way to access the services.

API Cache

When Open Content is deployed on Amazon AWS an API cache can be put in front of Wildfly. It is running on port 9999 while Wildfly is running on 8080. When a client wants to use the cache it must use port 9999. Because both ports are open it's always possible to directly talk to Wildfly even if the cache is enabled.

Validation of properties on add and update

Until now there has been no validation of properties when adding content. For example if an article is uploaded with a property "TextLength" of type Integer but contains a string instead, then the upload still succeeds.

In 2.3 properties are validated according to their type. Validation is performed on adds and updates of content and if it fails it results in HTTP 400 Bad Request.

New property type called WKT (Well Known Text)

Until now latitude, longitude and spatial geometries had to be of property type "String". There has been no validation of the string until it has been indexed into Solr. When Solr refuses to accept the invalid WKT an entry has been added to the indexer error log. But the indexing happens in the background and the user who uploaded the content may never discover that something was wrong with the content that was uploaded.

In 2.3 there is a new property type called "WKT". This property is validated when content is either added or updated. If the property contains an invalid WKT string then the add or update will return HTTP 400 Bad Request.

Failing extractors are no longer suppressed

An XPath can be valid but throw an error anyway depending on the text it's applied on. Until now Open Content has silently suppressed these errors. A result of this is that content can be indexed while one or two properties are missing from the index.

In 2.3 no extractors are suppressed. If an XPath fails at content upload time the upload will respond with a HTTP 400 Bad Request. If an XPath fails at indexing time the content will not get indexed at all.

Fall back to a default dynamic path / if none is configured

Until now it has been mandatory to configure dynamic path for the storage.

In 2.3 the absence of a dynamic path will lead to content being placed in the root path. So for example if S3 is used for storage and no dynamic path is configured, then the uuid will be be in the root of the S3 bucket with no prefix added.

Upgrade to Solr 7.7

Open Content has used version 5.5 of Solr for a long time. In 2.3 it has been upgraded to Solr 7.7. This has the implication that all content has to be reindexed.

A backward incompatible change is that it is not possible to do a query like Pubdate:* to get all content that has a Pubdate. That query needs to be changed to Pubdate:[* TO *]. The reason is that Solr has deprecated TrieDateField for dates so Open Content is using DatePointField instead. This is true for index fields of type date, int, long, float and double.

Open Content does not deliver a clustered Solr out of the box yet, but it has been prepared by using SolrCloud and allowing multiple configured ZooKeeper.

Update Swagger apidocs to OpenAPI 3.0

The Swagger support has been reimplemented from scratch and has a couple of improvements.

The specification is updated from Swagger 1.2 to OpenAPI 3.0.

OpenAPI specification for Open Content is now generated at release time and is a static file so loading Swagger is much faster.

Swagger UI is updated to the latest version which is why the look and colours have changed.

Many REST API documentation errors have been fixed.

Upgrade to Java 11

Until now Wildfly, Indexer, Notifier and Replicator have been using Oracle's distribution of OpenJDK 8.

In 2.3 all Java based services use AdoptOpenJDK's distribution of OpenJDK 11.

Upgrade to Wildfly 15

Has support for Java 11.

Upgrade Saxon and get support for XPath 3.1

In the effort to keep as many 3rd party libraries as up-to-date as possible Saxon has also been updated. This means that XPath and XSLT extractors now do support XPath 3.1 and XSLT 3.0.

Nested properties API call may result in too long Solr GET request

When using nested properties Wildfly makes HTTP GET requests to Solr. In some circumstances the URL length hits a limit and Solr refuses the requests. In order to get around this Wildfly now sends the requests to Solr using HTTP POST instead.

Sortable flag on index fields is removed

Sorting in Solr is memory intensive. The sortable flag has been a not-perfect guard against out of memory errors in Solr. By not allowing to sort on just any arbitrary number of fields there has been some sort of protection.

In 2.3 docValues have been enabled for many index field types. This means memory consumption is lower when sorting on these index fields. Therefore the not-perfect guard (sortable flag) has been removed.

Open Content will ignore the sortable flag when it reads the old configuration and the next time the configuration is activated the sortable flag will not be in oc.yaml anymore.

Support for HTTPS in replicator

Until now the replicator has only been able to replicate using HTTP. Now in Open Content 2.3 the replicator can replicate content using HTTPS.

Deprecated and will be removed in 3.0

In a microservice world we want to split out the search and suggest functionality to its own service. The Wildfly service will not use Solr anymore. When using the Wildfly API you will know that you get the source of truth. When using the Search API you know that only Solr will be asked for data. Because Wildfly will not ask an eventually consistent search engine for data anymore some small parts of Wildfly will be backwards incompatible.

New in OC 3.0 (draft)

Open Content 3.0 is a major new version. It's not yet released, but is planned for release mid 2020.

The 3.0 version of Open Content is a major upcoming release. A lot of effort, on all levels, have been put into the areas of increased performance, scalability and availability. Many pieces have been optimised, rewritten or redesigned. The APIs are still the same, except for a few functions that has been deprecated.

SolrCloud support Running one single instance of Solr means that you have one single index running on one Solr node. Even if we have quick restore processes, that’s not a redundant solution. With the 3.0 version we have standardized a multi node SolrCloud setup as an option to the standard setup.

The SolrCloud setup runs in a Kubernetes (https://kubernetes.io/) cluster, starting with 3 Solr nodes plus the necessary orchestration mechanisms. The Solr version used in the 3.0 version is 8.x.x.

Support for multiple indexers Open Content versions previous to 3.0 supported one single indexer process. OC 3.0 allows you to deploy multiple indexers working in parallel. The indexer is no longer a single point of failure. The new indexer is also faster and running multiple indexers scales the indexing performance.

We have also offloaded a lot of work from the OC API fronts, like moving the property extraction to the indexer process. The OC API does not share the database with the indexer anymore. This increases the OC API performance in general and also provides a more predictable performance.

Apache Kafka The Kafka streaming platform (https://kafka.apache.org/) is now a part of the Open Content solution. In addition to the classic Open Content event log, all commits (add, update, delete) are inserted into the Kafka log. Kafka is used internally to power the new indexer processes as well as the upcoming Audit Trail module for the Naviga Writer and Dashboard. The complete content item is stored in Kafka (excluding binary artefacts).

Increased upload performance Bottlenecks in the upload process has been identified, fixed and optimised to get the highest possible upload throughput. Upload of content now scales more or less linear with the amount of OC API fronts used.

Increased read performance We have made a set of query and read optimizations and eliminated a couple of bottlenecks. The performance when querying for nested properties is substantially increased. Resolving nested properties is now parallelized to maximize the utilization of the hardware. The number of Solr requests needed for resolving nested properties is also substantially decreased. Using the new SolrCloud multi-node setup is also a good way to scale querying performance adding more Solr nodes. Both the OC API as well as the SolrCloud cluster now scales almost linear in read intensive setups.

Increased index update performance Using Solr sharding we are able to split indexes in smaller pieces and thereby increasing the commit capacity. The indexing process itself has also been re-designed to be more streamlined and efficient. We are now also able to run multiple indexers in parallel to boost the indexing performance.

AWS deployment Open Content 3.0 requires to be deployed in the AWS cloud. The OC 3.0 setup uses AWS services and deployment templates designed for AWS. Note: On premise installations are not supported (on premise installation is possible with Open Content up to 2.2.3).

Metrics Prometheus (https://prometheus.io/) is supported in the new 3.0 setup. The OC API, SolrCloud cluster, Kafka and the indexer processes all expose metrics that can be graphed and acted on.

Removed features in OC 3.0

Important: To enable a scalable and predictable solution, some old features have been removed:

Property extraction based on relations between content types has been removed. In OC 3.0 version you need to supply all metadata needed for property extraction within the content item itself.
Query time evaluations of XPath expressions has been removed from the Search API. Previously, if the value of a configured property was not indexed, OC would fetch the document and evaluate the XPath for the missing properties before returning the search result to make sure they were always included. In OC 3.0 only what is indexed will be returned in a search result. If you change the properties config, a reindex of the content is needed.
Support for multiple storages and import storage rules have been removed. There can only be one storage.
Identifiers as a feature is removed.
The import metadata rules function is removed.

Default search response properties can’t be configured anymore. The client should always specify what properties it wants in the search response. If the client does not specify any properties all will be returned.

XLibris Archive

Archive everything you ever created or published

All your digitally produced and published content in one place, searchable in a single interface – the XLibris web application.

The XLibris search client lets you easily search for anything you store. Predefined search modules for content types helps the end user find the right material. There is a facetting feature, to help narrow down the result if it is too big.

Searching in XLibris is fast, even in an archive with over 25 million objects. You can use the query syntax in a really simple manner, just like a Google search. But there is also a powerful query language available behind the scenes if you are interested in power searches.

Newspilot integrated with XLibris If you run Newspilot as your editorial platform Open Content / XLibris will work out-of-the-box. Newspilot has automated workflows for archiving articles, images (both published and unpublished), pages in PDF-format as well as the job planning.

Everything is connected – when you find a page, you will instantly see all articles and images published on that page. When you find a job, you will find other content that belong to the same job.

Build your own User Interface and workflows We have customers who have created their own workflow for both importing, searching and using images in Open Content. Archive historical content Scanned newspapers in PDF-format, with a predefined naming standard can also be imported and made searchable in XLibris.

Import content to XLibris

Open Content is used to power the XLibris Archive. When importing content to an Open Content based archive you have two main options:

Convert and migrate all your content items to the Content NewsML format used by all Naviga Creation and Presentation tools. Of course, binary artefacts are more or less just copied to the OC Storage, but all meta data files, articles, meta data etc needs to be migrated. Depending of the quality and format used in your old content that can be a really massive work, or not. Contact us to discuss the scope of your migration. The benefit of doing so is that your content is more future proof, streamlined to one, well known format. Content items can more easily be reused. Standard configuration can be used.
The other option is more like a "copy" of the content to the OC Storage, and then create a configuration that adapts to your content. You still need your articles and meta data side car files in XML format. If you don't have that you need to migrate your content anyway. The advantage with this model is lower migration costs. On the other side, you'll have a more complex, customised configuration. Your content is still in the original format. You'll not be able to reuse content items in the same easy way as form migrated content.

Consult us to discuss what's best in your specific case.