Internal Data Catalog: A Journey to Better Data Management with CKAN

Key facts
Service providers:

Datopian

Client:

Not disclosed due to contractual obligations

Services:
CKAN Consultancy; CKAN Development; CKAN Features; CKAN Hosting & Support; Custom Data Portal; UX/UI Design;
Period:
January 2024 - Present

Brief summary of the project.

Datopian collaborated with a global consulting and engineering company to develop a customized data management portal using CKAN, addressing specific requirements to streamline data accessibility and usability for their global team.

Problem

The Client's in-house data catalog solution was overly complex and technical, making it unsuitable for their users' needs. They required a simpler, user-friendly solution to manage and share datasets efficiently across their organization.

Need

The Client needed a straightforward data catalog to facilitate easy inter-organizational dataset sharing, with features including data access permissions, usage analytics, Azure SSO login, automated Azure user and organization synchronization, and custom dataset metadata schemas.

Solution

Datopian designed and implemented a CKAN-based data portal with a customized UI, integrated Azure SSO, and synchronized user and organization data from Azure to CKAN. The solution included custom metadata schemas, analytics tracking, and an intuitive interface, enhancing data management and accessibility.

Main technologies & tools used
CKAN
Python
JavaScript
PostgreSQL
Redis
Solr
Azure

Context

NOTE: The client cannot be disclosed due to contractual obligations, and the screenshots have been anonymized by changing brand colors and removing the logo.

The Client is a global consulting and engineering business that collaborates with customers, colleagues, and partners to craft sustainable and long-lasting solutions that improve quality of life. It is made up of thousands of people located around the world, with expertise in engineering, architecture, energy, and environment. Their wide reach provides an extensive pool of bright minds available for collaboration, which gives them an edge when solving hard problems.

Their ambitious sustainability goals, including full carbon neutrality within the next 30 years and 100% of revenue coming from projects driving sustainability within the next 5, are evidence of their vision for a better future.

The situation

When the Client first contacted Datopian, they had already attempted to adopt an in-house data catalog solution, but it proved too complicated and technical for their users’ use cases. What they needed was not necessarily a feature-rich solution, but a feature-specific one.

The criteria

At a high level, the main requirement was a simple data catalog—an internal data management solution with easy inter-organizational dataset sharing for thousands of users around the world that could streamline operations and enhance data accessibility.

With that as the base requirement, the additional criteria were:

  • Avoid duplication of data
  • Data access is controlled on a per-user basis (per-user access permissions)
  • Track data and page usage (usage analytics)
  • Azure SSO login
  • Automated syncing of Azure users and organizations to the data portal
  • Custom dataset metadata schema
  • Easy to use for non-technical users

The solution

Datopian worked closely with the Client throughout the implementation process to ensure that everything, down to the smallest features, would not only “get the job done” but also improve the Client's data management as a whole. The CKAN implementation supports enterprise data integration, allowing for consistent data management practices across the Client's global operations.

The UI

A growing number of Datopian’s clients are opting for decoupled frontends to allow for robust customization. To keep things simple, the Client Data Portal does not use one. All theming and UI changes were done within CKAN itself, keeping maintenance and complexity minimal for both the code and the infrastructure.

The Home Page

The site’s home page is straightforward. A fairly standard navigation bar sits at the top (present on every page), with a search field for datasets, and links to “Datasets”, “Client Organizations” (normal CKAN Organizations), “Subjects” (custom Group type), “Geography” (custom Group type), “About”, and a link to the current user’s profile page.

Navigation

Below the navigation bar is a welcome banner with four links, each representing a “Data Type” (a custom schema field) of the portal's datasets. The number next to each type indicates how many datasets of that type the portal contains. Clicking on one of the types redirects to the search page with that type of dataset selected.

Custom Banner

Below the welcome banner, there are two tabs. The first—the default when the page loads—is “Sectors” (another custom dataset metadata schema field), which contains the four sectors that can be associated with a dataset. As expected, clicking on one of the sectors redirects to the search page with results for that sector.

Sectors

Clicking on the second tab will switch to the “Subjects” list, where users can navigate to the search results for a given subject:

Subjects

The Search Page

The Client’s custom group types (“Subjects” and “Geography”) have been added as search filters, along with their custom dataset metadata schema fields (“Sectors” and “Data Types”).
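Because the home page links and the search filters all lead to the same search page, each link is essentially a search URL with a filter applied. The following is a minimal sketch of how such a URL could be assembled, assuming that filters are passed as query parameters on CKAN's `/dataset/` search route. The base URL and the parameter names `sector` and `data_type` are hypothetical, since the article does not give the portal's actual field names.

```python
from urllib.parse import urlencode


def search_url(base_url, query="", **filters):
    """Build a dataset search URL with optional free-text query and filters.

    `filters` maps a (hypothetical) field name to one value or a list of values.
    """
    params = []
    if query:
        params.append(("q", query))
    for field, values in sorted(filters.items()):
        if isinstance(values, str):
            values = [values]
        # A field may be repeated to select several values at once.
        params.extend((field, value) for value in values)

    url = f"{base_url.rstrip('/')}/dataset/"
    return f"{url}?{urlencode(params)}" if params else url


# Example: the link behind the "Cities" sector on the home page.
print(search_url("https://portal.example.com", sector="Cities"))
```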

Search Page

The Dataset Metadata Form

Datasets are where the most changes have taken place, both in how they’re presented and within the metadata itself. We introduced a custom metadata schema to better organize and categorize datasets. The new schema fields are:

  • “Publisher” - Select a tag from the custom “Publishers” CKAN Tag Vocabulary (creates a new tag if not found)
  • “Sector” - Select one of “Cities”, “Transportation”, “Environment”, or “Miscellaneous”
  • “Data Type” - Select one of “Public”, “Bought”, or “Internal”
  • “Data Published” (Valid Period Start) - Start date of data coverage (select from calendar)
  • “Data Valid Until” (Valid Period End) - End date of data coverage (select from calendar)
  • “Update Frequency” - Select one frequency from a pre-defined list, with values such as “Weekly”, “Monthly”, “Irregular”, etc.
  • “Geography” - Select one or more Geographies (CKAN custom group type) associated with the dataset
  • “Related Datasets” - Select one or more datasets associated with the dataset
  • “Super Users in Client” - Select one or more users who are familiar with the dataset
  • “Image” - An optional image that represents the dataset
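The article does not say how the schema was implemented. As a rough illustration of the constraints in the list above, here is a plain-Python sketch that checks a dataset's metadata against the allowed values. The keys (`sector`, `data_type`, `valid_period_start`, `valid_period_end`) are hypothetical names, and the rule that the end date must not precede the start date is an assumption rather than something the source states.

```python
from datetime import date

# Allowed values come from the field list above; the keys are hypothetical names.
CHOICES = {
    "sector": {"Cities", "Transportation", "Environment", "Miscellaneous"},
    "data_type": {"Public", "Bought", "Internal"},
}


def validate_dataset(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, allowed in CHOICES.items():
        value = metadata.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {sorted(allowed)}")

    # Assumed rule: the coverage period must not end before it starts.
    start = metadata.get("valid_period_start")
    end = metadata.get("valid_period_end")
    if isinstance(start, date) and isinstance(end, date) and end < start:
        errors.append("valid_period_end is earlier than valid_period_start")
    return errors
```

In a real CKAN deployment, checks like these would normally live in the schema definition or in validators rather than in a standalone function.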

Additionally, search and autocomplete functionality (using Select2) has been added to all of the “Select…” fields (including the default CKAN fields where applicable):

Select

The Dataset Metadata View Page

Dataset Metadata Page

The layout of the metadata view page has changed quite a bit. It now has a more streamlined presentation, with unnecessary elements removed, including:

  • The “Dataset”, “Groups”, and “Activity Stream” tabs
  • Social media links
  • Organization image

Here’s an example of the vanilla CKAN dataset page:

Vanilla CKAN Dataset Page

You might also notice that “Data and Resources” has been removed from the Client Data Portal. That section has been replaced with the “Access Data” dropdown button, which cleans up the page, yet still provides the functionality to view the resource titles, descriptions, and types, as well as navigate to the resources themselves:

Access Data

To the right of the title, description, and Access Data button is the dataset image:

Dataset Image

Users can click on “Fullscreen” to expand the image:

Full Screen Image

Next, the metadata is laid out so that it is easy to read. Clicking on a metadata item redirects to the search page with similar results:

Metadata

Hovering over a metadata item on the page displays its description:

Description

Description Details

To the right of the main metadata section, the Data Custodian and Steward information is provided. It includes the user's “Department” (a custom User metadata field, covered in “The User Page and Azure Syncing” section), email address, and links to the search page with results for all datasets where the user is either a Custodian or Steward:

User Metadata

The User Page and Azure Syncing

The user page hasn’t changed much, but there is one important change. It now includes the “Client Organization” (also known as “Department”) the user is a part of, along with their “Job Title”:

The User Page

Users, selected user information, and Organizations (Departments) are automatically synced daily from Azure to CKAN. This is handled by a custom CKAN CLI command that is run by a daily cron job. The general flow is:

Cron job triggers the “sync-users” CLI command
  • Initialize a “ConfidentialClientApplication” using the msal Python library
  • Validate credentials (these are set as secrets in the infrastructure)
  • Get access token
  • Get all users in Azure (including their Job Title, E-mail Address, and Department)
  • Iterate over users
      • Validate user (filtering out test accounts, users from Departments that aren’t relevant, etc.)
      • Check if the user already exists in CKAN and then either create or update their CKAN user
      • Check if the user’s Department already exists in CKAN as a Client Organization
          • If not, create a new Organization
      • Add the user as an Editor to the Organization (if they’re not already)
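The flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Client's actual command: `get_graph_token` shows the msal client-credentials call, while `plan_sync`, `is_relevant`, `Action`, and the `allowed_departments` filter are hypothetical names introduced here. Fetching the users from Microsoft Graph and writing the changes to CKAN are left out so that the planning logic stays dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


def get_graph_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    """Acquire an app-only Microsoft Graph token via msal (client credentials)."""
    import msal  # imported lazily so the planning logic below needs no dependencies

    app = msal.ConfidentialClientApplication(
        client_id,
        authority=f"https://login.microsoftonline.com/{tenant_id}",
        client_credential=client_secret,
    )
    result = app.acquire_token_for_client(
        scopes=["https://graph.microsoft.com/.default"]
    )
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "token request failed"))
    return result["access_token"]


@dataclass(frozen=True)
class Action:
    kind: str                  # "create_user", "update_user", "create_org", "add_editor"
    target: str                # e-mail address or department name
    org: Optional[str] = None  # only set for "add_editor"


def is_relevant(user: dict, allowed_departments: set) -> bool:
    """Hypothetical filter: skip accounts with no e-mail or an irrelevant department."""
    return bool(user.get("mail")) and user.get("department") in allowed_departments


def plan_sync(azure_users, ckan_emails, ckan_orgs, memberships, allowed_departments):
    """Translate Azure users into the list of CKAN changes described in the flow."""
    actions = []
    known_orgs = set(ckan_orgs)
    for user in azure_users:
        if not is_relevant(user, allowed_departments):
            continue
        email, dept = user["mail"], user["department"]
        # Create the CKAN user, or update it if it already exists.
        actions.append(Action("update_user" if email in ckan_emails else "create_user", email))
        # Create the Department as an Organization the first time it is seen.
        if dept not in known_orgs:
            actions.append(Action("create_org", dept))
            known_orgs.add(dept)
        # Add the user as an Editor unless they already are a member.
        if (email, dept) not in memberships:
            actions.append(Action("add_editor", email, dept))
    return actions
```

Separating the plan from its execution makes the sync easy to test and to dry-run. Each resulting action would then map onto a CKAN API action such as `user_create`, `user_update`, `organization_create`, or `organization_member_create`.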

The Glossary

The last change is a static page containing a “Data Glossary”. This provides a quick reference for general key terms as well as specific metadata fields:

Glossary

Azure SSO and Blocking Anonymous Access

Though the production Client Data Portal lives on a private network, the Client still wanted to prevent anonymous (not logged-in) users from accessing it and to allow logins through Azure SSO only. To handle this, two CKAN extensions were used:

  • ckanext-azure-auth - “Adds authentication using Microsoft ADFS and Azure AD”
  • ckanext-noanonaccess - “…redirects anonymous users to the login page…” (this blocks access to every page, including the home page—unless logged in)

When not logged in, users will see this page:

Not logged In
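Conceptually, blocking anonymous access comes down to one decision per request: let it through, or redirect to the login page. The sketch below illustrates that decision in plain Python. It is not the code of ckanext-noanonaccess, and the allowed path prefixes and the login path are hypothetical examples.

```python
from typing import Optional

# Hypothetical paths that must stay reachable so the SSO login can complete.
ALLOWED_PREFIXES = ("/user/login", "/azure/callback", "/static/")


def redirect_target(
    path: str, user: Optional[str], login_path: str = "/user/login"
) -> Optional[str]:
    """Return the path to redirect to, or None if the request may proceed."""
    if user:
        # Authenticated users can reach every page.
        return None
    if path.startswith(ALLOWED_PREFIXES):
        # Login-related pages stay open, otherwise nobody could ever log in.
        return None
    return login_path
```

The allow-list is the important design point: without it, the login page itself would redirect to the login page in a loop.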

Internal Usage Analytics

Google Tag Manager (GTM) was used to track dataset page views and resource clicks. Custom Tags, Triggers, and Variables were created to track the specific actions of interest (see the GTM developer portal for more information). To connect the portal to GTM, ckanext-gtm was used, which inserts the GTM snippet containing the GTM ID into CKAN web pages, allowing them to trigger GTM tags.

The outcome

The final result is the Client Data Portal, an internal-use data management portal that empowers users to easily share datasets with their colleagues anywhere in the world. It provides the features that matter and skips the ones that don’t.

What's next

Datopian is currently handling support for the Client while they go through the steps to onboard users to the new data portal. As feedback and ideas arise, Datopian will be happy to work with the Client on whatever comes next.

We are the CKAN experts.

Datopian are the co-creators, co-stewards, and one of the main developers of CKAN. We design, develop and scale CKAN solutions for everyone from government to the Fortune 500. We also monitor client use cases for data to ensure that CKAN is responding to genuine challenges faced by real organizations.
