Internal Data Catalog: A Journey to Better Data Management with CKAN
Datopian
Not disclosed due to contractual obligations
Datopian collaborated with a global consulting and engineering company to develop a customized data management portal using CKAN, addressing specific requirements to streamline data accessibility and usability for their global team.
The Client's in-house data catalog solution was overly complex and technical, making it unsuitable for their users' needs. They required a simpler, user-friendly solution to manage and share datasets efficiently across their organization.
The Client needed a straightforward data catalog to facilitate easy inter-organizational dataset sharing, with features including data access permissions, usage analytics, Azure SSO login, automated Azure user and organization synchronization, and custom dataset metadata schemas.
Datopian designed and implemented a CKAN-based data portal with a customized UI, integrated Azure SSO, and synchronized user and organization data from Azure to CKAN. The solution included custom metadata schemas, analytics tracking, and an intuitive interface, enhancing data management and accessibility.
Context
NOTE: The client cannot be disclosed due to contractual obligations, and the screenshots have been anonymized by changing brand colors and removing the logo.
The Client is a global consulting and engineering business that collaborates with customers, colleagues, and partners to craft sustainable and long-lasting solutions that improve quality of life. It comprises thousands of people around the world, with expertise in engineering, architecture, energy, and environment. This wide reach provides an extensive pool of bright minds available for collaboration, which gives them an edge when solving hard problems.
Their ambitious sustainability goals, including full carbon neutrality within the next 30 years and 100% of revenue coming from projects driving sustainability within the next 5, are evidence of their vision for a better future.
The situation
When the Client first contacted Datopian, they had already attempted to adopt an in-house data catalog solution, but it proved too complicated and technical for their users’ use cases. What they needed was not a feature-rich product but a feature-specific one.
The criteria
At a high level, the main requirement was a simple data catalog—an internal data management solution with easy inter-organizational dataset sharing for thousands of users around the world that could streamline operations and enhance data accessibility.
With that as the base requirement, the additional criteria were:
- Avoid duplication of data
- Data access controlled on a per-user basis (per-user access permissions)
- Track data and page usage (usage analytics)
- Azure SSO login
- Automated syncing of Azure users and organizations to the data portal
- Custom dataset metadata schema
- Easy to use for non-technical users
The solution
Datopian worked closely with the Client throughout the implementation process to ensure that everything, down to the smallest features, would not only “get the job done” but improve their processes and enhance the data management process as a whole. The implementation of CKAN facilitated seamless enterprise data integration, allowing for consistent data management practices across the client's global operations.
The UI
A growing number of Datopian’s clients are opting for decoupled frontends to allow for robust customization. To keep things simple, the Client Data Portal does not use one: all theming and UI changes were made within CKAN itself, keeping both the code and the infrastructure easy to maintain.
The Home Page
The site’s home page is straightforward. A fairly standard navigation bar sits at the top (present on every page), with a search field for datasets, and links to “Datasets”, “Client Organizations” (normal CKAN Organizations), “Subjects” (custom Group type), “Geography” (custom Group type), “About”, and a link to the current user’s profile page.
Below the navigation bar is a welcome banner with four links, each representing a “Data Type” (a custom schema field) of the portal's datasets. The number next to each type indicates how many datasets of that type the portal holds. Clicking on one of the types will redirect to the search page with that type of dataset selected.
Below the welcome banner, there are two tabs. The first—the default when the page loads—is “Sectors” (another custom dataset metadata schema field), which contains the four sectors that can be associated with a dataset. As expected, clicking on one of the sectors redirects to the search page with results for that sector.
Clicking on the second tab will switch to the “Subjects” list, where users can navigate to the search results for a given subject:
The Search Page
The Client’s custom group types (“Subjects” and “Geography”) have been added as search filters, along with their custom dataset metadata schema fields (“Sectors” and “Data Types”).
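As a rough illustration of how a home-page link can land on a pre-filtered search page, the sketch below builds a search URL with one query parameter per selected filter. The host name and the parameter names `sector` and `data_type` are assumptions made for the example; the portal's actual field names are not given here.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, **filters: str) -> str:
    """Return a dataset search URL with one query parameter per filter."""
    # Drop empty filters so the URL only carries facets the user selected.
    params = {key: value for key, value in filters.items() if value}
    # Sort the parameters so the same selection always yields the same URL.
    query = urlencode(sorted(params.items()))
    return f"{base_url}/dataset/" + (f"?{query}" if query else "")


# Hypothetical host and field names, for illustration only.
print(build_search_url("https://portal.example.com", sector="Cities"))
```

Sorting the parameters is a design choice rather than a requirement: it keeps links stable, which helps when page views are later grouped by URL in analytics.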
The Dataset Metadata Form
Datasets are where the most changes have taken place, both in how they’re presented and within the metadata itself. We introduced a custom metadata schema to better organize and categorize datasets. The new schema fields are:
- “Publisher” - Select a tag from the custom “Publishers” CKAN Tag Vocabulary (creates a new tag if not found)
- “Sector” - Select one of “Cities”, “Transportation”, “Environment”, or “Miscellaneous”
- “Data Type” - Select one of “Public”, “Bought”, or “Internal”
- “Data Published” (Valid Period Start) - Start date of data coverage (select from calendar)
- “Data Valid Until” (Valid Period End) - End date of data coverage (select from calendar)
- “Update Frequency” - Select one frequency from a pre-defined list, with values such as “Weekly”, “Monthly”, “Irregular”, etc.
- “Geography” - Select one or more Geographies (CKAN custom group type) associated with the dataset
- “Related Datasets” - Select one or more datasets associated with the dataset
- “Super Users in Client” - Select one or more users who are familiar with the dataset
- “Image” - An optional image that represents the dataset
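To make the constraints in the list above concrete, here is a minimal validation sketch in plain Python. It is not the portal's implementation: the dictionary keys (`sector`, `data_type`, `valid_period_start`, `valid_period_end`) are hypothetical names, and the rule that the end date must not precede the start date is an assumption. Only the allowed values come from the field list.

```python
from datetime import date

# Allowed values come from the field list; the keys are hypothetical names.
CHOICES = {
    "sector": {"Cities", "Transportation", "Environment", "Miscellaneous"},
    "data_type": {"Public", "Bought", "Internal"},
}


def validate_dataset(metadata: dict) -> list:
    """Return a list of problems found; an empty list means the metadata is valid."""
    errors = []
    for field, allowed in CHOICES.items():
        value = metadata.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    start = metadata.get("valid_period_start")
    end = metadata.get("valid_period_end")
    # Assumption: a coverage period may not end before it starts.
    if start and end and end < start:
        errors.append("valid_period_end must not be earlier than valid_period_start")
    return errors


example = {
    "sector": "Cities",
    "data_type": "Public",
    "valid_period_start": date(2023, 1, 1),
    "valid_period_end": date(2023, 12, 31),
}
print(validate_dataset(example))
```

In a CKAN deployment, checks like these would normally live in the schema and validator layer rather than in a standalone function; the sketch only shows the shape of the rules.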
Additionally, search and autocomplete functionality (using Select2) has been added to all of the “Select…” fields (including the default CKAN fields where applicable):
The Dataset Metadata View Page
The layout of the metadata view page has changed quite a bit. It now opts for a more streamlined presentation, with unnecessary elements removed. For example, removed elements include:
- The “Dataset”, “Groups”, and “Activity Stream” tabs
- Social media links
- Organization image
Here’s an example of the vanilla CKAN dataset page:
You might also notice that “Data and Resources” has been removed from the Client Data Portal. That section has been replaced with the “Access Data” dropdown button, which cleans up the page, yet still provides the functionality to view the resource titles, descriptions, and types, as well as navigate to the resources themselves:
To the right of the title, description, and Access Data button is the dataset image:
Users can click on “Fullscreen” to expand the image:
Next, the metadata is neatly laid out, making it easily readable (and clicking on a metadata item will redirect to the search page with similar results):
Each metadata item on the page can be hovered over to provide a description:
To the right of the main metadata section, the Data Custodian and Steward information is provided, including the user's “Department” (more on this in the “User Page and Azure Syncing” section—this is a custom User metadata field), email address, and links to the search page with results for all datasets where the user is either a Custodian or Steward:
The User Page and Azure Syncing
The user page hasn’t changed much, but there is one important change. It now includes the “Client Organization” (also known as “Department”) the user is a part of, along with their “Job Title”:
Users, selected user information, and Organizations (Departments) are automatically synced from Azure to CKAN once a day. This is handled by a custom CKAN CLI command that is run by a daily cron job. The general flow is:
- Cron job triggers the “sync-users” CLI command
  - Initialize a “ConfidentialClientApplication” using the msal Python library
  - Validate credentials (these are set as secrets in the infrastructure)
  - Get access token
  - Get all users in Azure (including their Job Title, E-mail Address, and Department)
  - Iterate over users
    - Validate user (filtering out test accounts, users from Departments that aren’t relevant, etc.)
    - Check if the user already exists in CKAN and then either create or update their CKAN user
    - Check if the user’s Department already exists in CKAN as a Client Organization
      - If not, create a new Organization
    - Add the user as an Editor to the Organization (if they’re not already)
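The flow above can be sketched roughly as follows. This is not the Client's actual command. It assumes the user list is read from the Microsoft Graph `/users` endpoint, and the helper names (`get_access_token`, `fetch_azure_users`, `plan_sync`) and the `relevant_departments` filter are hypothetical. The sketch computes the planned actions as plain tuples rather than calling CKAN, so the decision logic can be read and tested on its own.

```python
import json
import urllib.request

# Assumption: users are listed via Microsoft Graph with only the needed fields.
GRAPH_USERS_URL = (
    "https://graph.microsoft.com/v1.0/users"
    "?$select=displayName,mail,jobTitle,department"
)


def get_access_token(tenant_id, client_id, client_secret):
    """Acquire an app-only Microsoft Graph token with msal."""
    import msal  # third-party; imported lazily so plan_sync runs without it

    app = msal.ConfidentialClientApplication(
        client_id,
        authority=f"https://login.microsoftonline.com/{tenant_id}",
        client_credential=client_secret,
    )
    result = app.acquire_token_for_client(
        scopes=["https://graph.microsoft.com/.default"]
    )
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "token request failed"))
    return result["access_token"]


def fetch_azure_users(token):
    """Yield every user from Microsoft Graph, following pagination links."""
    url = GRAPH_USERS_URL
    while url:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")


def plan_sync(azure_users, ckan_emails, ckan_orgs, relevant_departments):
    """Return the ordered (action, target) steps the sync would perform."""
    actions = []
    known_orgs = set(ckan_orgs)
    for user in azure_users:
        email, department = user.get("mail"), user.get("department")
        # Validate: skip accounts without an e-mail or from other departments.
        if not email or department not in relevant_departments:
            continue
        action = "update_user" if email in ckan_emails else "create_user"
        actions.append((action, email))
        if department not in known_orgs:
            actions.append(("create_org", department))
            known_orgs.add(department)
        # Assumes the membership call is idempotent for existing editors.
        actions.append(("add_editor", f"{email} -> {department}"))
    return actions
```

Separating planning from execution is a design choice for the sketch: a real command would translate each action into the corresponding CKAN API call, and could log the plan first as a dry run.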
The Glossary
The last change is a static page containing a “Data Glossary”. This provides a quick reference for general key terms as well as specific metadata fields:
Azure SSO and Blocking Anonymous Access
Though the production Client Data Portal lives on a private network, the Client still wanted to prevent anonymous (not logged in) users from accessing it and to allow logins through Azure SSO only. To handle this, two CKAN extensions were used:
- ckanext-azure-auth - “Adds authentication using Microsoft ADFS and Azure AD”
- ckanext-noanonaccess - “…redirects anonymous users to the login page…” (this blocks access to every page, including the home page—unless logged in)
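The behavior described for anonymous users can be pictured with a small decision function. This is a simplified sketch of the idea, not the code of ckanext-noanonaccess: the function name and the `allowed_prefixes` parameter are hypothetical, and `/user/login` is assumed to be the login path.

```python
def redirect_target(path, is_logged_in,
                    login_path="/user/login",
                    allowed_prefixes=("/user/login",)):
    """Return the path to redirect to, or None when the request may proceed."""
    if is_logged_in:
        return None
    # Let anonymous users reach the login flow itself to avoid a redirect loop.
    if any(path.startswith(prefix) for prefix in allowed_prefixes):
        return None
    return login_path


# An anonymous request for the home page is sent to the login page.
print(redirect_target("/", is_logged_in=False))
```

A real deployment would also need to let through whatever callback path the SSO flow uses, which is why the allowed prefixes are a parameter here.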
When not logged in, users will see this page:
Internal Usage Analytics
Google Tag Manager was used to track dataset page views and resource clicks. Custom Tags, Triggers, and Variables were created to track the specific actions of interest (see the GTM developer portal for more information). To connect the portal to GTM, ckanext-gtm was used, which inserts the GTM ID HTML into CKAN web pages, allowing it to trigger GTM tags.
The outcome
The final result is the Client Data Portal, an internal-use data management portal that empowers users to easily share datasets with their colleagues anywhere in the world. It provides the features that matter and skips the ones that don’t.
What's next
Datopian is currently handling support for the client while they go through the steps to onboard users to the new data portal. As feedback and ideas arise, Datopian will be happy to work with this Client on whatever comes next.