Driving Innovation: AIR’s Data Portal for Fighting Corruption
Datopian
The Alliance for Innovative Regulation (AIR) hosted a TechSprint aimed at enhancing government transparency to combat corruption. To support participants, a centralized, bilingual data portal was developed by Datopian, offering seamless access to datasets, integrated problem statements, and advanced metadata management features. This enabled participants to create impactful, data-driven prototypes.
The TechSprint aimed to tackle corruption by improving government transparency through innovative digital tools. However, participants faced challenges accessing and using the necessary datasets and problem statements in a structured and user-friendly way. The lack of a centralized, searchable platform made it hard to link data with solutions effectively.
The project required a data portal that could centralize datasets from multiple sources, provide metadata-rich pages with data previews and version history, support bilingual content for English and Spanish speakers, and integrate problem statements into a cohesive and accessible platform.
Datopian created a robust data portal that centralized all required datasets in a searchable catalog, provided bilingual support across metadata and static content, and seamlessly integrated problem statements using programmatically indexed markdown files. With advanced metadata management powered by GitHub and Frictionless Data Packages, the portal enabled participants to efficiently access and utilize the resources necessary for building high-quality prototypes, advancing TechSprint's mission of promoting transparency and fighting corruption.
Context
Finance is one of the sectors that has benefited the most from the latest technological advancements. The Alliance for Innovative Regulation (AIR) is a global nonprofit, non-membership organization dedicated to ensuring that the digital transformation of finance is matched by the digital transformation of financial regulation, equipping regulators to meet their mandate to make the financial sector resilient, inclusive, and fair.
AIR believes that financial regulation should be a potent force for good. It should operate as an invisible force in people’s lives, ensuring that consumers are shielded from discrimination and abuse, that small businesses can access capital, that criminals cannot use the banking system to hide illicit activity, and that financial catastrophes are averted. To help achieve its vision, AIR educates, connects, and supports the regulatory ecosystem to help realize these goals.
The Situation
AIR’s initiatives include TechSprints, hackathon-like collaboration events in which participants work together to identify solutions to complex challenges. While TechSprints can vary widely, the core idea is that cross-functional teams of experts compete to solve difficult problems facing consumers, regulators, and/or industry through technology. Each TechSprint culminates in a “demo day,” where teams unveil prototypes to a panel of judges and a wider audience. Winning concepts sometimes move into formal incubation.
Previous TechSprints covered topics such as anti-corruption, financial crime and human trafficking, and climate-positive blockchain. In 2024, AIR organized a new TechSprint focused on enhancing government transparency as an anti-corruption tool.
The Criteria
To support this TechSprint event, a data portal was needed, so participants could find the event-related general information and datasets necessary to support prototyping activities. The main requirements were:
- Styling aligned with the TechSprint’s theme and branding, along with contextual information about the event, including its main topic and goals.
- Article-like static pages informing users about the five problem statements defined for the TechSprint.
- A data catalog indexing datasets from multiple sources, with full-text search and data resource filtering.
- A dataset page displaying the dataset's metadata, including its description and the resources available for download, along with a sample preview, data dictionary, and previous versions for each resource.
- Bilingual support for all static content and metadata, such as each resource’s data dictionary, to support bilingual TechSprint activities.
The Solution
Preparing the data
To get started, the AIR team provided the Datopian data engineering team with a list of datasets that had to be included in the data portal. The challenges were that the data for these datasets were stored in multiple places, and the metadata wasn’t programmatically available anywhere.
To handle this, the Datopian data engineering team first wrote ETL scripts to ingest the data files into a centralized bucket.
Preparing the metadata
Then, the AIR team provided a spreadsheet containing the metadata for the datasets. To standardize and make the metadata programmatically available, it was decided that GitHub would serve as the metadata storage.
Frictionless Data Packages
The metadata for each dataset would be stored as Frictionless Data Packages, a simple, standardized way to organize and share collections of data (packages). The standard aims to reduce the "friction" that often complicates data management, sharing, and usage, making data accessible, understandable, and interoperable. The Data Package (which is simply a datapackage.json file) contains references to data files along with a detailed description of what's inside and how it is structured. This description helps anyone who uses the data to understand it quickly and accurately without needing extra explanations.
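As an illustration, a minimal descriptor might look like the one built below. This is a sketch, not the portal's actual metadata: the dataset name, file path, and field definitions are hypothetical, and only the general shape (a list of `resources`, each with a `path` and a `schema` listing `fields`) follows the Data Package and Table Schema specifications.

```python
import json

# Hypothetical dataset used for illustration only; the names, path and
# fields are assumptions, not taken from the TechSprint portal.
descriptor = {
    "name": "example-public-contracts",
    "title": "Example public contracts",
    "description": "Illustrative dataset of awarded public contracts.",
    "resources": [
        {
            "name": "contracts",
            "path": "data/contracts.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "contract_id", "type": "string",
                     "description": "Unique contract identifier"},
                    {"name": "supplier", "type": "string",
                     "description": "Name of the awarded supplier"},
                    {"name": "amount", "type": "number",
                     "description": "Contract value"},
                ]
            },
        }
    ],
}


def to_datapackage_json(package: dict) -> str:
    """Serialize a descriptor as the text of a datapackage.json file."""
    return json.dumps(package, indent=2, ensure_ascii=False)


def field_names(package: dict, resource_name: str) -> list[str]:
    """Return the column names declared for one resource."""
    for resource in package["resources"]:
        if resource["name"] == resource_name:
            return [field["name"] for field in resource["schema"]["fields"]]
    raise KeyError(resource_name)


if __name__ == "__main__":
    print(to_datapackage_json(descriptor))
```

Because the descriptor is plain JSON, it can be read by any tool or person without access to the data file itself.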
GitHub metadata storage
With the metadata stored as JSON files, GitHub provides a set of very useful features that practically turn the metadata storage into a rich metadata management system:
- It allows users to navigate and visualize all the metadata files, as well as see their change history, which can be used for rollbacks or auditing.
- It enables users to use the GitHub UI to create and update metadata directly from their browser, easing external collaboration.
- Because changes have to be made on separate branches and merged through pull requests, contributions go through an approval workflow, allowing maintainers to review and discuss changes before integrating them into production.
- With GitHub issues, maintainers can keep a backlog with future work, such as bug fixes and enhancements, as well as keep track of past work.
Home page
For the homepage, we took inspiration from the OpenSpending data portal, which is brief, minimalist, straight to the point, and elegant.
The hero section of the homepage gives users an overview of what can be found on the data portal, with statistics on the count of datasets, data files, and data resource diversity.
Immediately below the hero section, the searchable data catalog can be found, allowing users to explore or search datasets by text and filters, such as data resource, and grasp the context of each dataset with the metadata that is provided on the search result cards.
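The combination of text search and filters can be sketched as follows. This is a simplified illustration, not the portal's search implementation: the catalog entries, the `country` filter key, and the `search` function are hypothetical, and matching is plain substring matching rather than a real full-text index.

```python
# Hypothetical catalog entries; a real catalog would be built from the
# datasets' metadata files.
CATALOG = [
    {"name": "budget-2023", "title": "National budget 2023",
     "description": "Approved spending by ministry", "country": "Exampleland"},
    {"name": "contracts", "title": "Public contracts",
     "description": "Awarded procurement contracts and spending",
     "country": "Samplestan"},
    {"name": "officials", "title": "Asset declarations",
     "description": "Declared assets of public officials",
     "country": "Exampleland"},
]


def search(datasets, query="", **filters):
    """Return datasets whose title or description contains every query
    term and whose metadata equals every given filter value."""
    terms = query.lower().split()
    matches = []
    for dataset in datasets:
        text = f"{dataset.get('title', '')} {dataset.get('description', '')}".lower()
        if not all(term in text for term in terms):
            continue
        if any(dataset.get(key) != value for key, value in filters.items()):
            continue
        matches.append(dataset)
    return matches


if __name__ == "__main__":
    for hit in search(CATALOG, "spending", country="Exampleland"):
        print(hit["name"], "-", hit["title"])
```

An empty query with no filters returns the whole catalog, which matches the "explore" case where a user simply browses the list.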
Dataset page
When users find a dataset of interest in the data catalog, clicking on it takes them to the dataset page, where the complete information about the dataset can be found.
On this page, besides contextual metadata such as the dataset’s description and country, the actual data can also be discovered, previewed, and downloaded.
The page displays a separate section for each data file, containing its own contextual information, rich data preview with filtering and sorting, data dictionary, and previous versions, as well as links to the raw data files when applicable.
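A rough sketch of how a data dictionary and a sorted preview could be derived for one data file is shown below. It assumes the file is a CSV and that its columns are described by a Table Schema-style `fields` list; the sample data, the schema, and both function names are hypothetical and are not taken from the portal.

```python
import csv
import io

# Hypothetical sample file and schema, for illustration only.
SAMPLE_CSV = (
    "contract_id,supplier,amount\n"
    "C-3,Acme,1200\n"
    "C-1,Globex,800\n"
    "C-2,Initech,450\n"
)
SAMPLE_SCHEMA = {
    "fields": [
        {"name": "contract_id", "type": "string",
         "description": "Unique contract identifier"},
        {"name": "supplier", "type": "string"},
        {"name": "amount", "type": "number", "description": "Contract value"},
    ]
}


def data_dictionary(schema):
    """Return (name, type, description) rows, one per declared field."""
    return [
        (field["name"], field.get("type", ""), field.get("description", ""))
        for field in schema.get("fields", [])
    ]


def preview_rows(csv_text, limit=5, sort_by=None):
    """Return the first `limit` rows, optionally sorted by one column.
    Values are compared as strings, since csv yields text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if sort_by is not None:
        rows.sort(key=lambda row: row[sort_by])
    return rows[:limit]


if __name__ == "__main__":
    for name, type_, description in data_dictionary(SAMPLE_SCHEMA):
        print(f"{name:12} {type_:8} {description}")
    for row in preview_rows(SAMPLE_CSV, limit=2, sort_by="contract_id"):
        print(row)
```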
Problem statements
For this TechSprint, participants worked on five problem statements related to improving existing government transparency initiatives. Topics included AI-powered chatbots, social media integrations, transparency platform user experience and interface, low-code applications and self-service tools, infrastructure and development gap analysis, and public data quality.
These problem statements had to be integrated into the data portal as references for TechSprint participants.
To achieve this, the Datopian team migrated the content of each problem statement to markdown files. MarkdownDB was used to index these files and make them programmatically searchable, and PortalJS was used to render each problem statement as a dedicated page.
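The general idea of indexing markdown files can be sketched in a few lines. The example below does not use the MarkdownDB API; it is a standalone Python illustration of the concept, with hypothetical file names, placeholder content, and a deliberately simple `key: value` front matter parser.

```python
# Hypothetical files with placeholder content; real problem statements
# would be read from disk.
FILES = {
    "public-data-quality.md": (
        "---\n"
        "title: Public data quality\n"
        "---\n"
        "How can published datasets be made more reliable?"
    ),
    "chatbots.md": "AI-powered chatbots for transparency portals.",
}


def parse_markdown(text: str) -> tuple[dict, str]:
    """Split a document into (front matter, body). Front matter is a
    block of `key: value` lines between two `---` delimiter lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    # No closing delimiter: treat the whole text as body.
    return {}, text.strip()


def index_documents(files: dict[str, str]) -> list[dict]:
    """Build one index entry per file, ordered by file name."""
    index = []
    for filename in sorted(files):
        meta, body = parse_markdown(files[filename])
        slug = filename.rsplit(".", 1)[0]
        index.append({"slug": slug, "title": meta.get("title", slug), "body": body})
    return index


def find(index: list[dict], query: str) -> list[str]:
    """Return slugs of entries whose title or body contains the query."""
    q = query.lower()
    return [e["slug"] for e in index
            if q in e["title"].lower() or q in e["body"].lower()]


if __name__ == "__main__":
    print(find(index_documents(FILES), "datasets"))
```

Each entry's `slug` can then serve as the URL path of the dedicated page for that problem statement.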
Internationalization
Finally, to make the data portal bilingual and easy to use for both English and Spanish speakers, internationalization was implemented throughout. This ranges from the static content, such as the hero section and the problem statements, to the metadata itself, such as the data files' data dictionaries.
Users can easily spot the language toggle on the page header and switch to their preferred language.
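One way bilingual metadata could be resolved is sketched below. The layout is an assumption made for illustration: translatable values are stored as objects keyed by language code, which is not how the source describes the portal's storage and is not part of the standard Data Package properties. The function names and sample field are hypothetical.

```python
DEFAULT_LANG = "en"
# Assumption: only these keys may hold per-language values.
TRANSLATABLE_KEYS = ("title", "description")


def pick(value, lang, default=DEFAULT_LANG):
    """Return the text for `lang`, falling back to the default language.
    Plain strings are returned unchanged."""
    if isinstance(value, dict):
        return value.get(lang, value.get(default, ""))
    return value


def localize_field(field: dict, lang: str) -> dict:
    """Return a copy of a data dictionary entry in the requested language."""
    localized = dict(field)
    for key in TRANSLATABLE_KEYS:
        if key in localized:
            localized[key] = pick(localized[key], lang)
    return localized


# Hypothetical data dictionary entry with English and Spanish text.
AMOUNT_FIELD = {
    "name": "amount",
    "type": "number",
    "description": {"en": "Contract value", "es": "Valor del contrato"},
}

if __name__ == "__main__":
    print(localize_field(AMOUNT_FIELD, "es")["description"])
```

Falling back to the default language keeps a page readable when a translation is missing, instead of showing an empty cell.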
The Outcome
The data portal was a key factor in the TechSprint's success. Participants were able to access public data through the portal and link the problem statements to the necessary datasets in the data catalog, which raised the quality of the prototypes created by TechSprint teams. Ultimately, the use of cataloged public data helped teams create prototypes anticipated to make a tangible impact on transparency and anti-corruption outcomes.
What's next?
The data portal created by these efforts provided a foundation for AIR to incorporate increasingly diverse and complex datasets in future TechSprint initiatives. With this strong foundation, AIR will seek opportunities to leverage financial open data initiatives as part of TechSprints hosted across a variety of topics in more countries and institutions across the world.