A platform sitting on massive public data had no way to connect it. Siloed systems, duplicate records, and manual cross-referencing kept the team stuck. An AI-powered knowledge graph changed that — turning stagnant data into a compounding intelligence asset.
- Built AI that resolves millions of identities — catching duplicates that manual methods missed entirely.
- Turned a stagnant data repository into a living intelligence engine where each new dataset enriches everything that came before it.
- Scaled data throughput 5× on the same team — the graph and event-driven architecture absorbed the growth.
Thousands of datasets, one coordination problem
Google Data Commons is an open data platform that aggregates and publishes datasets from governments, research institutions, and organizations around the world — census data, climate records, economic indicators, health statistics, and more. The mission is straightforward: make the world’s public data accessible and useful.
The execution is anything but simple. Data Commons imports and refreshes datasets from dozens of sources continuously. Each source has its own schema, its own update cadence, its own identifiers. Keeping thousands of datasets in sync — accurately, at scale, without manual intervention — was the core engineering challenge.
Cross-referencing entities across sources was slow. The same place, variable, or statistical concept could appear under different names in different datasets. Manual reconciliation ate engineering time. And as the number of datasets grew, the infrastructure couldn’t keep pace — more data coming in, no way to process it faster without throwing more people at the problem.
An AI that doesn’t just store data — it understands it
The answer was a knowledge graph built on BigQuery — not a static database, but a living structure where every entity, relationship, and attribute connects to everything else. Google Data Commons brought in CLOUDSUFI to design and build the AI layer that made it work at scale.
Three innovations did the heavy lifting:
- AI schema mapping that learns new dataset structures on its own. When a new source lands, the system reads its shape, figures out how it maps to the existing graph, and plugs it in — no manual integration code required.
- An entity resolution engine built on Vertex AI that matches statistical variables and place identifiers across sources with accuracy that manual methods couldn’t touch. It catches mismatches, merges fragments, and builds a single clean identity — even when names differ, formats don’t match, and IDs disagree.
- An event-driven architecture on Google Cloud that updates the graph in real time. No nightly batches. No stale snapshots. The graph stays current as datasets flow in from governments and research institutions worldwide.
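The matching idea behind the entity resolution bullet can be illustrated with a minimal Python sketch. This is not the Vertex AI engine described above, whose internals the case study does not detail; it swaps in plain string similarity from the standard library's `difflib`. The place IDs, alias lists, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
import re

# Hypothetical canonical places already in the graph: ID -> known names.
CANONICAL_PLACES = {
    "place/usa": ["United States", "United States of America", "USA"],
    "place/gbr": ["United Kingdom", "Great Britain", "UK"],
    "place/kor": ["South Korea", "Republic of Korea"],
}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] computed on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(name: str, threshold: float = 0.85):
    """Return (canonical_id, score) for the best alias match.

    The ID is None when the best score falls below the threshold, so an
    unknown name is flagged instead of being merged into the wrong entity.
    """
    best_id, best_score = None, 0.0
    for place_id, aliases in CANONICAL_PLACES.items():
        for alias in aliases:
            score = similarity(name, alias)
            if score > best_score:
                best_id, best_score = place_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Differently formatted or misspelled names from separate sources.
for raw in ["U.S.A.", "Unted States", "Atlantis"]:
    print(raw, "->", resolve(raw)[0])
```

The threshold is the main design lever in a sketch like this: set it too low and distinct entities merge, set it too high and duplicates survive. A production system would presumably combine many more signals than names alone.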
The result was a shift from a fragmented import pipeline to a compounding intelligence asset. Each new dataset doesn’t just add rows to BigQuery — it enriches every entity already in the graph. Connections that were invisible before surface on their own. The more data goes in, the more useful the whole platform becomes.
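A toy Python sketch can make the compounding effect concrete. It assumes entities have already been resolved to shared IDs and uses an in-memory dictionary as a stand-in for the BigQuery-backed graph; the class name, entity IDs, variable names, and values are invented for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory stand-in for the knowledge graph (not BigQuery)."""

    def __init__(self):
        # entity_id -> variable -> {source: value}
        self.facts = defaultdict(lambda: defaultdict(dict))

    def ingest(self, source, rows):
        """Attach each (entity_id, variable, value) row to its resolved entity."""
        for entity_id, variable, value in rows:
            self.facts[entity_id][variable][source] = value

    def variables_for(self, entity_id):
        """List the variables known for one entity, across all sources."""
        return sorted(self.facts.get(entity_id, {}))

    def co_observed(self, var_a, var_b):
        """Entities that carry both variables, possibly from different sources."""
        return sorted(
            entity
            for entity, variables in self.facts.items()
            if var_a in variables and var_b in variables
        )


graph = TinyGraph()
graph.ingest("census", [("place/a", "population", 1000),
                        ("place/b", "population", 2500)])
print(graph.co_observed("population", "mean_temperature"))  # []

# A second, unrelated dataset lands on the same resolved entity.
graph.ingest("climate", [("place/a", "mean_temperature", 14.2)])
print(graph.co_observed("population", "mean_temperature"))  # ['place/a']
```

Neither dataset changed, yet after the second import `place/a` supports a cross-domain comparison that neither source could answer alone.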
“CLOUDSUFI brings a rare combination of strategic data expertise and execution excellence. Their team understands the complete data journey, from sourcing and governance to discovery and monetization. Rather than simply deploying technology, they focus on making data usable, trusted, and valuable across the organization. That end-to-end perspective, combined with their specialized approach to organizing and operationalizing enterprise data, made them an invaluable partner.”
Five times the throughput, zero new hires
Data throughput grew 5× — and the Data Commons team didn’t hire a single new engineer to handle it. The knowledge graph and AI automation absorbed the volume. Dataset imports that used to require manual schema work now happen on their own.
Manual processing dropped by 40%. Schema mapping, data integration, and record matching — tasks that used to eat weeks of engineering time — are now handled by AI. Engineers moved from plumbing to product work.
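To picture what automated schema mapping replaces, here is a deliberately simple Python sketch. The system described in this story is said to learn new dataset structures; this stand-in only matches incoming column names against a hand-written alias table with fuzzy matching from `difflib`. The canonical field names, aliases, and cutoff are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical canonical graph fields and column names seen in past sources.
CANONICAL_FIELDS = {
    "place_id": ["geo_id", "location_code", "fips"],
    "observation_date": ["date", "year", "period"],
    "value": ["val", "amount", "measurement"],
}


def build_lookup():
    """Flatten the alias table into {known_name: canonical_field}."""
    lookup = {}
    for canonical, aliases in CANONICAL_FIELDS.items():
        lookup[canonical] = canonical
        for alias in aliases:
            lookup[alias] = canonical
    return lookup


def map_columns(columns, cutoff=0.8):
    """Map each incoming column to a canonical field, or None if unsure."""
    lookup = build_lookup()
    mapping = {}
    for column in columns:
        key = column.strip().lower().replace(" ", "_")
        match = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        mapping[column] = lookup[match[0]] if match else None
    return mapping


print(map_columns(["Geo_ID", "Year", "Value", "notes"]))
```

Columns that map to `None` are the ones that would still need a human look, which is where the remaining manual work in a pipeline like this would concentrate.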
Entity resolution accuracy jumped from 60–70% to 95%+. Millions of identity records resolved correctly. Ghost records stopped polluting reports. Downstream analytics finally had a foundation they could trust.
But the biggest change is structural. The graph compounds. Every new dataset imported into BigQuery doesn’t just sit alongside the old ones — it makes the old ones more valuable. Connections between statistical variables surface automatically. What started as a data engineering problem became the backbone of one of the world’s most comprehensive open data platforms.
“CLOUDSUFI has been an effective partner for Data Commons, supporting large-scale data ingestion, automation, and managed services. Their team demonstrated strong technical expertise in building scalable validation, monitoring, and schematization tools, consistently improving data reliability and throughput. Given their disciplined execution on critical projects, I expect CLOUDSUFI to play a key role in scaling Data Commons 10x over the next 12–18 months.”
What’s next: from resolution to prediction
The knowledge graph is now the foundation, not the finish line. The Data Commons team is expanding into deeper cross-domain analysis — using the graph’s connected structure to surface correlations between climate, economic, and health datasets that no single source could show alone.
New government and institutional data sources are continuously being onboarded into BigQuery. Natural language querying via Vertex AI is expanding the platform’s reach, so researchers and policymakers can ask questions in plain English instead of writing SQL against a schema they would otherwise have to memorize.
The stagnant repository is gone. In its place is an engine that gets better every time new data arrives.
CLOUDSUFI is a Google Cloud Premier Partner specializing in data engineering, AI, and cloud migration.