Software and Data

Software, data products, and applied data projects built for research and policy work.

AI research tool

Ask Atlas

Ask Atlas is a production AI assistant for the Atlas of Economic Complexity, answering natural-language questions over roughly 60 trade-data tables across seven PostgreSQL schemas. It combines a LangGraph ReAct agent, hybrid documentation retrieval, SQL execution tools, and links to Atlas visualizations with a React/TypeScript frontend that supports streaming responses, user feedback, and conversation history, plus an evaluation harness that turns production feedback into new test cases.
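A minimal sketch of the feedback-to-test-case step of the evaluation harness: a negatively rated production exchange is reshaped into a regression case. Field names (`rating`, `question`, `answer`) are assumptions, not the system's actual schema.

```python
# Hypothetical sketch: convert a production feedback record into an
# evaluation case. Field names are assumptions about the schema.

def feedback_to_test_case(record):
    """Keep only negatively rated exchanges and reshape them into
    eval cases: the question plus the answer that should improve."""
    if record.get("rating") != "down":
        return None  # positive/neutral feedback does not become a test
    return {
        "question": record["question"],
        "bad_answer": record["answer"],
        "tags": record.get("tags", []),
    }

fb = {
    "question": "Top exports of Chile?",
    "answer": "The top export is bananas.",
    "rating": "down",
}
case = feedback_to_test_case(fb)
```

Each such case can then be replayed against new agent versions to check that the answer has improved.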

Python package

py-ecomplexity

py-ecomplexity is a Python package implementing the Hidalgo-Hausmann economic-complexity methodology, including RCA, RPOP, binary presence, ECI, PCI, proximity, density, COI, COG, and related measures. It has 33,000+ PyPI downloads, 80+ GitHub stars, an MIT license, and is used in the Atlas of Economic Complexity pipeline and by external research groups.
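To make the first two steps concrete, here is a toy sketch of RCA and binary presence (Mcp) on a small export matrix, in plain Python rather than the package's vectorized implementation:

```python
# Toy sketch of the RCA and binary-presence (Mcp) steps of the
# Hidalgo-Hausmann methodology. The package itself works on
# vectorized country-product matrices; this illustrates the formulas.

def rca(exports):
    """exports[c][p] = export value of product p by country c.
    RCA_cp = (X_cp / X_c) / (X_p / X_world)."""
    countries = list(exports)
    products = {p for row in exports.values() for p in row}
    x_c = {c: sum(exports[c].values()) for c in countries}
    x_p = {p: sum(exports[c].get(p, 0) for c in countries) for p in products}
    x_world = sum(x_c.values())
    return {
        c: {p: (exports[c].get(p, 0) / x_c[c]) / (x_p[p] / x_world)
            for p in products}
        for c in countries
    }

def mcp(rca_matrix, threshold=1.0):
    """Binary presence: 1 where a country's RCA meets the threshold."""
    return {c: {p: int(v >= threshold) for p, v in row.items()}
            for c, row in rca_matrix.items()}

exports = {"A": {"cars": 80, "fish": 20}, "B": {"cars": 20, "fish": 80}}
r = rca(exports)   # r["A"]["cars"] == 1.6
m = mcp(r)         # m["A"] == {"cars": 1, "fish": 0}
```

ECI and PCI then follow from iterating on the Mcp matrix.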

Dataset and pipeline

glocal

glocal aggregates 15+ raster layers to GADM administrative levels 0-2 and GHS urban centers, covering nighttime lights, elevation, ruggedness, PM2.5, climate, population, land use, roads, and related spatial layers. The pipeline uses R's exactextractr, Google Earth Engine, Dask, DuckDB, Apache Parquet outputs, and SLURM scheduling on Harvard research computing infrastructure.
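The core zonal-aggregation step can be sketched as follows. This is a toy version in which each raster cell maps to exactly one administrative unit; the real pipeline's exactextractr step uses area-weighted cell fractions at polygon boundaries.

```python
# Toy sketch of zonal aggregation: average raster cell values over
# administrative units. Assumes a precomputed cell -> unit mapping;
# the production pipeline handles partial cell coverage.

def zonal_mean(cell_values, cell_to_unit):
    """cell_values: {cell_id: value}; cell_to_unit: {cell_id: admin_id}."""
    sums, counts = {}, {}
    for cell, value in cell_values.items():
        unit = cell_to_unit[cell]
        sums[unit] = sums.get(unit, 0.0) + value
        counts[unit] = counts.get(unit, 0) + 1
    return {unit: sums[unit] / counts[unit] for unit in sums}

# e.g. nighttime-lights radiance cells assigned to two ADM1 units
lights = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 10.0}
units = {(0, 0): "ADM1-A", (0, 1): "ADM1-A", (1, 0): "ADM1-B"}
means = zonal_mean(lights, units)  # {"ADM1-A": 3.0, "ADM1-B": 10.0}
```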

Research search system

Growth Lab Deep Search

Growth Lab Deep Search is an agentic RAG system for querying the Growth Lab’s unstructured research corpus with citation-grounded synthesis. Its pipeline combines large-scale PDF parsing, token-aware chunking, Qwen3 embeddings, SLURM orchestration on GPU nodes, FastAPI, and a LangGraph search agent for query decomposition, parallel retrieval, relevance grading, and incremental updates on GCP Cloud Run.
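The token-aware chunking step can be sketched as a sliding window with overlap. Whitespace splitting stands in for the real tokenizer here (an assumption; in practice chunk sizes would be measured against the embedding model's tokenizer and context limit).

```python
# Sketch of token-aware chunking with overlap. Whitespace tokens are a
# stand-in for a real tokenizer; max_tokens and overlap are illustrative.

def chunk_tokens(text, max_tokens=512, overlap=64):
    """Split text into chunks of at most max_tokens tokens, with a
    fixed token overlap so context survives chunk boundaries."""
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # step back to create the overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
parts = chunk_tokens(doc, max_tokens=400, overlap=50)
len(parts)  # 3 chunks: [0,400), [350,750), [700,1000)
```

Each chunk is then embedded independently; the overlap keeps sentences that straddle a boundary retrievable from at least one chunk.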

Embeddings package

econ-embeddings

econ-embeddings provides pre-computed semantic embeddings for four economic classification systems: HS products, IPC4 patents, NAICS industries, and OpenAlex scientific concepts. It exposes a one-line concord() API for top-K matches across domain pairs, with a generation pipeline that enriches classification labels using LLM-written descriptions and embeds the resulting text with Gemini embeddings.
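A hedged sketch of what a concord()-style top-K lookup amounts to: cosine similarity between a source code's embedding and every candidate in the target system. The vectors and codes below are toy values, not the package's actual embeddings or API.

```python
# Illustrative top-K matching over embeddings (toy 2-d vectors; the
# package's real embeddings come from Gemini over LLM-enriched labels).
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def concord(source_vec, target_vecs, k=2):
    """Return the k target codes most similar to the source embedding."""
    scored = sorted(target_vecs.items(),
                    key=lambda kv: cosine(source_vec, kv[1]),
                    reverse=True)
    return [code for code, _ in scored[:k]]

hs_car = [1.0, 0.1]                       # hypothetical HS "cars" vector
naics = {"3361": [0.9, 0.2],              # motor vehicle manufacturing
         "1111": [0.1, 1.0],              # oilseed and grain farming
         "3399": [0.7, 0.5]}              # other misc. manufacturing
concord(hs_car, naics, k=2)  # ["3361", "3399"]
```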