
Do ATIP/Request on Census of Population, Census of Agriculture, and tables data best practices #41

@diegoripley

Description


Sent to me by someone who knows more about Census data limitations.

What you’re running into is a very well-known limitation of public census products. What you have (the 2,239 variables at Dissemination Area level from the Statistics Canada 2021 Census) are marginal aggregates, not microdata cross-tabs.

That means you only know:

  • Count(A)
  • Count(B)

but not:

  • Count(A ∧ B)

So a SQL query like:

WHERE low_income_rate > 20
AND visible_minority_rate > 30

does not mean those same people meet both conditions.

You are exactly right about the double counting / ecological fallacy problem.
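That said, the marginals are not useless: they pin the joint count between hard limits (the Fréchet bounds). A minimal sketch, with made-up DA counts:

```python
def frechet_bounds(count_a: int, count_b: int, total: int) -> tuple[int, int]:
    """Bounds on Count(A AND B) given only the two marginal counts
    in a population of `total` people."""
    lower = max(0, count_a + count_b - total)
    upper = min(count_a, count_b)
    return lower, upper

# Hypothetical DA with 500 residents: 150 low-income, 200 visible minority.
print(frechet_bounds(150, 200, 500))  # -> (0, 150)
```

When the two marginal rates are both high relative to the DA population, the lower bound rises above zero and the overlap is guaranteed; otherwise the joint count is genuinely unidentified.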

Below are the actual ways analysts solve this.


1. Public Use Microdata Files (PUMF)

The closest public solution is the microdata sample.

You would use:

  • 2021 Census Public Use Microdata File (PUMF)

This contains individual-level records where you can legitimately run:

SELECT COUNT(*)
FROM census_microdata
WHERE income < threshold
AND immigrant_status = 'recent'

But there are two major limitations:

1️⃣ Geography is coarse
Usually province or CMA, not DA.

2️⃣ Sample only (~2–3% of the population)

So you lose fine spatial resolution.
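One practical detail: PUMF records carry survey weights, so a raw row count understates the population estimate. A sketch with hypothetical column names (the actual 2021 PUMF variable names differ):

```python
# Hypothetical PUMF-style records; "weight", "income", "imm_status"
# are illustrative names, not the real 2021 PUMF variables.
records = [
    {"weight": 36.0, "income": 18000, "imm_status": "recent"},
    {"weight": 41.5, "income": 52000, "imm_status": "non_immigrant"},
    {"weight": 38.2, "income": 21000, "imm_status": "recent"},
]

THRESHOLD = 25000

# Population estimate = sum of weights over matching records,
# not COUNT(*) over rows.
estimate = sum(r["weight"] for r in records
               if r["income"] < THRESHOLD and r["imm_status"] == "recent")
print(round(estimate, 1))  # -> 74.2
```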


2. Statistics Canada Research Data Centres (RDC)

The correct answer for DA-level cross-tabulations is a secure environment.

You would use:

  • Statistics Canada Research Data Centres

Inside RDC you can access:

  • full census microdata
  • NHS / census individual responses
  • linked administrative datasets

Then you can generate custom tabulations.

Example:

DA x immigrant_status x income_quintile x education

Output rules

Cells are suppressed if:

  • counts fall below the minimum threshold
  • residual disclosure risk remains

But DA-level tables are sometimes allowed depending on variables.


3. Custom Tabulations from Statistics Canada

If you don't have RDC access, you can request:

  • Statistics Canada Data Liberation Initiative
  • Custom tabulations through Statistics Canada Client Services

You request something like:

Geography: Dissemination Area
Variables:
- Visible minority
- Education
- Income decile

They produce a cross-tab dataset.

Downsides

  • Expensive
  • Takes weeks/months
  • Suppression rules apply

4. Synthetic Population Reconstruction (What many researchers do)

This is the most common workaround when you only have DA aggregates.

Method:

1️⃣ Use DA marginal totals

Example:

| Variable   | Count |
|------------|-------|
| immigrants | 150   |
| low income | 120   |
| university | 90    |

2️⃣ Use microdata sample (PUMF) to estimate joint distributions

3️⃣ Run Iterative Proportional Fitting (IPF) or raking

Result:

A synthetic population that statistically matches all DA marginals.

Then you can run:

  • AND / OR queries
  • multivariate regression
  • agent-based models

This is widely used in:

  • urban modelling
  • epidemiology
  • transportation planning

Tools:

  • ipfn (Python)
  • synthpop
  • popgen
  • simPop
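The core of step 3️⃣ is small enough to sketch directly. Below is a minimal two-way IPF in NumPy: the seed joint table stands in for a PUMF-estimated joint distribution, and the targets stand in for one DA's marginal totals (all numbers are made up):

```python
import numpy as np

def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Iterative proportional fitting: rescale a seed joint table until
    its row and column sums match the marginal targets."""
    table = seed.astype(float).copy()
    for _ in range(iters):
        table *= (row_targets / table.sum(axis=1))[:, None]   # fit row marginals
        table *= (col_targets / table.sum(axis=0))[None, :]   # fit column marginals
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Seed joint distribution (immigrant x low-income), e.g. from PUMF.
seed = np.array([[50.0, 30.0],
                 [40.0, 80.0]])

# DA marginal totals (rows: immigrant / non-immigrant,
# cols: low-income / not). Targets must share a grand total.
fitted = ipf(seed,
             row_targets=np.array([150.0, 120.0]),
             col_targets=np.array([130.0, 140.0]))
print(fitted.round(1))
```

The fitted table matches every DA marginal while preserving the association structure of the seed; real pipelines add more dimensions and then draw synthetic individuals from the fitted cells.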

5. Spatial probabilistic estimation (small area estimation)

Another method is:

Bayesian hierarchical modelling

Example:

Estimate:

P(low_income AND immigrant | DA)

using:

  • DA marginals
  • census microdata
  • neighbouring DA covariates

Often done with:

  • Stan
  • PyMC
  • INLA

Used in:

  • poverty mapping
  • health inequality studies
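A full hierarchical model belongs in Stan, PyMC, or INLA, but the core idea of borrowing strength across DAs can be shown with a toy empirical-Bayes stand-in: small DAs get pulled toward the pooled rate, large DAs keep their raw rate. All numbers and the prior strength are illustrative:

```python
import numpy as np

# Illustrative DA-level data: successes (e.g. low-income persons) out of totals.
successes = np.array([12, 3, 45, 7, 20])
totals = np.array([60, 15, 180, 80, 100])

# Moment-matched Beta(a, b) prior centred on the pooled rate;
# "strength" is a pseudo-count controlling how hard small DAs shrink.
pooled = successes.sum() / totals.sum()
strength = 50.0
a, b = pooled * strength, (1 - pooled) * strength

raw = successes / totals                       # unstable for small DAs
shrunk = (successes + a) / (totals + a + b)    # beta-binomial posterior mean
print(np.round(raw, 3))
print(np.round(shrunk, 3))
```

Each shrunk estimate is a weighted average of the DA's raw rate and the pooled rate, weighted by the DA's sample size; the hierarchical versions additionally learn the prior (and spatial structure) from the data.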

6. Data sources outside census

Some Canadian datasets contain joint variables, but geography differs.

Examples:

| Dataset | Strength | Weakness |
|---|---|---|
| Canadian Community Health Survey | rich variables | not DA-level |
| Labour Force Survey | employment detail | not small-area |
| Taxfiler data | income detail | restricted |

7. What many GIS analysts mistakenly do

A very common but incorrect practice:

DA where:
immigrant_rate > 20%
AND
poverty_rate > 20%

This identifies areas where both rates are high, not individuals who meet both conditions.

This is fine for:

  • spatial targeting
  • policy screening

But not causal inference.


8. The correct conceptual model

Think of your DA dataset as:

Marginal distributions

You need:

Joint distributions

Ways to obtain them:

  • Microdata (PUMF)
  • Synthetic reconstruction
  • Custom tabulations
  • Secure research environments (RDC)

9. If I were designing your pipeline (given your GIS + intelligence background)

I would build a 3-layer model:

Layer 1

DA aggregates (GeoParquet)

Layer 2

Census PUMF microdata

Layer 3

Synthetic population generator

Output:

synthetic_individuals.parquet

Fields:

DAUID
age
sex
immigrant_status
income_decile
education
household_size
visible_minority

Then your queries become legitimate:

SELECT *
FROM synthetic_population
WHERE immigrant_status = 'immigrant'
AND income_decile <= 2
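That query pattern can be exercised end to end. A minimal sketch using sqlite3 as a stand-in for DuckDB over `synthetic_individuals.parquet` (the DAUIDs and rows are made up; column names follow the field list above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE synthetic_population (
    DAUID TEXT, immigrant_status TEXT, income_decile INTEGER)""")
conn.executemany(
    "INSERT INTO synthetic_population VALUES (?, ?, ?)",
    [("35200001", "immigrant", 2),
     ("35200001", "non_immigrant", 7),
     ("35200002", "immigrant", 9),
     ("35200002", "immigrant", 1)],
)

# Legitimate AND query: these are individual-level records, so both
# conditions apply to the same synthetic person.
rows = conn.execute("""
    SELECT DAUID, COUNT(*)
    FROM synthetic_population
    WHERE immigrant_status = 'immigrant' AND income_decile <= 2
    GROUP BY DAUID
    ORDER BY DAUID""").fetchall()
print(rows)  # -> [('35200001', 1), ('35200002', 1)]
```

With DuckDB the same SQL runs directly against the Parquet file, and the GROUP BY gives you DA-level joint counts rather than the ecological overlay from section 7.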

10. Why this matters for the kinds of projects you're doing

From your previous geospatial and policy analytics work (federal grants, hotspot analysis, etc.), this lets you correctly model:

  • policy targeting
  • inequality intersections
  • voting geography correlations
  • service accessibility

Without falling into ecological inference errors.


Short answer

You cannot run true AND queries with the public DA dataset because it contains only marginal counts.

You need one of:

1️⃣ Census microdata (PUMF)
2️⃣ Research Data Centre access
3️⃣ Custom tabulations from Statistics Canada
4️⃣ Synthetic population reconstruction


If you want, I can also show you something very powerful that almost no Canadian analysts exploit:

How to reconstruct a full synthetic population for every DA in Canada using:

  • Census DA tables
  • PUMF
  • iterative proportional fitting
  • GeoParquet + DuckDB

This produces 30 million synthetic individuals geographically assigned to DAs, allowing unlimited AND / OR spatial queries.
