From DATASUS FTP to your SQL prompt.
Fifteen stages, every transformation justified, with before-and-after examples. Read this once and you'll know exactly what the pipeline changed about the raw data — and what it didn't touch.
- 01
Origin: the DATASUS public FTP
Brazilian public-health microdata is published by the Ministry of Health on a public, anonymous FTP server: ftp.datasus.gov.br. Each subsystem (SIHSUS = hospital admissions, SIM = mortality) keeps its monthly or yearly files in a fixed directory. The pipeline does not scrape — it lists the directory, picks files matching a pattern, and downloads only what falls inside the user-specified date range.
- 02
Filename parsing
Filenames encode UF (state) and period. SIHSUS uses RDUFYYMM.dbc — for example RDSP2401.dbc means São Paulo, January 2024. SIM is yearly and bifurcated by ICD revision: DOSP2023.dbc is CID-10 (1996+, 8-char stem); DORSP80.dbc is CID-9 (1979–1995, 7-char stem). The parser disambiguates by stem length, not by prefix — checking "starts with DOR" first would silently drop CID-10 deaths from RJ, RN, RO, RR, and RS.
- 03
Decompression: DBC → DBF
DBC is a DATASUS-proprietary compression layered over the venerable DBF table format. Historically you needed Windows-only tools (TabWin) to decompress it. The pipeline uses the pure-Python datasus_dbc library so the whole chain is cross-platform.
- 04
Loading: DBF → DuckDB staging
Each DBF is streamed into a fresh in-memory DuckDB connection as a staging table. Two columns are added at this point: source_file (for audit trails) and uf (extracted from the filename). String columns are stripped of invisible junk (tabs, newlines, NULs) and trimmed.
- 05
Cleaning: invisible characters and empty strings
Raw DBF rows often contain whitespace and control characters that look fine in a text editor but break joins and uniqueness checks downstream.
Before After "SP\t\n\r" "SP" "" (empty string) NULL " 04 " "04" - 06
Type conversions: TRY_CAST
Every numeric and date column goes through TRY_CAST so a single malformed row doesn't abort the whole import — invalid values become NULL and a warning is logged. Integer columns include age, length-of-stay (days), counts. Float columns include monetary fields like VAL_TOT.
- 07
Date parsing — five formats, ordered
Dates in DATASUS files are not consistent. The parser tries five formats in order: YYYYMMDD, DDMMYYYY, DMMYYYY (a 7-digit format unique to SIM, where the day has no leading zero), YYYY-MM-DD, and as a last resort a generic cast.
Subsystem Raw Parsed SIHSUS "20200131" 2020-01-31 SIM "1012023" 2023-01-01 (DMMYYYY) Either "2020-01-31" 2020-01-31 - 08
Sex normalisation
Each subsystem encodes sex differently. The pipeline collapses both into a common M/F/I (indeterminate) code so cross-subsystem queries don't have to special-case each table.
Subsystem Raw Normalised SIHSUS 0 / 1 / 3 I / M / F SIM 1 / 2 M / F SIM M / F (already textual) M / F - 09
Race and colour (RACACOR)
RACACOR carries the same five IBGE categories in both subsystems but with different numeric codes. The pipeline maps them to the human-readable Portuguese label.
Subsystem Code Label SIHSUS "01" "Branca" SIHSUS "02" "Preta" SIM "1" "Branca" SIM "2" "Preta" Both "03/3" "Parda" - 10
Cause-of-death arrays (SIM)
SIM stores chains of CID-10 codes for each line of the death certificate (linhaa, linhab, linhac, linhad, causabas, linhaii) as star-delimited strings. The pipeline parses them into typed VARCHAR[] arrays so SQL can use UNNEST and array operators directly.
Raw Array "*A01*J128" ['A01', 'J128'] "*B342" ['B342'] "" or "*" [] (empty) - 11
Encoded age (SIM)
SIM packs the unit and the value of age into a single 3-digit field, where the leading digit is the unit (1=minutes, 2=hours, 3=months, 4=years, 5=>100 years). The pipeline decodes it into idade_valor (number) and idade_unidade (text).
Raw Decoded value Decoded unit "403" 3 anos "512" 12 >100 anos (i.e., 112) "307" 7 meses - 12
Geographic enrichment (IBGE join)
Both subsystems store a 7-digit IBGE municipality code (munic_res in SIHSUS, codmunres in SIM) but no human-readable city name. The pipeline LEFT JOINs against a built-in IBGE reference table to add the city name, the (already-known) state, and the macro-region.
Before After munic_res = 3550308 municipio_res = "São Paulo", uf = "SP", regiao = "Sudeste" munic_res = 3304557 municipio_res = "Rio de Janeiro", uf = "RJ", regiao = "Sudeste" - 13
Storage: partitioned Parquet
The transformed table is written to {data_dir}/datasus_db/{subsystem}/uf={UF}/{filename}.parquet using Hive-style partitioning. Partitioning by UF means a query that filters on a single state reads only that state's files. ZSTD compression keeps the disk footprint small without hurting read speed.
- 14
Query interface: DuckDB VIEW
The Parquet files are exposed as a single virtual table per subsystem (sihsus, sim). The view does a partition-aware glob (read_parquet('datasus_db/sihsus/uf=*/**.parquet')) so scans automatically prune partitions when the WHERE clause filters by UF or by year.
- 15
Custom queries — your turn
From the Query page you can write any SQL you like against the cleaned, enriched, partitioned data. Built-in column histograms (the # button next to each column) one-tap a GROUP BY query that includes the % population for each value, so distributions are visible without typing.