# Normalization Rules

Version: 1.0.0-public  
Date: 2026-03-19  
License: jp-election-data License v1.0

---

## 1. Purpose

This document publishes normalization rules derived from public information:

- official election commission labels
- public municipality and prefecture names
- public block labels
- public Gazetteer-based romanization references

These rules are intentionally separated from internal pipeline orchestration.
They are published because they support reproducibility, independent verification,
and downstream research use.

---

## 2. Boundary

Included here:

- public-information-based string normalization
- alias normalization used for join safety
- aggregate-row detection rules based on published labels
- romanization source priority and fallback policy

Not included here:

- source acquisition heuristics
- retry / handoff / escalation rules
- OCR-specific operational workarounds
- internal agent workflow logic

---

## 3. Canonical Key Principle

Japanese canonical identifiers remain the source of truth.

- prefecture-level key: `prefecture_code` (JIS X 0401)
- municipality-level key: `jis_code` (JIS X 0402)

Romanization and English-facing labels are presentation-layer derivatives.
They must not replace canonical Japanese keys.

---

## 4. Baseline String Normalization

The baseline normalization used for alias resolution is:

1. apply Unicode `NFKC` normalization
2. convert NBSP (U+00A0) and ideographic spaces (U+3000) to ASCII space; `NFKC` already maps both, so this step is an explicit safeguard
3. trim leading/trailing whitespace
4. collapse repeated whitespace to a single ASCII space

Reference pattern for step 4:

```text
\s+ -> " "
```

The compact lookup form additionally removes all spaces:

```text
" " -> ""
```

This compact form is used only as a fallback key for public-name matching.
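
The four steps and the compact form can be sketched in Python as follows. This is a minimal illustration under the assumption that labels arrive as `str`; the function names `normalize_label` and `compact_key` are hypothetical and are not part of the published assets.

```python
import re
import unicodedata

# Step 4 reference pattern: any run of whitespace becomes one ASCII space.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_label(text: str) -> str:
    """Baseline normalization (hypothetical helper name)."""
    # Step 1: NFKC folds full-width letters/digits and compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    # Step 2: explicit safeguard; NFKC already maps U+00A0 and U+3000 to U+0020.
    text = text.replace("\u00a0", " ").replace("\u3000", " ")
    # Step 3: trim leading/trailing whitespace.
    text = text.strip()
    # Step 4: collapse repeated whitespace to a single ASCII space.
    return _WHITESPACE_RUN.sub(" ", text)


def compact_key(text: str) -> str:
    """Fallback lookup key: the normalized form with all spaces removed."""
    return normalize_label(text).replace(" ", "")


if __name__ == "__main__":
    raw = "\u3000札幌市\u3000 中央区\u00a0"
    print(repr(normalize_label(raw)))
    print(repr(compact_key(raw)))
```

A lookup would typically try the normalized form first and fall back to the compact key only when that fails, in line with the fallback-only role described above.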

---

## 5. Public Aggregate-Row Detection

The following row classes are treated as aggregate or non-leaf labels when
they appear in official published tables.

### 5.1 Generic aggregate suffix

```regex
.*(合計|総計|小計|計)$
```

### 5.2 Broad regional aggregate suffixes

These are treated as non-municipality summary rows:

```regex
.*(支庁|振興局|総合振興局)$
```

### 5.3 Explicit published aggregate labels

These labels are treated as aggregate rows when they appear in turnout/result
tables:

- `指定都市計`
- `その他の市計`
- `町村計`

### 5.4 Parenthetical note rows

Rows that are only a parenthetical note are treated as non-data rows:

```regex
^[（(].*[）)]$
```

### 5.5 Numeric-only rows

Rows that contain digits only are treated as non-name rows:

```regex
^\d+$
```
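
The rules in 5.1 through 5.5 can be combined into a single classifier. The sketch below is illustrative only: the function name `classify_row_label`, the returned class names, and the order in which the rules are checked are assumptions, not part of the published rules. The regular expressions are copied unchanged from the subsections above.

```python
import re
from typing import Optional

# Patterns copied from sections 5.1, 5.2, 5.4 and 5.5.
GENERIC_AGGREGATE = re.compile(r".*(合計|総計|小計|計)$")
REGIONAL_AGGREGATE = re.compile(r".*(支庁|振興局|総合振興局)$")
PARENTHETICAL_NOTE = re.compile(r"^[（(].*[）)]$")
NUMERIC_ONLY = re.compile(r"^\d+$")

# Explicit labels from section 5.3.
EXPLICIT_AGGREGATES = frozenset({"指定都市計", "その他の市計", "町村計"})


def classify_row_label(label: str) -> Optional[str]:
    """Return a non-leaf row class, or None for an ordinary name row.

    The class names and the check order are assumptions of this sketch.
    """
    if label in EXPLICIT_AGGREGATES:
        return "explicit_aggregate"
    if NUMERIC_ONLY.match(label):
        return "numeric_only"
    if PARENTHETICAL_NOTE.match(label):
        return "parenthetical_note"
    if REGIONAL_AGGREGATE.match(label):
        return "regional_aggregate"
    if GENERIC_AGGREGATE.match(label):
        return "generic_aggregate"
    return None


if __name__ == "__main__":
    for sample in ["町村計", "市部計", "石狩振興局", "（再掲）", "123", "札幌市"]:
        print(sample, classify_row_label(sample))
```

The explicit labels are checked first so that they keep their own class even though they also end in `計`.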

---

## 6. Scope and Region Normalization

Public shorthand labels may be normalized, but the official election label and
the project execution label are not always identical.

Examples:

- project (ops) region canonical label: `中国地方`
- accepted shorthand alias: `中国`
- official election metadata may still retain `pr_block=中国`

Meaning:

- region-layer normalization is allowed for public documentation and lookup
- official election labels are preserved where they are part of the source truth
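
One way to honor both points is to resolve the alias into a separate derived field and leave the source field untouched. The sketch below assumes a hypothetical in-memory alias table and a hypothetical output field `region_label`; neither reflects the actual layout of the published alias files.

```python
from typing import Dict

# Hypothetical in-memory alias table: shorthand -> canonical region label.
# The layout of the published alias CSVs is not assumed here.
REGION_ALIAS: Dict[str, str] = {"中国": "中国地方"}


def resolve_region(label: str) -> str:
    """Map an accepted shorthand to its canonical region label.

    Labels without a known alias are returned unchanged.
    """
    return REGION_ALIAS.get(label, label)


def with_region_label(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with a derived `region_label` (hypothetical field).

    The official `pr_block` value is preserved exactly as published.
    """
    enriched = dict(record)
    enriched["region_label"] = resolve_region(record["pr_block"])
    return enriched


if __name__ == "__main__":
    print(with_region_label({"pr_block": "中国"}))
```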

---

## 7. Romanization Policy

Romanization is a derived value, anchored to `jis_code`.

Source priority:

1. `gazetteer_japan_2021.csv`
2. `municipality_romanization_overrides.csv`
3. blank / unavailable

Published datasets may carry the following fields:

- `municipality_name_romanized_official`
- `municipality_name_romanized_effective`
- `romanization_status`
- `romanization_source`

Interpretation of status values:

- `official`: direct Gazetteer match
- `fallback_reviewed`: override with explicit review
- `fallback_provisional`: override exists but should still be treated as auxiliary
- `unmatched` / `unavailable`: no reliable romanized value

These romanized values are informational only and must not be used as the primary join key.
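
The source priority and status values can be sketched as a lookup keyed by `jis_code`. Several things here are assumptions made for illustration: the two CSVs are represented as already-loaded dictionaries, the override carries a `reviewed` flag, the function and type names are hypothetical, and the codes and names in the sample data are placeholders rather than real Gazetteer entries. The sketch uses `unavailable` for the no-value case and does not try to distinguish it from `unmatched`.

```python
from typing import Dict, NamedTuple


class Override(NamedTuple):
    """Hypothetical shape of one override entry."""
    name: str
    reviewed: bool


def resolve_romanization(
    jis_code: str,
    gazetteer: Dict[str, str],
    overrides: Dict[str, Override],
) -> Dict[str, str]:
    """Apply the priority: Gazetteer, then overrides, then blank."""
    official = gazetteer.get(jis_code, "")
    if official:
        return {
            "municipality_name_romanized_official": official,
            "municipality_name_romanized_effective": official,
            "romanization_status": "official",
            "romanization_source": "gazetteer_japan_2021.csv",
        }
    override = overrides.get(jis_code)
    if override is not None and override.name:
        status = "fallback_reviewed" if override.reviewed else "fallback_provisional"
        return {
            "municipality_name_romanized_official": "",
            "municipality_name_romanized_effective": override.name,
            "romanization_status": status,
            "romanization_source": "municipality_romanization_overrides.csv",
        }
    return {
        "municipality_name_romanized_official": "",
        "municipality_name_romanized_effective": "",
        "romanization_status": "unavailable",
        "romanization_source": "",
    }


if __name__ == "__main__":
    # Placeholder codes and names, not real data.
    gazetteer = {"99901": "Example-shi"}
    overrides = {"99902": Override("Sample-cho", reviewed=False)}
    for code in ["99901", "99902", "99903"]:
        print(code, resolve_romanization(code, gazetteer, overrides))
```

The result is always looked up by `jis_code` and never the other way around, so the romanized value stays a presentation-layer attribute.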

---

## 8. Recommended Public Publication

The following assets are recommended for publication because they are public-info
normalization infrastructure rather than internal pipeline mechanics:

- `area_alias.csv`
- `region_block_alias.csv`
- `scope_alias.csv`
- `gazetteer_japan_2021.csv`
- `municipality_romanization_overrides.csv`
- this `normalization-rules.md`

---

## 9. Research Use Note

These rules are published to support:

- reproducible parsing of Japanese election labels
- independent verification of public structured facts
- English-facing presentation layers anchored by JIS codes
- future academic or civic reuse of normalization logic derived from public sources

They should be treated as public normalization assets, not as a general-purpose
data automation framework.
