The 6 Types of PII Hiding in Your Business Documents That No Scanner Is Catching
Six PII categories routinely appear in Canadian business documents and get missed by manual review. Most basic scanners catch names and emails. These six require more sophisticated detection.
By Canuckt AI Team
Why Basic Scanning Misses the Hard Cases
Most PII scanning discussions focus on the obvious: names, email addresses, phone numbers. These are real PII and worth detecting — but they're also the categories that even basic tools handle reasonably well. The exposure that creates real compliance risk for Canadian businesses comes from the categories that require contextual understanding, Canadian-specific pattern recognition, or detection across non-standard document formats.
Here are the six PII types that consistently appear in Canadian business documents and consistently get missed.
1. Social Insurance Numbers in Non-Standard Formats
A Canadian SIN is nine digits, typically written as three groups of three separated by spaces (123 456 789) or hyphens. Simple pattern matching catches this format.
What it doesn't catch: SINs embedded in longer strings of numbers, SINs written as nine consecutive digits without separators, SINs in PDF forms where the field label is on a different line than the value, SINs in scanned documents where OCR has introduced character errors that break the pattern, and SINs that appear in columns of data without field labels nearby.
In practice, a significant portion of SINs in real business documents fall outside the standard format, particularly in older records, in documents that have been photocopied and rescanned multiple times, and in exported data from legacy systems that strip formatting. Canadian-specific detection needs to account for these variations and for the validation logic, the Luhn check digit built into every SIN, that distinguishes valid SINs from random nine-digit strings.
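As a rough illustration of format-tolerant matching plus validation, the sketch below accepts SINs written with spaces, with hyphens, or with no separators, then filters candidates with the Luhn checksum that SINs use. The function names are hypothetical. It deliberately rejects nine-digit runs embedded in longer numbers and makes no attempt to repair OCR errors, so it covers only part of the problem described above.

```python
import re

# Nine digits, optionally split 3-3-3 by a single space or hyphen.
# The lookarounds reject matches that sit inside a longer run of digits.
SIN_PATTERN = re.compile(r"(?<!\d)\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_sin_candidates(text: str) -> list[str]:
    """Return normalized nine-digit strings that match the pattern and pass Luhn."""
    found = []
    for match in SIN_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


if __name__ == "__main__":
    sample = "Employee SIN 046-454-286, invoice ref 123 456 789."
    print(find_sin_candidates(sample))
```

Passing the checksum does not prove a number is a SIN, since roughly one in ten random nine-digit strings also passes, so nearby labels and document context still matter.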
2. Quasi-Identifiers in Combination
A quasi-identifier is a piece of information that isn't personally identifiable on its own but becomes identifying in combination with other data. A birth year alone identifies nothing. A birth year plus a postal code plus a gender narrows a population down to very few individuals. A birth year plus a postal code plus a job title in a small organization may identify a specific person uniquely.
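Latanya Sweeney's widely cited re-identification research estimated that 87% of Americans can be uniquely identified using only their full date of birth, gender, and five-digit ZIP code. A full six-character Canadian postal code is typically far more granular than a ZIP code, often covering a single block face or a single large building, so the same combination can be even more identifying in Canada. The three-character FSA (forward sortation area) on its own is coarser than that, but it still narrows a population quickly once it is combined with other attributes.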
Business documents routinely contain combinations of quasi-identifiers without any single element being obviously sensitive: HR reports with age ranges and departments, customer analytics exports with regional and demographic breakdowns, market research with demographic cross-tabulations. Basic scanners look for sensitive field values, not for identifying combinations.
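One way to look for identifying combinations rather than sensitive values is a k-anonymity-style count: group records by their quasi-identifier values and flag any combination shared by fewer than k records. The article does not name this technique. The sketch below, with its `small_groups` function, field names, and default threshold of 5, is an illustrative assumption rather than a prescribed method.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    # Fictional rows, as might appear in an HR or analytics export.
    rows = [
        {"birth_year": 1980, "fsa": "M5V", "gender": "F"},
        {"birth_year": 1980, "fsa": "M5V", "gender": "F"},
        {"birth_year": 1975, "fsa": "K1A", "gender": "M"},
    ]
    print(small_groups(rows, ["birth_year", "fsa", "gender"], k=2))
```

Any combination returned here describes a group small enough that its members may be re-identifiable, even though no single column looks sensitive.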
3. Health Card Numbers and Provincial Identifiers
Ontario health card numbers (ten digits), British Columbia Personal Health Numbers (ten digits), and Alberta Personal Health Numbers (nine digits) are all sensitive personal information under provincial health information protection laws. They appear in insurance claim processing, in employment records where employers pay group benefits, in disability accommodation documentation, and in any context where an employee's medical treatment is relevant to their employment.
Unlike SINs, provincial health numbers don't have a uniform national format. A scanner built to detect Ontario health cards won't detect BC PHNs without specific provincial logic. Organizations operating across multiple provinces — or processing insurance claims for employees in multiple provinces — need multi-province health identifier detection.
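The sketch below shows why per-province logic matters, using only the digit counts given above. The `find_health_numbers` function, the length table, and the separator handling are assumptions for illustration. Real detection would also need label context or check-digit rules, because length alone cannot tell an Ontario number from a BC one, or an Alberta number from a SIN.

```python
import re

# Digit counts taken from the article; everything else here is illustrative.
HEALTH_NUMBER_LENGTHS = {"ON": 10, "BC": 10, "AB": 9}

# Nine or ten digits, allowing a single space or hyphen between digits.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){8,9}(?!\d)")


def find_health_numbers(text, provinces):
    """Return (province, digits) pairs whose length fits a configured province."""
    results = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        for province in provinces:
            if HEALTH_NUMBER_LENGTHS.get(province) == len(digits):
                results.append((province, digits))
    return results


if __name__ == "__main__":
    text = "ON card 1234-567-890; AB PHN 12345-6789"
    print(find_health_numbers(text, ["ON"]))        # Ontario-only configuration
    print(find_health_numbers(text, ["ON", "AB"]))  # multi-province configuration
```

With only Ontario configured, the nine-digit Alberta number is never reported, which is the single-province gap the paragraph above describes.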
4. Financial Account Details in Correspondence
Banking information appears in more document types than organizations typically realize. Direct deposit setup forms obviously contain it. But it also appears in: accounts payable correspondence with bank transfer details, supplier onboarding documents where banking information is collected for payment processing, legal correspondence about financial settlements that includes account details for funds transfer, and employee expense forms that include reimbursement account information.
The specific combination of institution number, transit number, and account number — the Canadian banking coordinates — is what you're looking for. This combination appears in multiple formats (MICR encoding on cheque images, typed in correspondence, in form fields), and the three-part structure requires multi-element detection rather than simple pattern matching.
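A minimal sketch of multi-element detection follows: it reports a hit only when labelled institution, transit, and account values all appear close together. Institution numbers are three digits and transit numbers are five. The 7 to 12 digit account range, the label wording, the 200-character window, and the function name are assumptions made for illustration. Unlabelled MICR lines on cheque images would need separate handling.

```python
import re

# Institution = 3 digits, transit = 5 digits. The account length range and
# the label keywords are illustrative assumptions, not a banking standard.
FIELD_PATTERNS = {
    "institution": re.compile(r"institution\D{0,20}?(\d{3})(?!\d)", re.IGNORECASE),
    "transit": re.compile(r"transit\D{0,20}?(\d{5})(?!\d)", re.IGNORECASE),
    "account": re.compile(r"account\D{0,20}?(\d{7,12})(?!\d)", re.IGNORECASE),
}


def find_banking_coordinates(text: str, window: int = 200):
    """Return the three fields if all are found within `window` characters, else None."""
    matches = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)  # first occurrence only, to keep the sketch short
        if match is None:
            return None
        matches[name] = match
    starts = [m.start() for m in matches.values()]
    if max(starts) - min(starts) > window:
        return None
    return {name: m.group(1) for name, m in matches.items()}


if __name__ == "__main__":
    letter = "Please remit to institution no. 003, transit 12345, account number 1234567."
    print(find_banking_coordinates(letter))
```

Requiring all three elements keeps ordinary mentions of the word "account" from triggering a match on their own.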
5. Personal Information in Document Metadata
This is the category that surprises organizations most when they discover it. Word documents, PDFs, and spreadsheets carry metadata that their creators don't see and most recipients don't check: author name, company name, revision history including tracked changes that were "accepted," comments that were deleted but remain in the file, creation and modification timestamps, and in some cases previous versions of the document content.
A contract shared externally with metadata intact can reveal the negotiating history — comments like "client is pushing back on this clause, we can go to 60 days" that were added by a lawyer and deleted before sending but preserved in the file metadata. A report shared with a client may carry the full revision history showing earlier drafts with internal assessments that weren't intended for the client.
From a PII standpoint, metadata can contain personal information about people who had nothing to do with the final document: reviewers who commented, authors of earlier drafts, administrative staff who prepared the document. PIPEDA does not name metadata specifically, but its safeguards principle applies to personal information wherever it sits in a file, so stripping metadata before external transmission is a reasonable reading of that obligation, and one that most organizations aren't meeting.
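To make the metadata point concrete, here is a hedged sketch that reads two places where names commonly sit inside a .docx file, which is a ZIP archive of XML parts: the core properties (`docProps/core.xml`) and comment authors (`word/comments.xml`). The function name is hypothetical, and the sketch does not cover tracked changes, custom properties, or PDF and spreadsheet metadata. For untrusted files, a hardened XML parser would be a safer choice than the standard library one used here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard Office Open XML namespaces for the parts inspected below.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}


def docx_metadata_names(source) -> set[str]:
    """Collect names from core properties and comment authors in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    names = set()
    with zipfile.ZipFile(source) as archive:
        members = set(archive.namelist())
        if "docProps/core.xml" in members:
            root = ET.fromstring(archive.read("docProps/core.xml"))
            for tag in ("dc:creator", "cp:lastModifiedBy"):
                element = root.find(tag, NS)
                if element is not None and element.text:
                    names.add(element.text)
        if "word/comments.xml" in members:
            root = ET.fromstring(archive.read("word/comments.xml"))
            for comment in root.iterfind("w:comment", NS):
                author = comment.get(f"{{{NS['w']}}}author")
                if author:
                    names.add(author)
    return names
```

Running a check like this over outbound documents gives a quick inventory of whose names would leave the organization along with the visible text.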
6. Third-Party Personal Information in Client Documents
This one creates liability that organizations don't know they have. When a client brings you a document — a contract, a business record, a correspondence file — that document may contain personal information about people who have no relationship with your organization and who certainly haven't consented to their information being processed by you.
A commercial lease brought by a client for legal review might contain personal guarantees with the guarantors' home addresses and SINs. A supplier contract might contain the names and contact details of the other party's employees. A financial statement might contain shareholder information including home addresses. A court file might contain personal information about witnesses, opposing parties, or third parties.
Your organization received this information incidentally, in the course of serving your client. PIPEDA still applies to it. The information must be protected appropriately, retained only as long as necessary, and handled in accordance with the principle of limiting use, disclosure, and retention, which means not using it for purposes beyond what the client engagement requires. Organizations that never inventory third-party PII in their client files typically discover it for the first time during an investigation by the Office of the Privacy Commissioner of Canada (OPC), not before.
The common thread across all six: these categories require detection that goes beyond pattern matching on obvious field values. They require contextual understanding, Canadian-specific logic, combination analysis, and the ability to inspect document structure beyond the visible text. That's the gap between basic PII scanning and the detection that Canadian compliance actually requires.
Two More Categories Worth Knowing
Beyond the six core types, two additional categories appear frequently enough in Canadian business documents to warrant specific attention.
7. Indigenous community membership and status information
First Nations status, Métis registry membership, and Inuit community affiliation are among the most sensitive personal information categories in the Canadian context. Status cards, band membership records, and treaty entitlement documentation appear in a range of business contexts: government service delivery, healthcare benefit administration, certain employment programs, and social services.
This information is sensitive in ways that extend beyond the typical privacy analysis. It implicates community sovereignty, historical context around the misuse of Indigenous identity information, and specific legal frameworks under the Indian Act and related legislation. Standard PII detection tools have almost no coverage for this category. Organizations processing documents containing status or membership information need specific detection logic and the highest safeguards treatment.
8. Biometric reference data created by document processing
When documents containing photographs, signatures, or handwriting samples are scanned and processed through AI-based document understanding systems, those systems may create biometric reference data as a byproduct. A facial recognition system that processes a scanned passport photo creates a faceprint. A signature verification system that processes a signed contract creates a signature biometric.
This is an emerging category that most organizations don't think of as PII they're creating: they think of themselves as processing a document, not creating biometric data. But under PIPEDA and Quebec's Law 25, biometric reference data is personal information, and here it may have been collected (by the processing system) without the individual's knowledge or consent. Organizations using AI-powered document processing need to understand whether their tools create biometric data as a byproduct and what their obligations are for that data.
How to Test Your Detection Coverage
Knowing that these categories exist is step one. Knowing whether your current detection tools actually cover them is step two.
A practical coverage test uses a set of synthetic documents — documents with realistic but fictional personal information — that include each category you want to test. Run your detection tool against this set and review the results: what did it catch, what did it miss, and what did it flag incorrectly as PII that wasn't?
For Canadian-specific categories, the test set should include:
- SINs in multiple formats (with spaces, with hyphens, without separators, with OCR character errors)
- Health card numbers from each province your organization encounters
- The financial account number format (institution-transit-account)
- Postal codes in combination with other quasi-identifiers
- Documents with metadata containing personal information
- Multi-page documents where PII appears on a different page from the identifying context
The false negative rate on your test set — the percentage of PII instances the tool missed — is your exposure estimate. For high-sensitivity categories like SINs, a false negative rate above 2-3% in a well-structured test set should trigger evaluation of whether your detection tool is adequate for your compliance obligations.
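The false negative rate is the number of missed instances divided by the number of known instances, and it is most useful when broken out by category. The sketch below assumes each planted PII instance is recorded as a `(document, category, value)` tuple. That layout and the function name are illustrative choices, not a required format.

```python
from collections import defaultdict


def false_negative_rates(expected, detected):
    """Return {category: share of expected instances missing from `detected`}.

    Both arguments are sets of (document, category, value) tuples.
    """
    totals = defaultdict(int)
    missed = defaultdict(int)
    for instance in expected:
        category = instance[1]
        totals[category] += 1
        if instance not in detected:
            missed[category] += 1
    return {category: missed[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Fictional test-set bookkeeping: what was planted vs. what the tool reported.
    planted = {
        ("payroll.pdf", "SIN", "046454286"),
        ("scan_01.pdf", "SIN", "046 454 286"),
        ("claims.docx", "health_card", "1234567890"),
    }
    reported = {
        ("payroll.pdf", "SIN", "046454286"),
        ("claims.docx", "health_card", "1234567890"),
    }
    print(false_negative_rates(planted, reported))
```

Tracking the rate per category makes it easier to apply a stricter threshold to SINs than to lower-sensitivity fields.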
The PIPEDA Standard for Detection Adequacy
PIPEDA's safeguards principle requires protection appropriate to the sensitivity of the information. For an organization that holds SINs, health card numbers, or financial account details, "appropriate" means knowing where that information is — which in a document-heavy organization means automated detection capable of finding it across varied formats and document types.
The OPC hasn't published specific guidance on PII detection tool specifications. But its general approach to safeguards is clear: the measure of adequacy is whether the safeguard is proportionate to the sensitivity of what's being protected. An organization that holds thousands of documents containing SINs and uses a basic tool that misses 15% of them in irregular formats is not meeting the standard for safeguards appropriate to the sensitivity of a SIN.
The practical implication is that organizations in data-intensive industries — financial services, healthcare-adjacent businesses, professional services firms, payroll and HR processors — need detection coverage that actually covers the categories of sensitive information they handle, in the formats those categories actually appear in. Generic tools calibrated for US identifiers, processing English-only documents without metadata analysis, aren't that.