Microsoft Purview for AI Governance: Classifying and Protecting AI Training Data
AI training pipelines bypass traditional DLP controls because they access data as bulk blob reads, not document downloads. This guide shows how to configure Microsoft Purview specifically for AI data scenarios: scanning training datasets, designing a label taxonomy for AI use cases, enforcing DLP policies against AI pipelines, and integrating with Azure AI Foundry.
The Training Data Blind Spot Your DLP Policy Does Not Cover
Six months ago, a financial services team finished fine-tuning a credit risk model. The model performed well. The post-training audit did not. When the security team pulled the ADLS Gen2 access logs, they found the training dataset included blobs tagged as CONFIDENTIAL in three legacy classification systems, none of which had been integrated with Microsoft Purview. The data scientists did not know. The DLP policies did not fire. The model trained on customer PII, got deployed to production, and nobody flagged it until the compliance team ran a manual review.
This is the gap: AI training pipelines do not look like normal data access patterns. Traditional DLP catches email attachments and SharePoint downloads. It does not catch a Python script calling azure.storage.blob.BlobServiceClient in a Jupyter notebook running in Azure ML. By the time the model is trained, the data has moved.
Microsoft Purview is the right tool to close this gap, but only if you configure it for AI data scenarios specifically. The default Purview setup optimized for M365 documents will miss most of the risk surface in an AI training pipeline.
---
What Purview Actually Covers in an AI Context
Before configuring anything, get clarity on what Purview's different capabilities actually protect. The product has evolved significantly and the branding conflates several distinct things:
- Microsoft Purview Data Map scans data sources (ADLS Gen2, SQL, Blob, Fabric, etc.) and builds a catalog with sensitivity classifications. This is your starting point for AI data governance.
- Microsoft Purview Information Protection manages sensitivity labels at the Microsoft 365 layer, including labels applied to files in Azure Storage and Azure AI Foundry data connections.
- Microsoft Purview Data Loss Prevention enforces policies based on those labels, preventing exfiltration or unauthorized movement.
- Microsoft Purview Data Governance (formerly the Governance Portal) manages data products, access policies, and data use governance that integrates directly with Azure AI Foundry.
For AI training data protection, you need all four working together. Most teams have partial coverage and think they have full coverage.
---
Registering AI Data Sources in Purview Data Map
The first step is getting your AI data sources into the Purview catalog. For most Azure AI pipelines, that means ADLS Gen2 (raw training data), Azure Blob (processed datasets and model artifacts), and Azure SQL or Synapse (structured training features).
Registering ADLS Gen2
# Create a Purview account
az purview account create \
--name <purview-account-name> \
--resource-group <rg> \
--location eastus

# Grant the Purview managed identity read access to the training data storage
PURVIEW_MI=$(az purview account show \
--name <purview-account-name> \
--resource-group <rg> \
--query "identity.principalId" --output tsv)
az role assignment create \
--role "Storage Blob Data Reader" \
--assignee "$PURVIEW_MI" \
--scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>
Grant the Purview managed identity Storage Blob Data Reader on the ADLS account before running scans. Without this, scans silently return zero assets rather than failing with a permission error.
Configuring Scan Rules for AI Training Data
Default scan rules focus on classifiers common in M365 content, such as credit card numbers and Social Security Numbers. For AI training data, you need additional classifiers and file type coverage. A default scan rule set does not scan .parquet or .jsonl files, which is where most structured ML training data lives:
# Create a custom scan rule set targeting AI training data file types and classifiers
# Use the Purview REST API via CLI:
curl -X PUT "https://<purview-account>.purview.azure.com/scan/scanrulesets/ai-training-scan-rules?api-version=2023-09-01" \
-H "Authorization: Bearer $(az account get-access-token --resource https://purview.azure.net --query accessToken --output tsv)" \
-H "Content-Type: application/json" \
-d '{
"kind": "AdlsGen2",
"properties": {
"scanningRule": {
"fileExtensions": [
"CSV", "PARQUET", "JSONL", "JSON", "AVRO", "ORC",
"TSV", "TXT", "XLSX"
],
"includedSystemClassifications": [
"MICROSOFT.PERSONAL.NAME",
"MICROSOFT.PERSONAL.EMAIL",
"MICROSOFT.FINANCIAL.CREDIT_CARD",
"MICROSOFT.GOVERNMENT.US.SOCIAL_SECURITY_NUMBER",
"MICROSOFT.PERSONAL.IPADDRESS",
"MICROSOFT.PERSONAL.DATE_OF_BIRTH",
"MICROSOFT.PERSONAL.PHONE_NUMBER"
]
}
}
}'

The .parquet and .jsonl extensions are critical. Hugging Face datasets, BERT training corpora, and most structured ML datasets are in these formats. Without extending the scan rule set, a scan of your training data lake will return results showing zero sensitive data: a false negative, not a clean result.
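Because a missing extension produces a false negative rather than an error, it can help to check the payload before you send it. The following is a minimal sketch, assuming the payload shape used in the request above; `REQUIRED_EXTENSIONS` and `missing_extensions` are illustrative names, not part of any Purview SDK.

```python
# Extensions this guide treats as required for ML training data.
# This set is an assumption for illustration; adjust it to your data lake.
REQUIRED_EXTENSIONS = {"CSV", "PARQUET", "JSONL", "AVRO", "ORC", "TSV"}


def missing_extensions(payload: dict) -> set:
    """Return the required extensions that a scan rule set payload does not list."""
    rule = payload.get("properties", {}).get("scanningRule", {})
    # Normalize so that "parquet", ".parquet", and "PARQUET" all compare equal.
    configured = {ext.upper().lstrip(".") for ext in rule.get("fileExtensions", [])}
    return REQUIRED_EXTENSIONS - configured


if __name__ == "__main__":
    payload = {
        "kind": "AdlsGen2",
        "properties": {"scanningRule": {"fileExtensions": ["CSV", "JSON", "TXT"]}},
    }
    # Prints the extensions that still need to be added to the rule set.
    print(sorted(missing_extensions(payload)))
```

Running a check like this in CI against the rule set definition is one way to stop a later edit from quietly dropping a file type.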
---
Sensitivity Label Taxonomy for AI Training Data
Standard information protection label taxonomies work for documents but create practical problems for AI training data. A dataset that's "Confidential" in document terms might be fine for internal model training but should never appear in a RAG pipeline that serves external users. The label needs to encode both the sensitivity level and the permitted AI use case.
Design a sub-label hierarchy specifically for AI use cases:
| Label | Sub-label | Permitted AI Use | Examples |
|---|---|---|---|
| Confidential | AI-Training-Approved | Internal model training; no external serving | Anonymized customer features, synthetic PII |
| Confidential | AI-RAG-Approved | RAG grounding for internal apps only | Internal knowledge base, HR policies |
| Confidential | AI-Prohibited | No AI use permitted | Raw customer PII, financial records |
| Highly Confidential | AI-Prohibited | No AI use permitted | Regulated data, trade secrets |
| Public | AI-Training-Approved | Any AI use | Public datasets, open-source corpora |
# Connect to Compliance Center
Connect-IPPSSession

# Create the AI-Training-Approved sub-label under Confidential
New-Label `
-Name "Confidential-AI-Training-Approved" `
-DisplayName "Confidential - AI Training Approved" `
-ParentId "Confidential" `
-Tooltip "For datasets approved for internal AI model training. Not permitted for RAG pipelines serving external users."
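# Create the AI-Prohibited sub-label
New-Label `
-Name "Confidential-AI-Prohibited" `
-DisplayName "Confidential - AI Prohibited" `
-ParentId "Confidential" `
-Tooltip "Contains PII or regulated data. Must not be used for AI training, fine-tuning, or RAG grounding."
# Create the AI-RAG-Approved sub-label (referenced by the label policy below)
New-Label `
-Name "Confidential-AI-RAG-Approved" `
-DisplayName "Confidential - AI RAG Approved" `
-ParentId "Confidential" `
-Tooltip "For content approved as RAG grounding data in internal applications only."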
# Publish labels to Azure Storage and Azure Synapse scopes
New-LabelPolicy `
-Name "AI-Training-Data-Labels" `
-Labels @("Confidential-AI-Training-Approved", "Confidential-AI-Prohibited", "Confidential-AI-RAG-Approved") `
-ExchangeLocation None `
-SharePointLocation None `
-ModernGroupLocation None
After creating labels, publish them explicitly to the Azure Storage locations where your training data lives. Publishing scope matters: if you publish only to M365 workloads, blobs never receive labels from automated classification.
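The taxonomy table above is easier to enforce in pipeline code if it is encoded as a lookup. This is a minimal sketch that transcribes the table; the use-case names ("training", "rag-internal", "rag-external") are assumptions made for illustration and are not Purview identifiers.

```python
# Permitted AI uses per (label, sub-label), transcribed from the taxonomy table.
PERMITTED_USES = {
    ("Confidential", "AI-Training-Approved"): {"training"},
    ("Confidential", "AI-RAG-Approved"): {"rag-internal"},
    ("Confidential", "AI-Prohibited"): set(),
    ("Highly Confidential", "AI-Prohibited"): set(),
    ("Public", "AI-Training-Approved"): {"training", "rag-internal", "rag-external"},
}


def is_use_permitted(label: str, sub_label: str, use_case: str) -> bool:
    """Deny by default: an unknown label combination permits nothing."""
    return use_case in PERMITTED_USES.get((label, sub_label), set())


if __name__ == "__main__":
    # Confidential training data may train a model but not ground an external RAG app.
    print(is_use_permitted("Confidential", "AI-Training-Approved", "training"))
    print(is_use_permitted("Confidential", "AI-Training-Approved", "rag-external"))
```

The deny-by-default lookup matters more than the exact use-case names: a dataset with a label nobody anticipated should fail closed.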
---
DLP Policies for AI Training Pipelines
With labels applied to training data, you can now enforce policies that control what pipelines can do with that data.
Policy 1: Alert on Large-Scale Extracts of Confidential Training Data
Even for approved training data, bulk extraction events should trigger an alert. A DLP policy targeting volume thresholds catches data scientists who pull full datasets to local machines outside of approved pipeline infrastructure:
In the Purview compliance portal, create an endpoint DLP policy with:
- Condition: content contains the Confidential - AI Training Approved label
- Condition: activity is "Copy to cloud storage" or "Upload to network share"
- Action: Audit and alert (do not block, since this is an approved label)
- Alert threshold: more than 5 files in 15 minutes
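To make the threshold concrete, here is a sketch of the "more than 5 files in 15 minutes" rule as a sliding window over event timestamps. It illustrates the counting logic only and does not reflect how Purview evaluates the policy internally; `exceeds_threshold` is a hypothetical helper.

```python
from collections import deque
from datetime import datetime, timedelta


def exceeds_threshold(timestamps, max_files=5, window=timedelta(minutes=15)):
    """Return True if more than max_files events fall inside any sliding window."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are a full window or more older than the newest event.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) > max_files:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    burst = [start + timedelta(minutes=i) for i in range(6)]  # 6 copies in 6 minutes
    print(exceeds_threshold(burst))
```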
Policy 2: Block AI-Prohibited Data from AI Foundry and Azure ML Principals
Create a service endpoint DLP policy that blocks any blob containing an AI-Prohibited label from being accessed by Azure ML compute or AI Foundry service principals. In the Purview compliance portal:
- Policy scope: Azure Storage
- Condition: sensitivity label is Confidential - AI Prohibited or Highly Confidential - AI Prohibited
- Condition: accessed by service principals matching the Azure ML or AI Foundry workspace managed identity patterns
- Action: Block and alert
This policy requires the storage accounts to have Purview data use governance enabled (covered in the next section).
Policy 3: Tag Model Artifacts with Source Dataset Label
Fine-tuned models trained on confidential data should inherit the source dataset's classification. Add a tagging step to your Azure ML training pipeline:
# Add to training job script, after model save
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

# Placeholders: replace with your workspace details
subscription_id = "<sub-id>"
resource_group = "<rg>"
workspace_name = "<workspace-name>"
# ID of the Purview scan that covered the training data, recorded by your pipeline
training_scan_id = "<purview-scan-id>"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Tag the model version with the highest sensitivity label from training data
ml_client.models.create_or_update(
    Model(
        name="credit-risk-model",
        version="2.1",
        path="./outputs/model",
        tags={
            "sensitivity_label": "Confidential-AI-Training-Approved",
            "training_data_classification": "CONFIDENTIAL",
            "ai_serving_permitted": "internal-only",
            "purview_scan_id": training_scan_id,  # Link to the Purview scan of training data
        },
    )
)
This tagging approach creates an auditable trail and enables downstream policy enforcement when the model is deployed to AI Foundry or an Azure ML online endpoint. It does not automatically block serving externally, but it provides the metadata that access control policies can act on.
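A deployment gate that reads those tags might look like the sketch below. It assumes the `ai_serving_permitted` tag from the training script; the "any" value and the `can_deploy` helper are hypothetical additions for illustration, and the check operates on a plain dictionary rather than calling Azure ML.

```python
def can_deploy(model_tags: dict, audience: str) -> bool:
    """Decide whether a tagged model may serve the given audience.

    audience is "internal" or "external" (illustrative values).
    A missing or unrecognized tag is treated as not permitted.
    """
    permitted = model_tags.get("ai_serving_permitted")
    if permitted == "any":  # hypothetical value for models trained on public data
        return True
    if permitted == "internal-only":
        return audience == "internal"
    return False


if __name__ == "__main__":
    tags = {
        "sensitivity_label": "Confidential-AI-Training-Approved",
        "ai_serving_permitted": "internal-only",
    }
    print(can_deploy(tags, "internal"), can_deploy(tags, "external"))
```

Treating an untagged model as blocked is a deliberate choice here: a model that skipped the tagging step should not get the most permissive outcome.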
---
Azure AI Foundry Integration
The most direct integration between Purview and Azure AI Foundry is through data use governance. When a Foundry project connects to a storage source, Purview can enforce label-based access policies through this workflow.
To enable data use governance on a storage account:
# Enable data use governance on the training data storage account
# This can be done via the Purview portal under Data Map > Sources > [source] > Edit
# Or via the Purview REST API:
PURVIEW_TOKEN=$(az account get-access-token --resource https://purview.azure.net --query accessToken --output tsv)
# Enable data use governance on the registered source
curl -X PUT "https://<purview-account>.purview.azure.com/scan/datasources/training-data-adls?api-version=2023-09-01" \
-H "Authorization: Bearer $PURVIEW_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"kind": "AdlsGen2",
"properties": {
"endpoint": "https://<storage-account>.dfs.core.windows.net",
"dataUseGovernance": "Enabled",
"collection": {
"referenceName": "AI-Training-Data",
"type": "CollectionReference"
}
}
}'
# Grant the AI Foundry hub managed identity the Purview Data Reader role
FOUNDRY_MI_OBJECT_ID="<hub-managed-identity-object-id>"
az role assignment create \
--role "Purview Data Reader" \
--assignee "$FOUNDRY_MI_OBJECT_ID" \
--scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Purview/accounts/<purview-account>
With data use governance enabled, Purview becomes an authorization layer for Foundry data connections. Access requests for governed data sources flow through Purview's access request workflow rather than being granted directly via Azure RBAC.
What Labels Do Not Do Automatically
Labels applied to blobs do not block reads at the Azure Storage API layer. The blob read still succeeds for anyone with Storage Blob Data Reader. Labels are metadata that policies act on: they are only protective if you have wired up enforcement.
The enforcement chain is: Purview label applied to blob by scan or manual tagging, then DLP policy evaluates the label, then the policy action (block, alert, or audit) fires. All three links must be in place. A label without a policy is just a tag.
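The three links can be sketched as a small evaluation function. This models the reasoning only, not Purview's policy engine; the `Policy` class and `evaluate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Policy:
    label: str   # sensitivity label the policy matches
    action: str  # "block", "alert", or "audit"


def evaluate(label: Optional[str], policies: List[Policy]) -> str:
    """Return the action that fires for a blob, or "none" if nothing enforces it."""
    if label is None:
        return "none"  # link 1 missing: the blob was never labeled
    for policy in policies:
        if policy.label == label:
            return policy.action  # links 2 and 3: a policy matched and its action fires
    return "none"  # the label exists, but no policy references it


if __name__ == "__main__":
    policies = [Policy("Confidential - AI Prohibited", "block")]
    print(evaluate("Confidential - AI Prohibited", policies))
    print(evaluate("Confidential - AI Training Approved", policies))
```

The second call is the case to watch for in audits: a correctly labeled blob that no policy covers behaves exactly like an unlabeled one.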
For how Purview sensitivity labels integrate with the broader AI Foundry security model, see the Azure AI Foundry security and RBAC governance guide.
---
KQL Detection: Monitoring Classification Gaps
Detecting Access to Unclassified Blobs from AI Workloads
StorageBlobLogs
| where TimeGenerated > ago(7d)
| where OperationName in ("GetBlob", "PutBlob", "ListBlobs")
| where AuthenticationType contains "OAuth"
| extend CallerApp = tostring(UserAgentHeader)
| where CallerApp contains "azure-ai" or CallerApp contains "azureml" or CallerApp contains "aml-"
| extend BlobPath = strcat(AccountName, "/", tostring(split(Uri, "/")[3]))
| summarize AccessCount = count(), CallerIps = make_set(CallerIpAddress)
by BlobPath, bin(TimeGenerated, 1h)
| where AccessCount > 20
| order by AccessCount desc

Alert threshold: more than 20 blob operations per hour from Azure ML or AI Foundry clients, identified here by user agent. The query cannot see classification status on its own, so cross-reference the resulting paths against your Purview asset catalog to find training jobs pulling datasets that have not been scanned.
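The cross-reference step is a set difference once both sides are exported. A minimal sketch, assuming you have the accessed paths from the query results and a list of paths present in the Purview catalog; how you export either list is up to you, and `unscanned_paths` is a hypothetical helper.

```python
def unscanned_paths(accessed_paths, cataloged_paths):
    """Return accessed paths that do not appear in the Purview asset catalog."""
    return sorted(set(accessed_paths) - set(cataloged_paths))


if __name__ == "__main__":
    accessed = ["trainacct/raw", "trainacct/features", "trainacct/scratch"]
    cataloged = ["trainacct/raw", "trainacct/features"]
    # Paths listed here were read by an AI workload but never scanned.
    print(unscanned_paths(accessed, cataloged))
```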
Detecting Large-Scale Extracts of Sensitive Training Data
StorageBlobLogs
| where TimeGenerated > ago(24h)
| where OperationName == "GetBlob"
| where ResponseBodySize > 10000000
| summarize TotalBytesRead = sum(ResponseBodySize),
BlobCount = count(),
Principals = make_set(RequesterObjectId)
by CallerIpAddress, bin(TimeGenerated, 30m)
| where TotalBytesRead > 1073741824
| project TimeGenerated, CallerIpAddress, TotalGBRead = round(TotalBytesRead / 1073741824.0, 2), BlobCount, Principals
| order by TotalGBRead desc

Alert threshold: any caller reading more than 1 GB in 30 minutes. This catches ad hoc extraction of full training datasets outside of approved pipeline infrastructure. At that volume, you want to confirm it is an approved pipeline job and not an engineer pulling data to a personal machine.
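The same aggregation can be sketched offline, which is handy for testing thresholds against exported logs before deploying the alert. This sketch assumes read events as (timestamp, caller, bytes) tuples, a shape chosen for illustration. It floors timestamps within the hour, which matches 30-minute bins but only works for windows that divide 60, and it does not apply the query's per-blob size filter.

```python
from collections import defaultdict
from datetime import datetime

GIB = 1024 ** 3  # 1073741824 bytes, the threshold used in the query


def heavy_readers(reads, window_minutes=30, threshold_bytes=GIB):
    """Sum bytes per (window start, caller) and keep totals above the threshold."""
    totals = defaultdict(int)
    for ts, caller, size in reads:
        # Floor the timestamp to the start of its window within the hour.
        minute = (ts.minute // window_minutes) * window_minutes
        bin_start = ts.replace(minute=minute, second=0, microsecond=0)
        totals[(bin_start, caller)] += size
    return {key: total for key, total in totals.items() if total > threshold_bytes}


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 10, 5)
    reads = [(t, "10.0.0.4", GIB), (t.replace(minute=20), "10.0.0.4", 512)]
    for (bin_start, caller), total in heavy_readers(reads).items():
        print(bin_start, caller, round(total / GIB, 2))
```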
---
Common Configuration Failures
Scan Credentials Not Persisted After Platform Rotation
Purview scan credentials (the managed identity or service principal that reads your data sources) need to be re-verified when the underlying identity rotates. A common failure mode: the MI used for scanning gets replaced during a platform team rotation, scans start returning zero results, and nobody notices for weeks because there is no scan failure alert. Audit scan run history monthly and set up a Logic App alert on scan runs that return zero assets in sources with known data.
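The zero-asset check itself is simple once scan run history is exported. A minimal sketch, assuming each run is a dictionary with `source`, `status`, and `assets_discovered` keys; that shape is hypothetical and is not the Purview API response format.

```python
def suspicious_scans(scan_runs, sources_with_data):
    """Flag successful scan runs that found zero assets in sources known to hold data."""
    return [
        run for run in scan_runs
        if run["source"] in sources_with_data
        and run["status"] == "Succeeded"
        and run["assets_discovered"] == 0
    ]


if __name__ == "__main__":
    runs = [
        {"source": "training-data-adls", "status": "Succeeded", "assets_discovered": 0},
        {"source": "empty-sandbox", "status": "Succeeded", "assets_discovered": 0},
    ]
    # Only the source that is known to contain data is reported.
    print(suspicious_scans(runs, {"training-data-adls"}))
```

Restricting the check to sources with known data keeps empty sandboxes from generating noise.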
Label Inheritance Disabled on Containers
By default, Purview does not inherit labels from parent containers to child blobs. If you label a container as Confidential - AI-Training-Approved, individual blobs inside it do not receive that label unless you explicitly enable label inheritance in the scan rule set. Most teams apply the container label and only discover that the blobs themselves are unlabeled when they run a DLP report.
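You can look for this gap before the DLP report does. The sketch below assumes two exported mappings, container to label and blob path to label, and lists blobs that carry no label even though their container does. Both the input shapes and the `unlabeled_in_labeled_containers` helper are hypothetical.

```python
def unlabeled_in_labeled_containers(container_labels, blob_labels):
    """Return (blob path, container label) pairs for unlabeled blobs in labeled containers.

    container_labels: {container name: label or None}
    blob_labels: {"container/blob/path": label or None}
    """
    gaps = []
    for path, label in blob_labels.items():
        container = path.split("/", 1)[0]
        if label is None and container_labels.get(container):
            gaps.append((path, container_labels[container]))
    return sorted(gaps)


if __name__ == "__main__":
    containers = {"train": "Confidential - AI-Training-Approved"}
    blobs = {"train/part-0.parquet": None}
    print(unlabeled_in_labeled_containers(containers, blobs))
```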
ADLS Gen2 Registered as Blob Storage
If your storage account has hierarchical namespace (HNS) enabled, you must register it in Purview as AdlsGen2, not AzureBlobStorage. Registering an HNS storage account as blob results in incomplete directory traversal: training data organized in directory trees (which is universal for ML datasets) will have most assets missed. Check the storage account's HNS setting and match the Purview source kind accordingly.
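The check can be automated against the JSON that `az storage account show` returns, which includes an `isHnsEnabled` property. This is a minimal sketch operating on that parsed output; `purview_source_kind` and `is_misregistered` are hypothetical helpers.

```python
def purview_source_kind(storage_account: dict) -> str:
    """Pick the Purview source kind from parsed `az storage account show` output."""
    return "AdlsGen2" if storage_account.get("isHnsEnabled") else "AzureBlobStorage"


def is_misregistered(storage_account: dict, registered_kind: str) -> bool:
    """True when the kind registered in Purview does not match the HNS setting."""
    return purview_source_kind(storage_account) != registered_kind


if __name__ == "__main__":
    account = {"name": "trainingdata", "isHnsEnabled": True}
    # An HNS account registered as blob storage is the failure described above.
    print(is_misregistered(account, "AzureBlobStorage"))
```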
---
Hardening Checklist
- [ ] Purview account deployed; managed identity granted Storage Blob Data Reader on all AI training storage accounts
- [ ] ADLS Gen2 sources registered with AdlsGen2 kind, not AzureBlobStorage; HNS traversal confirmed working
- [ ] Scan rule sets extended to include .parquet, .jsonl, .avro, .orc, .csv, .tsv file types
- [ ] Personal data classifiers explicitly listed in scan rule set (name, email, SSN, phone, IP, DOB)
- [ ] Sensitivity label sub-taxonomy created for AI use cases: AI-Training-Approved, AI-RAG-Approved, AI-Prohibited
- [ ] Labels published to Azure Storage scope, not only to M365 workloads
- [ ] Label inheritance enabled at container level in scan rule sets
- [ ] DLP alert policy configured for bulk extraction of AI-Training-Approved data (threshold: more than 5 files per 15 minutes)
- [ ] DLP block policy configured for AI-Prohibited data accessed by Azure ML or AI Foundry service principals
- [ ] Data use governance enabled on all training data storage accounts
- [ ] AI Foundry hub managed identity granted Purview Data Reader role
- [ ] AI training pipeline scripts tag model artifacts with source dataset sensitivity label and Purview scan ID
- [ ] KQL alert deployed for unclassified blob access by AI workload principals (threshold: more than 20 reads per hour)
- [ ] KQL alert deployed for large-scale extracts (threshold: more than 1GB per 30 minutes per caller)
- [ ] Monthly audit of Purview scan run history to detect broken scan credentials
- [ ] Foundry project data connections reviewed against Purview governed sources list quarterly