Cyber Intelligence

Microsoft Purview for AI Governance: Classifying and Protecting AI Training Data

AI training pipelines bypass traditional DLP controls because they access data as bulk blob reads, not document downloads. This guide shows how to configure Microsoft Purview specifically for AI data scenarios: scanning training datasets, designing a label taxonomy for AI use cases, enforcing DLP policies against AI pipelines, and integrating with Azure AI Foundry.

Microsoft Cloud Solution Architect

The Training Data Blind Spot Your DLP Policy Does Not Cover

Six months ago, a financial services team finished fine-tuning a credit risk model. The model performed well. The post-training audit did not. When the security team pulled the ADLS Gen2 access logs, they found the training dataset included blobs tagged as CONFIDENTIAL in three legacy classification systems, none of which had been integrated with Microsoft Purview. The data scientists did not know. The DLP policies did not fire. The model trained on customer PII, got deployed to production, and nobody flagged it until the compliance team ran a manual review.

This is the gap: AI training pipelines do not look like normal data access patterns. Traditional DLP catches email attachments and SharePoint downloads. It does not catch a Python script calling azure.storage.blob.BlobServiceClient in a Jupyter notebook running in Azure ML. By the time the model is trained, the data has moved.

Microsoft Purview is the right tool to close this gap, but only if you configure it for AI data scenarios specifically. The default Purview setup optimized for M365 documents will miss most of the risk surface in an AI training pipeline.

What Purview Actually Covers in an AI Context

Before configuring anything, get clarity on what Purview's different capabilities actually protect. The product has evolved significantly and the branding conflates several distinct things:

Microsoft Purview Data Map scans data sources (ADLS Gen2, SQL, Blob, Fabric, etc.) and builds a catalog with sensitivity classifications. This is your starting point for AI data governance.

Microsoft Purview Information Protection manages sensitivity labels at the Microsoft 365 layer, including labels applied to files in Azure Storage and Azure AI Foundry data connections.

Microsoft Purview Data Loss Prevention enforces policies based on those labels, preventing exfiltration or unauthorized movement.

Microsoft Purview Data Governance (formerly the Governance Portal) manages data products, access policies, and data use governance that integrates directly with Azure AI Foundry.

For AI training data protection, you need all four working together. Most teams have partial coverage and think they have full coverage.

Registering AI Data Sources in Purview Data Map

The first step is getting your AI data sources into the Purview catalog. For most Azure AI pipelines, that means ADLS Gen2 (raw training data), Azure Blob (processed datasets and model artifacts), and Azure SQL or Synapse (structured training features).

Registering ADLS Gen2

# Create a Purview account
az purview account create \
  --name <purview-account-name> \
  --resource-group <rg> \
  --location eastus

# Grant the Purview managed identity read access to the training data storage
PURVIEW_MI=$(az purview account show \
  --name <purview-account-name> \
  --resource-group <rg> \
  --query "identity.principalId" --output tsv)

az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee "$PURVIEW_MI" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>

Grant the Purview managed identity Storage Blob Data Reader on the ADLS account before running scans. Without this, scans silently return zero assets rather than failing with a permission error.

Configuring Scan Rules for AI Training Data

Default scan rule sets emphasize classifiers built for M365-style content, such as credit card numbers and Social Security Numbers. For AI training data, you need additional classifiers and file type coverage. A default scan rule set does not scan .parquet or .jsonl files, which is where most structured ML training data lives:

# Create a custom scan rule set targeting AI training data file types and classifiers
# Use the Purview REST API via CLI:
curl -X PUT "https://<purview-account>.purview.azure.com/scan/scanrulesets/ai-training-scan-rules?api-version=2023-09-01" \
  -H "Authorization: Bearer $(az account get-access-token --resource https://purview.azure.com --query accessToken --output tsv)" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AdlsGen2",
    "properties": {
      "scanningRule": {
        "fileExtensions": [
          "CSV", "PARQUET", "JSONL", "JSON", "AVRO", "ORC",
          "TSV", "TXT", "XLSX"
        ],
        "includedSystemClassifications": [
          "MICROSOFT.PERSONAL.NAME",
          "MICROSOFT.PERSONAL.EMAIL",
          "MICROSOFT.FINANCIAL.CREDIT_CARD",
          "MICROSOFT.GOVERNMENT.US.SOCIAL_SECURITY_NUMBER",
          "MICROSOFT.PERSONAL.IPADDRESS",
          "MICROSOFT.PERSONAL.DATE_OF_BIRTH",
          "MICROSOFT.PERSONAL.PHONE_NUMBER"
        ]
      }
    }
  }'

The .parquet and .jsonl extensions are critical. Hugging Face datasets, BERT training corpora, and most structured ML datasets are in these formats. Without extending the scan rule set, a scan of your training data lake will return results showing zero sensitive data: a false negative, not a clean result.
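Before trusting a "zero sensitive data" result, it can help to check which file types in the data lake fall outside the rule set at all. The following is a minimal sketch in plain Python, assuming you already have a listing of blob paths (for example from a storage inventory export). The function name and the sample listing are hypothetical; the extension set mirrors the custom rule set above.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extensions from the custom scan rule set above (upper-case, no dot).
SCANNED_EXTENSIONS = {"CSV", "PARQUET", "JSONL", "JSON", "AVRO", "ORC", "TSV", "TXT", "XLSX"}


def unscanned_extensions(blob_paths, scanned=SCANNED_EXTENSIONS):
    """Count blobs per extension that the rule set would skip."""
    skipped = Counter()
    for path in blob_paths:
        ext = PurePosixPath(path).suffix.lstrip(".").upper()
        if ext not in scanned:
            skipped[ext or "(none)"] += 1
    return dict(skipped)


# Hypothetical listing of a training container.
listing = [
    "raw/2024/customers.parquet",
    "raw/2024/events.jsonl",
    "features/embeddings.npy",
    "features/labels.pkl",
    "features/more_labels.pkl",
    "README",
]
print(unscanned_extensions(listing))
```

Any extension that shows up in the result is content the scan never looked at, so a clean report says nothing about it.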

Sensitivity Label Taxonomy for AI Training Data

Standard information protection label taxonomies work for documents but create practical problems for AI training data. A dataset that's "Confidential" in document terms might be fine for internal model training but should never appear in a RAG pipeline that serves external users. The label needs to encode both the sensitivity level and the permitted AI use case.

Design a sub-label hierarchy specifically for AI use cases:

Label | Sub-label | Permitted AI Use | Examples
Confidential | AI-Training-Approved | Internal model training; no external serving | Anonymized customer features, synthetic PII
Confidential | AI-RAG-Approved | RAG grounding for internal apps only | Internal knowledge base, HR policies
Confidential | AI-Prohibited | No AI use permitted | Raw customer PII, financial records
Highly Confidential | AI-Prohibited | No AI use permitted | Regulated data, trade secrets
Public | AI-Training-Approved | Any AI use | Public datasets, open-source corpora

Implement this via PowerShell against the Compliance Center:

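# Connect to Security & Compliance PowerShell
Connect-IPPSSession

# Create the AI-Training-Approved sub-label under Confidential
New-Label `
  -Name "Confidential-AI-Training-Approved" `
  -DisplayName "Confidential - AI Training Approved" `
  -ParentId "Confidential" `
  -Tooltip "For datasets approved for internal AI model training. Not permitted for RAG pipelines serving external users."

# Create the AI-RAG-Approved sub-label (the label policy below references it, so it must exist first)
New-Label `
  -Name "Confidential-AI-RAG-Approved" `
  -DisplayName "Confidential - AI RAG Approved" `
  -ParentId "Confidential" `
  -Tooltip "For content approved as RAG grounding in internal applications only. Not permitted for externally served pipelines."

# Create the AI-Prohibited sub-label
New-Label `
  -Name "Confidential-AI-Prohibited" `
  -DisplayName "Confidential - AI Prohibited" `
  -ParentId "Confidential" `
  -Tooltip "Contains PII or regulated data. Must not be used for AI training, fine-tuning, or RAG grounding."

# Publish the labels in a label policy. The location parameters below cover M365
# workloads only; extending the labels to Azure Storage and Azure Synapse assets
# is configured separately for the Data Map.
New-LabelPolicy `
  -Name "AI-Training-Data-Labels" `
  -Labels @("Confidential-AI-Training-Approved", "Confidential-AI-Prohibited", "Confidential-AI-RAG-Approved") `
  -ExchangeLocation None `
  -SharePointLocation None `
  -ModernGroupLocation None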

After creating labels, publish them explicitly to the Azure Storage locations where your training data lives. Publishing scope matters: if you publish only to M365 workloads, blobs never receive labels from automated classification.
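Pipeline code often needs the same taxonomy in machine-readable form, so a job can check a dataset's label before using it. The sketch below transcribes the taxonomy table into a lookup. The use-case identifiers ("internal-training", "internal-rag", "external-serving") and the function name are illustrative assumptions, not Purview values.

```python
# (label, sub-label) -> set of permitted AI uses, transcribed from the taxonomy table.
# The use-case names are hypothetical identifiers chosen for this example.
PERMITTED_USES = {
    ("Confidential", "AI-Training-Approved"): {"internal-training"},
    ("Confidential", "AI-RAG-Approved"): {"internal-rag"},
    ("Confidential", "AI-Prohibited"): set(),
    ("Highly Confidential", "AI-Prohibited"): set(),
    ("Public", "AI-Training-Approved"): {"internal-training", "internal-rag", "external-serving"},
}


def is_use_permitted(label, sub_label, use_case):
    """Return True only if the taxonomy explicitly permits the use.

    Unknown or missing label combinations are denied by default.
    """
    return use_case in PERMITTED_USES.get((label, sub_label), set())


print(is_use_permitted("Confidential", "AI-Training-Approved", "internal-training"))
print(is_use_permitted("Confidential", "AI-Training-Approved", "external-serving"))
print(is_use_permitted("Internal", None, "internal-training"))
```

Denying by default matters here: an unlabeled dataset is treated as not approved rather than as unrestricted.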

DLP Policies for AI Training Pipelines

With labels applied to training data, you can now enforce policies that control what pipelines can do with that data.

Policy 1: Alert on Large-Scale Extracts of Confidential Training Data

Even for approved training data, bulk extraction events should trigger an alert. A DLP policy targeting volume thresholds catches data scientists who pull full datasets to local machines outside of approved pipeline infrastructure:

In the Purview compliance portal, create an endpoint DLP policy with:

  • Condition: content contains Confidential - AI Training Approved label
  • Condition: activity is "Copy to cloud storage" or "Upload to network share"
  • Action: Audit and alert (do not block, since this is an approved label)
  • Alert threshold: more than 5 files in 15 minutes
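To reason about how the "more than 5 files in 15 minutes" threshold behaves, a sliding-window counter is one way to model it. This is a sketch of the thresholding logic only, under the assumption of a trailing window; it does not reproduce how Purview evaluates alerts internally, and the function name is hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def bulk_copy_alerts(event_times, max_files=5, window=timedelta(minutes=15)):
    """Return the timestamps at which more than max_files events fall in the trailing window."""
    recent = deque()
    alerts = []
    for ts in sorted(event_times):
        recent.append(ts)
        # Drop events that have aged out of the window ending at ts.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) > max_files:
            alerts.append(ts)
    return alerts


start = datetime(2025, 1, 1, 9, 0)
# Six copies one minute apart: the sixth crosses the "more than 5" threshold.
burst = [start + timedelta(minutes=i) for i in range(6)]
# Six copies twenty minutes apart never share a window.
slow = [start + timedelta(minutes=20 * i) for i in range(6)]
print(bulk_copy_alerts(burst))
print(bulk_copy_alerts(slow))
```

The second case shows the limit of a volume threshold: the same six files copied slowly raise no alert, which is why the storage-log queries later in this guide are a useful complement.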

Policy 2: Block AI-Prohibited Data from AI Foundry and Azure ML Principals

Create a service endpoint DLP policy that blocks any blob containing an AI-Prohibited label from being accessed by Azure ML compute or AI Foundry service principals. In the Purview compliance portal:

  • Policy scope: Azure Storage
  • Condition: sensitivity label is Confidential - AI Prohibited or Highly Confidential - AI Prohibited
  • Condition: accessed by service principals matching the Azure ML or AI Foundry workspace managed identity patterns
  • Action: Block and alert

This policy requires the storage accounts to have Purview data use governance enabled (covered in the next section).

Policy 3: Tag Model Artifacts with Source Dataset Label

Fine-tuned models trained on confidential data should inherit the source dataset's classification. Add a tagging step to your Azure ML training pipeline:

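# Add to training job script, after model save
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model  # Model lives in azure.ai.ml.entities, not azure.ai.ml
from azure.identity import DefaultAzureCredential

# Placeholders: supply these from your job configuration or environment
subscription_id = "<sub-id>"
resource_group = "<rg>"
workspace_name = "<workspace>"
training_scan_id = "<purview-scan-id>"  # ID of the Purview scan that covered the training data

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Tag the model version with the highest sensitivity label from training data
ml_client.models.create_or_update(
    Model(
        name="credit-risk-model",
        version="2.1",
        path="./outputs/model",
        tags={
            "sensitivity_label": "Confidential-AI-Training-Approved",
            "training_data_classification": "CONFIDENTIAL",
            "ai_serving_permitted": "internal-only",
            "purview_scan_id": training_scan_id  # Link to the Purview scan of training data
        }
    )
)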

This tagging approach creates an auditable trail and enables downstream policy enforcement when the model is deployed to AI Foundry or an Azure ML online endpoint. It does not automatically block serving externally, but it provides the metadata that access control policies can act on.
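When a model is trained on several datasets, the tag should carry the most restrictive label among them. The sketch below shows one way to pick it. The ranking order is an assumption made for illustration (in particular, where the RAG-approved label sits), the "Highly-Confidential-AI-Prohibited" name is hypothetical, and the function name is not part of any SDK.

```python
# Ordering from least to most restrictive. This ranking is an assumption for
# illustration; adjust it to match your own label priorities.
LABEL_RANK = [
    "Public-AI-Training-Approved",
    "Confidential-AI-Training-Approved",
    "Confidential-AI-RAG-Approved",
    "Confidential-AI-Prohibited",
    "Highly-Confidential-AI-Prohibited",
]


def most_restrictive_label(dataset_labels):
    """Return the most restrictive label; raise if any dataset is unlabeled or unknown."""
    if not dataset_labels:
        raise ValueError("no source datasets supplied")
    unknown = [label for label in dataset_labels if label not in LABEL_RANK]
    if unknown:
        raise ValueError(f"unrecognized or missing labels: {unknown}")
    return max(dataset_labels, key=LABEL_RANK.index)


sources = ["Public-AI-Training-Approved", "Confidential-AI-Training-Approved"]
print(most_restrictive_label(sources))
```

Raising on an unknown label, rather than ignoring it, keeps an unlabeled dataset from silently lowering the model's classification.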

Azure AI Foundry Integration

The most direct integration between Purview and Azure AI Foundry is through data use governance. When a Foundry project connects to a storage source, Purview can enforce label-based access policies through this workflow.

To enable data use governance on a storage account:

# Enable data use governance on the training data storage account
# This must be done via the Purview portal under Data Map > Sources > [source] > Edit
# Or via the Purview REST API:

PURVIEW_TOKEN=$(az account get-access-token --resource https://purview.azure.com --query accessToken --output tsv)

# Enable data use governance on the registered source
curl -X PUT "https://<purview-account>.purview.azure.com/scan/datasources/training-data-adls?api-version=2023-09-01" \
  -H "Authorization: Bearer $PURVIEW_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "AdlsGen2",
    "properties": {
      "endpoint": "https://<storage-account>.dfs.core.windows.net",
      "dataUseGovernance": "Enabled",
      "collection": {
        "referenceName": "AI-Training-Data",
        "type": "CollectionReference"
      }
    }
  }'

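# Grant the AI Foundry hub managed identity the Purview Data Reader role
# Note: many Purview accounts assign Data Reader as a collection role inside Purview
# rather than through Azure RBAC. Verify which permission model your account uses
# before relying on this command.
FOUNDRY_MI_OBJECT_ID="<hub-managed-identity-object-id>"
az role assignment create \
  --role "Purview Data Reader" \
  --assignee "$FOUNDRY_MI_OBJECT_ID" \
  --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Purview/accounts/<purview-account>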

With data use governance enabled, Purview becomes an authorization layer for Foundry data connections. Access requests for governed data sources flow through Purview's access request workflow rather than being granted directly via Azure RBAC.

What Labels Do Not Do Automatically

Labels applied to blobs do not block reads at the Azure Storage API layer. The blob read still succeeds for anyone with Storage Blob Data Reader. Labels are metadata that policies act on: they are only protective if you have wired up enforcement.

The enforcement chain is: Purview label applied to blob by scan or manual tagging, then DLP policy evaluates the label, then the policy action (block, alert, or audit) fires. All three links must be in place. A label without a policy is just a tag.
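The chain can be modeled in a few lines to make the failure mode concrete. This is a toy model with a hypothetical policy structure (a dict with "labels" and "action"), not a representation of how Purview stores or evaluates policies.

```python
def enforce(blob_label, policies):
    """Return the action of the first policy matching the label, or None if none applies."""
    for policy in policies:
        if blob_label in policy["labels"]:
            return policy["action"]
    # Labeled, but nothing evaluates the label: the read proceeds unchecked.
    return None


block_prohibited = {
    "labels": {"Confidential-AI-Prohibited", "Highly-Confidential-AI-Prohibited"},
    "action": "block",
}

print(enforce("Confidential-AI-Prohibited", [block_prohibited]))
print(enforce("Confidential-AI-Prohibited", []))
```

With the policy in place the prohibited label yields "block"; with an empty policy list the same label yields None, which is the "just a tag" case.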

For how Purview sensitivity labels integrate with the broader AI Foundry security model, see the [Azure AI Foundry security and RBAC governance guide](/blog/azure-ai-foundry-security-threat-model-rbac-governance).

KQL Detection: Monitoring Classification Gaps

Detecting Access to Unclassified Blobs from AI Workloads

StorageBlobLogs
| where TimeGenerated > ago(7d)
| where OperationName in ("GetBlob", "PutBlob", "ListBlobs")
| where AuthenticationType contains "OAuth"
| extend CallerApp = tostring(UserAgentHeader)
| where CallerApp contains "azure-ai" or CallerApp contains "azureml" or CallerApp contains "aml-"
| extend BlobPath = strcat(AccountName, "/", tostring(split(Uri, "/")[3]))
| summarize AccessCount = count(), CallerIps = make_set(CallerIpAddress)
  by BlobPath, bin(TimeGenerated, 1h)
| where AccessCount > 20
| order by AccessCount desc

Alert threshold: more than 20 blob operations per hour against a single container from clients whose user agent identifies Azure ML or AI Foundry. The query does not check classification itself: it surfaces the containers that AI workloads are touching, so you can cross-reference those paths against your Purview asset catalog and confirm whether they have been scanned and classified.
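The cross-reference step can be as simple as a lookup between the query results and a catalog export. A minimal sketch, assuming the catalog has been exported to a mapping of path to classification names; that export format is hypothetical and is not a Purview API response.

```python
def unclassified_paths(accessed_paths, catalog):
    """Return accessed paths that are absent from the catalog or have no classifications.

    catalog maps a path to a list of classification names (hypothetical export format).
    """
    return sorted(p for p in set(accessed_paths) if not catalog.get(p))


# Paths as produced by the query above: account name plus container.
accessed = ["trainstore/raw", "trainstore/features", "trainstore/scratch", "trainstore/raw"]
catalog = {
    "trainstore/raw": ["MICROSOFT.PERSONAL.EMAIL"],
    "trainstore/features": [],  # scanned, but nothing classified
}
print(unclassified_paths(accessed, catalog))
```

Paths with an empty classification list are reported alongside paths missing from the catalog, because an empty result can mean either clean data or a scan that did not cover the file types present.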

Detecting Large-Scale Extracts of Sensitive Training Data

StorageBlobLogs
| where TimeGenerated > ago(24h)
| where OperationName == "GetBlob"
| where ResponseBodySize > 10000000
// CallerIpAddress includes the source port; strip it so one client is not split
// across connections (assumes IPv4 callers)
| extend CallerIp = tostring(split(CallerIpAddress, ":")[0])
| summarize TotalBytesRead = sum(ResponseBodySize),
            BlobCount = count(),
            Accounts = make_set(AccountName)
  by CallerIp, bin(TimeGenerated, 30m)
| where TotalBytesRead > 1073741824
| project TimeGenerated, CallerIp, TotalGBRead = round(TotalBytesRead / 1073741824.0, 2), BlobCount, Accounts
| order by TotalGBRead desc

Alert threshold: any caller reading more than 1GB in 30 minutes. This catches ad hoc extraction of full training datasets outside of approved pipeline infrastructure. At that volume, you want to confirm it is an approved pipeline job and not an engineer pulling data to a personal machine.
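To test the threshold against exported logs before deploying the alert, the same aggregation can be reproduced offline. The sketch below mirrors the fixed 30-minute buckets that bin() produces; the function name and the sample reads are hypothetical, and it omits the query's per-blob size filter for brevity.

```python
from collections import defaultdict
from datetime import datetime, timedelta

GIB = 1024 ** 3  # 1,073,741,824 bytes, the threshold used in the query above
EPOCH = datetime(1970, 1, 1)


def heavy_readers(reads, threshold_bytes=GIB, bucket=timedelta(minutes=30)):
    """reads: iterable of (timestamp, caller, size_bytes).

    Returns {(bucket_start, caller): total_bytes} for totals above the threshold.
    """
    totals = defaultdict(int)
    for ts, caller, size in reads:
        # Floor the timestamp to the start of its bucket, like bin() in KQL.
        bucket_start = EPOCH + ((ts - EPOCH) // bucket) * bucket
        totals[(bucket_start, caller)] += size
    return {key: total for key, total in totals.items() if total > threshold_bytes}


MIB = 1024 ** 2
day = datetime(2025, 1, 1)
reads = [
    # Caller A: 1200 MiB inside the 09:00 bucket.
    (day.replace(hour=9, minute=1), "10.0.0.4", 400 * MIB),
    (day.replace(hour=9, minute=10), "10.0.0.4", 400 * MIB),
    (day.replace(hour=9, minute=29), "10.0.0.4", 400 * MIB),
    # Caller B: 1200 MiB in 15 minutes, but split across the 09:00 and 09:30 buckets.
    (day.replace(hour=9, minute=25), "10.0.0.9", 400 * MIB),
    (day.replace(hour=9, minute=35), "10.0.0.9", 400 * MIB),
    (day.replace(hour=9, minute=40), "10.0.0.9", 400 * MIB),
]
print(heavy_readers(reads))
```

Caller B reads the same volume as caller A but straddles a bucket boundary and is not flagged. If that matters for your environment, consider a lower threshold or overlapping windows.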

Common Configuration Failures

Scan Credentials Not Persisted After Platform Rotation

Purview scan credentials (the managed identity or service principal that reads your data sources) need to be re-verified when the underlying identity rotates. A common failure mode: the MI used for scanning gets replaced during a platform team rotation, scans start returning zero results, and nobody notices for weeks because there is no scan failure alert. Audit scan run history monthly and set up a Logic App alert on scan runs that return zero assets in sources with known data.
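The zero-asset check is simple enough to express directly. The sketch below assumes scan run history has been exported to a mapping of source name to asset counts per run, oldest first; that structure and the function name are hypothetical, and retrieving the history from Purview is left out.

```python
def suspicious_scans(run_history):
    """run_history: {source_name: [asset counts per run, oldest first]}.

    Flags sources whose latest run found zero assets after an earlier run found some.
    """
    return sorted(
        source
        for source, counts in run_history.items()
        if counts and counts[-1] == 0 and any(c > 0 for c in counts[:-1])
    )


history = {
    "training-data-adls": [1520, 1533, 0],  # dropped to zero: likely broken credentials
    "feature-store-sql": [88, 91, 93],      # healthy
    "new-empty-source": [0, 0],             # never had data, so not flagged
}
print(suspicious_scans(history))
```

Comparing against earlier runs avoids alerting on sources that are legitimately empty.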

Label Inheritance Disabled on Containers

By default, Purview does not propagate labels from parent containers to child blobs. If you label a container as Confidential - AI-Training-Approved, individual blobs inside it do not receive that label unless you explicitly enable label inheritance in the scan rule set. Most teams label the container and only discover that the blobs themselves are unlabeled when they run a DLP report.

ADLS Gen2 Registered as Blob Storage

If your storage account has hierarchical namespace (HNS) enabled, you must register it in Purview as AdlsGen2, not AzureBlobStorage. Registering an HNS storage account as blob results in incomplete directory traversal: training data organized in directory trees (which is universal for ML datasets) will have most assets missed. Check the storage account's HNS setting and match the Purview source kind accordingly.
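This check is easy to automate across many accounts. A minimal sketch, assuming you have already collected each account's HNS setting and its registered Purview source kind into a list of dicts; the field names and function name are hypothetical.

```python
def hns_kind_mismatches(accounts):
    """Return names of HNS-enabled accounts not registered in Purview as AdlsGen2.

    accounts: list of dicts with 'name', 'hns_enabled' (bool), and 'purview_kind' (str).
    """
    return [
        acct["name"]
        for acct in accounts
        if acct["hns_enabled"] and acct["purview_kind"] != "AdlsGen2"
    ]


inventory = [
    {"name": "trainlake01", "hns_enabled": True, "purview_kind": "AdlsGen2"},
    {"name": "trainlake02", "hns_enabled": True, "purview_kind": "AzureBlobStorage"},
    {"name": "artifacts01", "hns_enabled": False, "purview_kind": "AzureBlobStorage"},
]
print(hns_kind_mismatches(inventory))
```

Only the HNS-enabled account registered as blob storage is reported, which is the case that causes incomplete directory traversal.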

Hardening Checklist

  • [ ] Purview account deployed; managed identity granted Storage Blob Data Reader on all AI training storage accounts
  • [ ] ADLS Gen2 sources registered with AdlsGen2 kind, not AzureBlobStorage; HNS traversal confirmed working
  • [ ] Scan rule sets extended to include .parquet, .jsonl, .avro, .orc, .csv, .tsv file types
  • [ ] Personal data classifiers explicitly listed in scan rule set (name, email, SSN, phone, IP, DOB)
  • [ ] Sensitivity label sub-taxonomy created for AI use cases: AI-Training-Approved, AI-RAG-Approved, AI-Prohibited
  • [ ] Labels published to Azure Storage scope, not only to M365 workloads
  • [ ] Label inheritance enabled at container level in scan rule sets
  • [ ] DLP alert policy configured for bulk extraction of AI-Training-Approved data (threshold: more than 5 files per 15 minutes)
  • [ ] DLP block policy configured for AI-Prohibited data accessed by Azure ML or AI Foundry service principals
  • [ ] Data use governance enabled on all training data storage accounts
  • [ ] AI Foundry hub managed identity granted Purview Data Reader role
  • [ ] AI training pipeline scripts tag model artifacts with source dataset sensitivity label and Purview scan ID
  • [ ] KQL alert deployed for unclassified blob access by AI workload principals (threshold: more than 20 reads per hour)
  • [ ] KQL alert deployed for large-scale extracts (threshold: more than 1GB per 30 minutes per caller)
  • [ ] Monthly audit of Purview scan run history to detect broken scan credentials
  • [ ] Foundry project data connections reviewed against Purview governed sources list quarterly

Frequently Asked Questions

Why do traditional DLP controls fail to detect data exfiltration by AI training pipelines?

Traditional DLP policies detect data movement patterns like file downloads, email attachments, and cloud uploads. An AI training pipeline accesses data as bulk blob reads via the Azure Storage API, which appears in storage telemetry but is not intercepted by DLP rules designed for file-transfer events. The pipeline uses a managed identity or service principal that already has authorized read access to the storage account, so the access itself is not anomalous by standard DLP criteria. Detecting this pattern requires monitoring storage API access logs for high-volume reads by AI workload principals and correlating against Purview asset classification to identify unclassified data being accessed.

What is the AI-Training-Approved sensitivity label sub-taxonomy and why is a separate label needed for AI use cases?

The standard sensitivity label taxonomy (Public, Internal, Confidential, Highly Confidential) does not convey whether data is appropriate for use as AI training input. An Internal document may be appropriate for human employees to read but inappropriate to include in a training dataset because it contains personally identifiable information or pending litigation content. The AI sub-taxonomy adds labels such as AI-Training-Approved, AI-RAG-Approved, and AI-Prohibited on top of the base classification to control specifically which data AI pipelines are permitted to access, independent of whether human access to the document is appropriate.

Why must ADLS Gen2 storage accounts be registered as AdlsGen2 in Purview rather than as AzureBlobStorage?

Azure Data Lake Storage Gen2 uses a hierarchical namespace (HNS) that organizes blobs in directory trees rather than flat container structures. When registered as AzureBlobStorage, Purview's scanner traverses the container at the top level and misses blobs organized in subdirectories, which is universal for machine learning datasets. Registering as AdlsGen2 enables directory-aware traversal so the scanner finds all assets, including training files organized in category or year-based subdirectories. The HNS setting on the storage account is the indicator: if HNS is enabled, the Purview source kind must be AdlsGen2.

What is data use governance in the context of Azure AI Foundry and Purview integration?

Data use governance is a Microsoft Purview setting that marks registered storage accounts as governed sources, requiring that any data connection from a Foundry project to that storage account be reviewed against Purview classification before the connection is approved. Enabling it creates a workflow where a data steward can approve or reject a Foundry project's request to access a governed data source based on the Purview classification of the data in that account. This provides a review gate that prevents AI projects from silently connecting to storage accounts that contain regulated or prohibited training data.

How should organizations handle Purview scan credentials that rotate as part of platform maintenance?

The most common failure mode is that the managed identity used for Purview scanning is replaced during a platform team rotation, after which scans return zero results because the new identity does not have Storage Blob Data Reader access on the scanned accounts. The detection gap can persist for weeks if there is no alert on scan failures. The operational controls are: an automated monthly audit of Purview scan run history that alerts on runs returning zero assets in known data-containing sources, a change management requirement that any platform MI rotation includes updating Purview scan credential registrations, and a Logic App alert triggered by Purview scan runs that complete with zero new assets on a source with previously scanned data.

