
Uniqueness: Configuration Scenarios

Three practical walkthroughs showing how to configure DQS uniqueness analysis for different business needs.

What These Scenarios Cover

This page walks through three real-world configurations of DQS uniqueness analysis. Each scenario covers a specific business problem, shows the exact settings to use, and explains how to read the results.

These walkthroughs build on the concepts from the main Uniqueness article. Read that first if you are new to uniqueness metrics, the diagnostic layers, or the difference between Basic Uniqueness and Advanced Uniqueness Analysis.

Scenario 1: Email Deduplication Audit on Leads

The Problem

Your marketing team runs nurture campaigns through Salesforce. Open rates are declining, and the email platform reports a rising number of “duplicate sends”: the same person receiving the same email twice. Your duplicate management rules catch exact-match records, but partial duplicates slip through. Two Lead records for the same person with the same email address both receive the campaign. You need a concrete number: how many Lead email addresses are shared across multiple records?

Configuration

This is a straightforward duplicate detection check. Use Basic Uniqueness mode on the Lead object, targeting the Email field.

Setting | Value | Why
Analysis Mode | Basic Uniqueness | You need the duplication rate and distinct count, not distribution or boilerplate analysis
Case Sensitive | OFF | Email addresses are case-insensitive. “John@Company.com” and “john@company.com” are the same address.
Include Blanks | ON | A blank email on a Lead is a problem worth quantifying. Including blanks means all empty email records share one “blank” value, lowering the Uniqueness Rate and making the gap visible.

Case Sensitive OFF is the default and the correct choice for email. If two records store “jsmith@acme.com” and “JSmith@Acme.com”, those are the same address. Enabling case sensitivity would count them as distinct and hide the duplicate.
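
As a minimal sketch of why the Case Sensitive setting matters, the following Python snippet counts distinct values with and without case folding. The function name `distinct_count` and the sample addresses are illustrative and not part of DQS.

```python
def distinct_count(values, case_sensitive=False):
    """Count distinct values, optionally ignoring letter case."""
    if case_sensitive:
        return len(set(values))
    # casefold() is a stricter form of lower() intended for caseless matching
    return len({v.casefold() for v in values})


emails = ["jsmith@acme.com", "JSmith@Acme.com", "adoe@acme.com"]

print(distinct_count(emails, case_sensitive=False))  # 2: the duplicate is visible
print(distinct_count(emails, case_sensitive=True))   # 3: the duplicate is hidden
```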

Sample Results

Foundation Metrics:

Metric | Value
Uniqueness Rate | 74%
Distinct Count | 18,500

Total Lead records evaluated: 25,000.

Reading the Results

Start with the headline: 74% uniqueness. Of 25,000 Leads, only 18,500 distinct email addresses exist. The remaining 6,500 records (26%) repeat an email address that already appears on another Lead.
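
The arithmetic behind these figures can be checked in a few lines of Python. This sketch assumes Uniqueness Rate is Distinct Count divided by total records evaluated, which matches the sample numbers on this page; it is not taken from the DQS implementation.

```python
total_records = 25_000
distinct_count = 18_500

# Assumed definition: distinct values as a share of evaluated records
uniqueness_rate = distinct_count / total_records
# Records beyond the first occurrence of each value
redundant_records = total_records - distinct_count

print(f"Uniqueness Rate: {uniqueness_rate:.0%}")    # Uniqueness Rate: 74%
print(f"Redundant records: {redundant_records:,}")  # Redundant records: 6,500
```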

What 26% duplicate emails look like in practice. Some are legitimate: department addresses like info@company.com or sales@company.com shared across multiple contacts at the same company. Most are duplicate Leads created by different sources. A web form creates one Lead. A list import creates another. A sales rep creates a third from a business card. All three have the same email address.

Include Blanks ON reveals the full picture. With Include Blanks enabled, Leads with no email address all share a single “blank” value. If 2,000 of the 25,000 Leads have no email, those 2,000 records count as duplicates of each other. This lowers the Uniqueness Rate compared to excluding blanks, but it gives you the honest number. Your campaign can reach 18,500 distinct addresses at best, not 25,000. Whenever any Lead has no email, one of those 18,500 distinct values is the blank itself.
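
To make the effect of Include Blanks concrete, here is a small sketch under the assumption that Uniqueness Rate equals distinct values divided by evaluated records. The `uniqueness` helper and its treatment of `None` and empty strings as one shared blank value are illustrative; DQS may define blanks differently.

```python
def uniqueness(values, include_blanks=True):
    """Return (uniqueness_rate, distinct_count) for a list of field values."""
    # Normalize: trim, casefold, and map None or empty strings to one blank marker ""
    normalized = [(v or "").strip().casefold() for v in values]
    if not include_blanks:
        normalized = [v for v in normalized if v]
    if not normalized:
        return 0.0, 0
    distinct = len(set(normalized))
    return distinct / len(normalized), distinct


emails = ["a@x.com", "b@x.com", "c@x.com", None, "", None]

print(uniqueness(emails, include_blanks=True))   # (0.666..., 4): blanks collapse into one value
print(uniqueness(emails, include_blanks=False))  # (1.0, 3): blank records are ignored
```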

Why Basic Uniqueness is enough here. The question is “how many emails are duplicated?” Uniqueness Rate and Distinct Count answer that question. You do not need Entropy or Rarity to decide whether to launch a deduplication project. If you later want to understand the distribution pattern (how many emails appear exactly twice vs ten times), switch to Advanced Uniqueness Analysis for the full picture.

What to Do Next

Use Distinct Count (18,500) as your real addressable audience for email campaigns. Scope a deduplication project for the records with shared emails. Start by exporting Leads grouped by email address, then merge or delete the duplicates. After cleanup, run the scan again and track Uniqueness Rate over time. If it drops between scans, a new duplicate source has appeared: a list import, a web form without dedup logic, or an integration creating records without checking for existing ones.
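
One way to prepare that export is sketched below. It assumes the Leads have already been pulled into a list of dictionaries with `Id` and `Email` keys (for example from a report or CSV export); the function name and sample records are hypothetical.

```python
from collections import defaultdict


def duplicate_groups(leads):
    """Group Lead Ids by normalized email and keep groups with more than one record."""
    groups = defaultdict(list)
    for lead in leads:
        email = (lead.get("Email") or "").strip().casefold()
        if email:  # blank emails are a separate cleanup task
            groups[email].append(lead["Id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


leads = [
    {"Id": "L-001", "Email": "jsmith@acme.com"},  # e.g. created by a web form
    {"Id": "L-002", "Email": "JSmith@Acme.com"},  # e.g. created by a list import
    {"Id": "L-003", "Email": "adoe@acme.com"},
    {"Id": "L-004", "Email": None},
]

print(duplicate_groups(leads))
# {'jsmith@acme.com': ['L-001', 'L-002']}
```

Each key in the result is a shared address and each list is a set of merge candidates to review.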


Scenario 2: Industry Field Distribution on Accounts

The Problem

Your data team built an Account segmentation model that groups customers by Industry. The model uses 24 industry picklist values to create targeted segments. But the segments are uneven: two segments contain 70% of all Accounts, while the remaining 22 segments split the other 30%. The data science team suspects the Industry field has a distribution problem, not a model problem. You need to confirm whether the field’s value distribution is genuinely skewed and identify the dominant values.

Configuration

Use Advanced Uniqueness Analysis mode on the Account object, targeting the Industry field. You need distribution metrics (Entropy, Max Frequency, Rarity) to answer questions about how values are spread.

Setting | Value | Why
Analysis Mode | Advanced Uniqueness Analysis | You need Entropy, Max Frequency, and Rarity for distribution analysis
Case Sensitive | OFF | Picklist values are controlled. Case sensitivity is not relevant here.
Include Blanks | OFF | Blank Industry values are a completeness problem, not a uniqueness problem. Exclude them to focus on the distribution of populated values.

Include Blanks OFF is the right choice for this scenario. You are analyzing how the existing data is distributed across categories. Adding blanks into the calculation would distort the distribution metrics without answering your segmentation question. If you want to know how many Accounts have no Industry value, run a completeness analysis instead.

Sample Results

Foundation Metrics:

Metric | Value
Uniqueness Rate | 0.16%
Distinct Count | 24

Advanced Metrics:

Metric | Value
Entropy | 2.18
Max Frequency | 5,200
Rarity | 0%

Total Account records evaluated: 15,000.

Reading the Results

Uniqueness Rate (0.16%) is expected and irrelevant here. Industry is a picklist with 24 values across 15,000 records. Almost every value is shared by hundreds of records. A low Uniqueness Rate on a picklist field is normal. This metric is not the point of this analysis.

Distinct Count (24) confirms your picklist is intact. All 24 configured values appear in the data. No rogue free-text entries exist. The data is clean from a consistency standpoint.

Entropy (2.18) reveals the skew. Maximum entropy for 24 distinct values is log2(24) = 4.58. Your actual entropy is 2.18. The normalized score is 2.18 / 4.58 = 0.48. That falls well below the 0.7 threshold for “dominated” distributions. A few values hold most of the records. Your data science team’s suspicion is confirmed: the segmentation problem is in the data, not the model.
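
The normalization step can be reproduced as follows. The sketch assumes Entropy is Shannon entropy in bits over the distribution of values, which is consistent with the log2(24) maximum quoted above; the `shannon_entropy` helper is illustrative.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Normalize the reported figure against the maximum for 24 distinct values
max_entropy = math.log2(24)
normalized = 2.18 / max_entropy
print(f"{max_entropy:.2f} {normalized:.2f}")  # 4.58 0.48

# Sanity check on toy data: four equally common values reach the maximum of 2 bits
print(shannon_entropy(["A", "B", "C", "D"]))  # 2.0
```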

How to interpret normalized entropy:

Normalized (actual / max) | Interpretation
0.9 or above | Even distribution: values spread uniformly
0.7 to 0.9 | Moderate skew: some values appear more than others
Below 0.7 | Dominated: a few values hold most of the records

Your score of 0.48 is in the “dominated” range.
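
The bands in the table translate directly into a small helper. The function name is hypothetical; the thresholds are the ones listed above.

```python
def interpret_normalized_entropy(score):
    """Map a normalized entropy score (actual / max) to the bands in the table."""
    if score >= 0.9:
        return "Even distribution"
    if score >= 0.7:
        return "Moderate skew"
    return "Dominated"


print(interpret_normalized_entropy(0.48))  # Dominated
```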

Max Frequency (5,200) identifies the dominant value. One industry value appears on 5,200 of 15,000 records, or 34.7% of the dataset. A quick check reveals it is “Technology.” The second most common value is likely responsible for most of the remaining concentration. Together, two values account for the 70% clustering your team observed.
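
Max Frequency is straightforward to reproduce from raw values with `collections.Counter`. The toy list below is illustrative; the final lines check the share quoted above.

```python
from collections import Counter

industries = ["Technology"] * 5 + ["Finance"] * 4 + ["Retail"] * 1

# most_common(1) returns the single most frequent value and its count
top_value, max_frequency = Counter(industries).most_common(1)[0]
print(top_value, max_frequency)  # Technology 5

# Share of the dataset held by the dominant value in the sample results
share = 5_200 / 15_000
print(f"{share:.1%}")  # 34.7%
```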

Rarity (0%) confirms there is no long tail. Every one of the 24 distinct values appears more than once. No singleton values exist. This is expected for a well-controlled picklist field. On a free-text field, you would want to see Rarity to catch typos and one-off entries, but on a picklist, 0% Rarity is normal.
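
A sketch of a singleton check follows. It assumes Rarity is the share of distinct values that appear exactly once. The text above only establishes that 0% means no singletons, so the choice of denominator is an assumption.

```python
from collections import Counter


def rarity(values):
    """Share of distinct values that occur exactly once (assumed definition)."""
    counts = Counter(values)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


picklist = ["Technology", "Technology", "Finance", "Finance"]
free_text = ["Technology", "Technology", "Tecnology", "Finance", "Finance"]

print(rarity(picklist))   # 0.0: every value repeats
print(rarity(free_text))  # 0.333...: the typo "Tecnology" is a singleton
```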

The segmentation verdict: Your 24-category model is really a 2-category system. “Technology” and one other industry dominate the dataset. The remaining 22 categories share 30% of records, giving each category an average of about 200 records. Some segments are too small for meaningful analysis.

What to Do Next

Present Entropy and Max Frequency to your data science team. The numbers confirm the distribution problem. Two options: (1) Redesign the segmentation model to use fewer, broader categories that reflect the actual distribution. Group the 22 smaller industries into 4-5 macro-categories. (2) Enrich the Industry data. If the concentration in “Technology” is inflated because reps default to it during record creation, investigate whether a large portion of those 5,200 records belong to a different industry. Run a periodic scan and track Entropy over time. As you correct misclassified records, Entropy rises toward a healthier distribution.


Scenario 3: Case Description Boilerplate Detection for AI Readiness

The Problem

Your company is evaluating AI-powered case summarization for the support team. The AI tool reads the Description field on Cases and generates a summary for the next agent who picks up the case. Before investing, you need to assess whether your case descriptions contain enough original content for the AI to produce useful summaries. The field is populated on 95% of cases, so completeness is not the concern. The concern is that support agents copy-paste standard templates into every case.

Configuration

Use Advanced Uniqueness Analysis mode on the Case object, targeting the Description field. You need the boilerplate metrics to evaluate content originality.

Setting | Value | Why
Analysis Mode | Advanced Uniqueness Analysis | Enables boilerplate detection (Boilerplate Rate, Boilerplate Percentage, Boilerplate Records Count)
Case Sensitive | OFF | Template detection does not depend on casing
Include Blanks | OFF | Empty descriptions are a completeness problem. Exclude them to focus on the quality of populated content.

Include Blanks OFF makes sense here because you are evaluating the content that exists, not counting the content that is missing. The 5% of cases with empty descriptions are already handled by your completeness analysis.

Sample Results

Foundation Metrics:

Metric | Value
Uniqueness Rate | 97%
Distinct Count | 29,100

Advanced Metrics:

Metric | Value
Entropy | 14.8
Boilerplate Rate | 42%
Boilerplate Percentage | 68%
Boilerplate Records Count | 20,400

Total Case records evaluated: 30,000.

Reading the Results

Uniqueness Rate (97%) looks healthy, but it is misleading. Nearly every case description is technically different because each contains unique case numbers, customer names, and dates. The field passes a basic duplication check. But “unique” does not mean “original.”

Boilerplate Rate (42%) tells the real story. 42% of the text content across case descriptions is repetitive or templated. Agents paste standard openings (“Thank you for contacting support. Your case number is…”), standard closings (“Please do not hesitate to reach out if you have further questions.”), and standard diagnostic checklists into every case. The case-specific details fill the middle, but on average about two-fifths of the description text is copy-paste content.

Boilerplate Percentage (68%) shows how widespread the problem is. 68% of case records contain templated text. That is 20,400 out of 30,000 cases. The boilerplate is not limited to a few agents or one team. It is a systemic pattern embedded in your support process.

Boilerplate Records Count (20,400) is your scope number. If you need to estimate the effort to clean up templates before feeding data to the AI, this is the starting point. 20,400 records contain content that the AI will learn as patterns, but those patterns are your templates, not your customer issues.
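
This page does not describe how DQS detects boilerplate. Purely to illustrate how the three metrics relate to each other, the sketch below uses a simple stand-in heuristic: a sentence counts as boilerplate if it appears in at least `min_records` descriptions. The function name, the sentence splitting, and the threshold are all assumptions.

```python
import re
from collections import Counter


def boilerplate_metrics(descriptions, min_records=2):
    """Return (rate, percentage, records_count) under a sentence-frequency heuristic."""
    # Split each description into normalized sentences
    split = [
        [s.strip().casefold() for s in re.split(r"(?<=[.!?])\s+", d) if s.strip()]
        for d in descriptions
    ]
    # Count how many records contain each sentence (set() avoids double counting)
    record_freq = Counter(s for sentences in split for s in set(sentences))
    boilerplate = {s for s, n in record_freq.items() if n >= min_records}

    total_chars = sum(len(s) for sentences in split for s in sentences)
    boiler_chars = sum(len(s) for sentences in split for s in sentences if s in boilerplate)
    records_count = sum(1 for sentences in split if any(s in boilerplate for s in sentences))

    rate = boiler_chars / total_chars if total_chars else 0.0   # share of text
    percentage = records_count / len(split) if split else 0.0  # share of records
    return rate, percentage, records_count


cases = [
    "Thank you for contacting support. Printer offline after firmware update.",
    "Thank you for contacting support. Customer cannot reset password.",
    "VPN drops every ten minutes on the Berlin office network.",
]
rate, percentage, records_count = boilerplate_metrics(cases)
print(records_count, f"{percentage:.0%}")  # 2 67%

# The sample results relate the same way: 20,400 of 30,000 records
print(f"{20_400 / 30_000:.0%}")  # 68%
```

Note that the rate is measured over text volume while the percentage is measured over records, which is why the two numbers can differ as much as 42% and 68%.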

The AI readiness verdict: The AI summarization tool will process templated content on 68% of cases. It will learn to summarize your templates, not your customer problems. On the 32% of cases with original content, the AI will perform well. On the 68% with boilerplate, the summaries will echo back the standard phrases agents already know by heart.

Entropy (14.8) is high, confirming that the descriptions are diverse at the value level: almost no description repeats verbatim. This aligns with the 97% Uniqueness Rate: each description is different. Entropy is not the relevant metric here because the duplication problem is not identical values. The problem is repeated content patterns within otherwise unique text. That is exactly what the boilerplate metrics are designed to catch.
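
For context, and assuming Entropy is measured in bits over the distribution of values as in Scenario 2, the maximum for 29,100 distinct values can be computed directly:

```python
import math

max_entropy = math.log2(29_100)
print(f"{max_entropy:.2f}")         # 14.83
print(f"{14.8 / max_entropy:.2f}")  # 1.00
```

A normalized score this close to 1.0 is what you would expect when nearly every value is distinct, so Entropy alone cannot surface the template problem.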

What to Do Next

Present Boilerplate Rate (42%) and Boilerplate Percentage (68%) to your AI project stakeholders. The numbers make the case: the AI project needs a content quality improvement phase before deployment. Three approaches to reduce boilerplate:

  • Remove the templates. If agents are pasting standard openings and closings, build those elements into the case layout or a screen flow so they do not pollute the description field. The description then captures only case-specific information.
  • Train agents on effective descriptions. Share examples of high-quality descriptions (from the 32% that are original) and explain why template-free entries produce better AI summaries.
  • Strip boilerplate from historical data. Before feeding existing cases to the AI, run a text processing job that removes known template patterns from the description field.
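
As a rough sketch of the third approach, the snippet below removes known template phrases with regular expressions. The two patterns are modeled on the example opening and closing quoted earlier on this page; a real job would need patterns built from your own templates.

```python
import re

# Hypothetical template patterns; replace with the phrases your agents actually paste
TEMPLATE_PATTERNS = [
    re.compile(r"thank you for contacting support\.\s*", re.IGNORECASE),
    re.compile(
        r"please do not hesitate to reach out if you have further questions\.\s*",
        re.IGNORECASE,
    ),
]


def strip_boilerplate(description):
    """Remove known template phrases and tidy leftover whitespace."""
    for pattern in TEMPLATE_PATTERNS:
        description = pattern.sub("", description)
    return re.sub(r"\s+", " ", description).strip()


raw = (
    "Thank you for contacting support. Printer offline after firmware update. "
    "Please do not hesitate to reach out if you have further questions."
)
print(strip_boilerplate(raw))  # Printer offline after firmware update.
```

Running the job on a copy of the data first makes it easier to review what was removed before anything is written back.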

Run the scan again after each improvement cycle. Track Boilerplate Rate and Boilerplate Percentage as your primary AI readiness metrics for this field. Your target: Boilerplate Percentage below 30% and Boilerplate Rate below 20% before deploying the AI summarization tool.
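
If you track these metrics in a script, the deployment gate can be expressed as a simple check. The function name is hypothetical; the thresholds are the targets stated above, with both metrics passed as fractions.

```python
def ready_for_ai(boilerplate_rate, boilerplate_percentage):
    """Check both metrics against the stated targets (below 20% and below 30%)."""
    return boilerplate_rate < 0.20 and boilerplate_percentage < 0.30


print(ready_for_ai(0.42, 0.68))  # False: the current sample results
print(ready_for_ai(0.15, 0.25))  # True
```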


Choosing Your Configuration

Use this table to pick the right starting point for your uniqueness analysis.

If You Need To… | Start With | Key Settings
Audit duplicate values on an identifier field (Email, Phone, Account Name) | Basic Uniqueness | Case Sensitive: OFF, Include Blanks: ON to reveal blank volume
Size a deduplication project with a concrete record count | Basic Uniqueness | Use Distinct Count to calculate the gap between total records and unique values
Analyze value distribution on a picklist or categorical field | Advanced Uniqueness Analysis | Review Entropy (normalized against max), Max Frequency, and Rarity
Detect templated content in text fields before an AI project | Advanced Uniqueness Analysis | Review Boilerplate Rate, Boilerplate Percentage, and Boilerplate Records Count
Determine whether a “healthy” uniqueness score hides deeper problems | Advanced Uniqueness Analysis | Pair Uniqueness Rate with Entropy (for distribution skew) or Boilerplate Rate (for content originality)

For a full reference of all 8 uniqueness metrics, the three diagnostic layers, and configuration details, return to the main Uniqueness article.
