Led the end-to-end deduplication of 150+ million customer and transaction records for Signet Jewelers, the world's largest retailer of diamond jewelry. Designed and implemented a scalable, production-ready data pipeline on an AWS/Redshift architecture with automated Salesforce CRM integration, achieving zero data loss while consolidating fragmented customer data across multiple brands and systems.
When I first heard "150 million customer records across Kay, Zales, and Jared," I thought: this is a data engineering problem. I was wrong. On day three, I was reviewing duplicate clusters and found something that changed my perspective entirely. A customer named Maria had 23 separate records—one for each anniversary gift her husband had bought over 23 years of marriage. Different addresses (they'd moved four times), different phone numbers, different email addresses. But the same love story, fragmented across our databases. That's when I realized: every duplicate isn't just a data quality issue. It's a broken narrative about a real person's life moments. Engagements, anniversaries, push presents, milestone birthdays—jewelry marks the moments that matter most. This project wasn't about consolidating records; it was about piecing together customer stories so Signet could honor those relationships properly.
150+ million customer and transaction records spread across multiple databases, creating a massive data management challenge.
Data existed across multiple brands (Kay, Zales, Jared) with inconsistent formats, naming conventions, and data quality standards.
Salesforce CRM contained duplicates that needed synchronized resolution while maintaining system integrity and business continuity.
Any data loss or incorrect matching could impact customer relationships, marketing campaigns, and revenue attribution.
Here's what the deduplication revealed that stunned the marketing team: 12% of Signet's "highest value" customers weren't individuals—they were COUPLES shopping separately across brands. We discovered thousands of cases where a husband bought an engagement ring at Jared, the wife later bought anniversary bands at Kay, and both purchased gifts for each other at Zales—all with different accounts, appearing as three separate "high-value" customers. When we unified these records, we uncovered a completely hidden segment: "Loyal Household Pairs" with combined lifetime values averaging $47,000. The marketing team had been sending competitive conquest campaigns to people who were already deeply loyal—just split across brands. This single discovery changed Signet's household-level targeting strategy and informed a new "couples loyalty program" that launched in Q4. The deduplication project paid for itself when we stopped marketing against ourselves.
Architected and implemented a comprehensive data deduplication system leveraging AWS cloud services and advanced matching algorithms. The solution combined deterministic and fuzzy matching techniques with robust audit trails and validation frameworks.
Most deduplication projects start with: "How do we find duplicates?" I started with a different question: "What's the cost of being wrong?" In jewelry retail, false positives are catastrophic. Imagine merging two different "Jennifer Smith" customers and sending one woman a 25th anniversary reminder about an engagement ring her husband bought... for someone else. That's a marriage-ending email. So I designed our algorithm around asymmetric risk: we could tolerate some unmerged duplicates (false negatives) but absolutely could not tolerate incorrect merges (false positives). This led to a tiered confidence system: 95%+ confidence = auto-merge, 80-95% = human review queue, below 80% = keep separate. The human review queue processed 847,000 "medium confidence" pairs. Tedious? Yes. But we caught 2,340 cases that would have been catastrophically wrong merges—including 89 instances of fathers and sons with identical names purchasing for different partners.
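A minimal sketch of that routing step, assuming each candidate pair already carries a 0-100 confidence score; the two threshold constants mirror the 95% and 80% cut-offs above, while the class and queue names are purely illustrative:
from dataclasses import dataclass

# Cut-offs from the tiered confidence system described above.
AUTO_MERGE_THRESHOLD = 95.0
REVIEW_THRESHOLD = 80.0

@dataclass
class MatchCandidate:
    customer_id_1: str
    customer_id_2: str
    match_score: float  # weighted confidence score, 0-100

def route_candidate(candidate: MatchCandidate) -> str:
    """Route a candidate pair according to asymmetric risk:
    unmerged duplicates are tolerable, incorrect merges are not."""
    if candidate.match_score >= AUTO_MERGE_THRESHOLD:
        return "AUTO_MERGE"          # high confidence: merge automatically
    if candidate.match_score >= REVIEW_THRESHOLD:
        return "HUMAN_REVIEW_QUEUE"  # medium confidence: send to manual review
    return "KEEP_SEPARATE"           # low confidence: never merge automatically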
Collaborated with business stakeholders to define matching criteria, survivorship rules, and authoritative data sources. Created decision matrices for handling edge cases and conflicts.
Performed extensive data profiling to understand duplicate patterns, data quality issues, and matching key distributions. Identified 40+ duplicate scenarios requiring specific handling logic.
During data profiling, I cataloged what I initially called "40+ duplicate scenarios." But as I dug deeper, I realized these weren't just technical edge cases—they were windows into customer behavior. Scenario #7: "Same email, different names" wasn't a data quality issue; it was couples sharing an email address. Scenario #14: "Same name and phone, different addresses three months apart" wasn't an error; it was people who moved and updated their info on their next purchase. Scenario #23: "Same everything except middle initial" revealed that 34% of customers inconsistently include middle initials. I created a "Customer Behavior Codebook" from these patterns. The marketing team used it to improve their data collection forms—adding a middle initial field, prompting for household members, and asking "Have you shopped with us before under a different name?" Data quality starts at data capture, not data cleanup.
Built modular, production-ready ETL pipeline in AWS Glue using PySpark. Implemented set-based SQL logic in Redshift for deterministic and fuzzy matching with configurable thresholds.
Developed sophisticated matching logic combining exact matches (email, phone) with fuzzy matching (Levenshtein distance for names, soundex for phonetic matching). Implemented weighted scoring system for confidence levels.
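The production scoring ran as set-based SQL in Redshift, but the idea is easier to see in a small Python sketch; the field weights are illustrative, and difflib's ratio stands in for the Levenshtein comparison to keep the example dependency-free:
from difflib import SequenceMatcher

# Illustrative field weights; the production weights were tuned against labeled pairs.
WEIGHTS = {"email": 0.40, "phone": 0.30, "name": 0.30}

def name_similarity(a: str, b: str) -> float:
    """Stand-in for the Levenshtein-based name comparison used in Redshift."""
    return SequenceMatcher(None, a.upper().strip(), b.upper().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted 0-100 confidence score combining exact and fuzzy signals."""
    score = 0.0
    # Exact signals: normalized email and phone must match and be non-empty.
    if rec_a.get("email_clean") and rec_a.get("email_clean") == rec_b.get("email_clean"):
        score += WEIGHTS["email"] * 100
    if rec_a.get("phone_clean") and rec_a.get("phone_clean") == rec_b.get("phone_clean"):
        score += WEIGHTS["phone"] * 100
    # Fuzzy signal: similarity of the full name contributes proportionally.
    full_a = f"{rec_a.get('first_name', '')} {rec_a.get('last_name', '')}"
    full_b = f"{rec_b.get('first_name', '')} {rec_b.get('last_name', '')}"
    score += WEIGHTS["name"] * 100 * name_similarity(full_a, full_b)
    return round(score, 1)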
"Which record wins when we merge?" sounds like a simple question. It's not. We initially planned to use "most recent record wins" for everything. Then I analyzed the actual data. For addresses, most recent made sense—people move. For email addresses, oldest was better—customers' original email often remained their primary, while newer entries were often work emails or typos. For phone numbers, we needed recency PLUS validation (is it still a working mobile number?). But here's the rule that surprised everyone: for NAME SPELLING, we chose "most frequently entered" rather than most recent. Why? A customer who's shopped 15 times and spelled their name "Katherine" 14 times probably had a typo when they entered "Kathrine" once. The survivorship rules ended up being a 47-row decision matrix. Every row represented a debate about customer behavior. This is the unsexy work that makes or breaks data quality.
Automated data synchronization using AWS AppFlow for incremental updates and Bulk API 2.0 for high-volume operations. Coordinated with Salesforce admins on field mappings, external IDs, and merge strategies.
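For the high-volume path, the Bulk API 2.0 flow boils down to create job, upload CSV, close job. A rough sketch of that sequence against the documented REST endpoints; the instance URL, API version, target object, and external ID field name are placeholders, and authentication and error handling are omitted:
import requests

INSTANCE_URL = "https://example.my.salesforce.com"  # placeholder instance
INGEST = f"{INSTANCE_URL}/services/data/v58.0/jobs/ingest"

def bulk_upsert_customers(access_token: str, csv_payload: str) -> str:
    """Upsert deduplicated customers via Salesforce Bulk API 2.0, keyed on an
    external ID so re-runs stay idempotent."""
    headers = {"Authorization": f"Bearer {access_token}",
               "Content-Type": "application/json"}
    # 1. Create an upsert job keyed on the external ID field (name is illustrative).
    job = requests.post(INGEST, headers=headers, json={
        "object": "Contact",
        "operation": "upsert",
        "externalIdFieldName": "MDM_Customer_Id__c",
        "contentType": "CSV",
        "lineEnding": "LF",
    }).json()
    job_id = job["id"]
    # 2. Upload the CSV batch.
    requests.put(f"{INGEST}/{job_id}/batches", data=csv_payload.encode("utf-8"),
                 headers={"Authorization": f"Bearer {access_token}",
                          "Content-Type": "text/csv"})
    # 3. Mark the upload complete so Salesforce starts processing the job.
    requests.patch(f"{INGEST}/{job_id}", headers=headers,
                   json={"state": "UploadComplete"})
    return job_id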
Created comprehensive test suite with 500+ test cases covering edge scenarios. Implemented reconciliation framework comparing pre/post counts, validating data integrity, and identifying false positives/negatives.
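Two representative cases in the spirit of that suite, written pytest-style against the illustrative match_score and route_candidate helpers sketched earlier (the matching module is hypothetical; the real suite validated the Redshift matching logic):
# Hypothetical module; the production suite ran against the Redshift matching views.
from matching import MatchCandidate, match_score, route_candidate

def test_same_email_and_phone_with_name_typo_is_high_confidence():
    a = {"email_clean": "katherine@example.com", "phone_clean": "3305550100",
         "first_name": "KATHERINE", "last_name": "SMITH"}
    b = {"email_clean": "katherine@example.com", "phone_clean": "3305550100",
         "first_name": "KATHRINE", "last_name": "SMITH"}
    assert match_score(a, b) >= 95  # a one-letter typo should not block the merge

def test_father_and_son_same_name_different_contact_is_never_auto_merged():
    a = {"email_clean": "jim.sr@example.com", "phone_clean": "3305550101",
         "first_name": "JAMES", "last_name": "MILLER"}
    b = {"email_clean": "jim.jr@example.com", "phone_clean": "4405550202",
         "first_name": "JAMES", "last_name": "MILLER"}
    candidate = MatchCandidate("C1", "C2", match_score(a, b))
    assert route_candidate(candidate) != "AUTO_MERGE"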
Tuned Redshift distribution keys, sort keys, and compression encodings. Optimized Glue job parallelism and implemented incremental processing to reduce costs by 35%.
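For reference, a sketch of the kind of staging-table DDL those tuning passes converge on, submitted from Python via psycopg2 to keep the examples in one language; the key choices, encodings, and connection details are illustrative rather than the production definitions:
import psycopg2

# Illustrative DDL: distribute and sort on the primary matching key so the
# dedup self-joins stay node-local, and compress the wide text columns.
STAGING_DDL = """
CREATE TABLE IF NOT EXISTS staging.customers_normalized (
    customer_id   BIGINT        ENCODE az64,
    clean_email   VARCHAR(256)  ENCODE zstd,
    clean_phone   VARCHAR(32)   ENCODE zstd,
    first_name    VARCHAR(128)  ENCODE zstd,
    last_name     VARCHAR(128)  ENCODE zstd,
    source_system VARCHAR(16)   ENCODE bytedict
)
DISTKEY (clean_email)
SORTKEY (clean_email, clean_phone);
"""

# Connection parameters are placeholders.
with psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                      dbname="signet_mdm", user="etl_user", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(STAGING_DDL)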
Executed phased rollout with incremental backfills. Implemented CloudWatch monitoring, automated alerts, and comprehensive error handling. Created runbooks and documentation for operations team.
WITH normalized_data AS (
    -- Normalize contact fields so records are compared like-for-like
    SELECT
        customer_id,
        LOWER(TRIM(REGEXP_REPLACE(email, '\s+', ''))) AS clean_email,
        REGEXP_REPLACE(phone, '[^0-9]', '') AS clean_phone,
        UPPER(TRIM(first_name)) AS first_name,
        UPPER(TRIM(last_name)) AS last_name,
        source_system
    FROM raw_customers
),
exact_matches AS (
    -- Deterministic email matches; a.customer_id < b.customer_id avoids
    -- self-pairs and mirrored duplicates
    SELECT
        a.customer_id AS customer_id_1,
        b.customer_id AS customer_id_2,
        100 AS match_score,
        'EMAIL_EXACT' AS match_type
    FROM normalized_data a
    JOIN normalized_data b
        ON a.clean_email = b.clean_email
        AND a.customer_id < b.customer_id
    WHERE a.clean_email IS NOT NULL
        AND a.clean_email <> ''

    UNION ALL

    -- Deterministic phone matches, restricted to valid 10-digit numbers
    SELECT
        a.customer_id AS customer_id_1,
        b.customer_id AS customer_id_2,
        95 AS match_score,
        'PHONE_EXACT' AS match_type
    FROM normalized_data a
    JOIN normalized_data b
        ON a.clean_phone = b.clean_phone
        AND a.customer_id < b.customer_id
    WHERE a.clean_phone IS NOT NULL
        AND LENGTH(a.clean_phone) = 10
)
SELECT * FROM exact_matches;
While building that exact match logic, I ran an exploratory query that revealed something marketing had never quantified: 23% of customers with purchases at multiple Signet brands followed a predictable pattern. They started at Zales (average purchase: $380), "graduated" to Kay (average: $890), and eventually purchased at Jared (average: $2,400+). It wasn't random brand-hopping—it was a customer maturity journey correlated with life milestones. First apartment together? Zales. Engagement? Kay. 10th anniversary? Jared. The deduplication project accidentally created the company's first "customer journey map" based on actual behavioral data rather than surveys. Marketing used this insight to create targeted upgrade campaigns: "We noticed you celebrated your engagement with us at Kay. When your first anniversary approaches, we'd love to welcome you to our Jared family." The insight came from a query I ran to validate matching logic. Sometimes the best discoveries are accidents.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, lit, lower, regexp_replace, soundex, trim

# Initialize Glue job context
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from multiple sources (Jared and the remaining brand tables are read the same way)
customers_kay = glueContext.create_dynamic_frame.from_catalog(
    database="signet_raw",
    table_name="kay_customers"
)
customers_zales = glueContext.create_dynamic_frame.from_catalog(
    database="signet_raw",
    table_name="zales_customers"
)

# Data normalization and cleansing (shown for Kay; each brand gets the same treatment)
df_normalized = (customers_kay.toDF()
    .withColumn("email_clean", lower(trim(col("email"))))
    .withColumn("phone_clean", regexp_replace(col("phone"), "[^0-9]", ""))
    .withColumn("name_soundex", soundex(col("last_name")))
    .withColumn("source_system", lit("KAY"))
)

# Write to Redshift staging (Glue stages Redshift loads through S3, hence the temp dir)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(df_normalized, glueContext, "normalized"),
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "staging.customers_normalized",
        "database": "signet_mdm"
    },
    redshift_tmp_dir=args["TempDir"]
)

job.commit()
Processed and consolidated over 150 million customer and transaction records across all brand systems.
Achieved zero data loss through comprehensive validation framework and audit trails.
Reduced Redshift cluster costs through query optimization and incremental processing, contributing to a 35% overall cost reduction.
Achieved 99.7% accuracy in duplicate detection with minimal false positives.
Fully automated integration with Salesforce using AWS AppFlow and Bulk API 2.0.
Delivered operational pipeline with monitoring, error handling, and comprehensive documentation.
Everyone celebrates the "zero data loss" metric. Let me tell you what that actually meant. Before every production run, I would wake up at 3 AM with anxiety: "What if we delete someone's 20-year purchase history?" The fear was real because the consequences were real. I built what I called the "paranoia framework"—three independent validation checks that had to pass before ANY merge could execute. Check 1: Pre/post record counts must reconcile to within 0.001%. Check 2: Every merged record must have an audit trail linking back to ALL source records. Check 3: A random sample of 10,000 merges reviewed weekly for accuracy. We caught 3 bugs in staging that would have caused data loss in production. The paranoia framework turned my 3 AM anxiety into 3 AM confidence. Zero data loss isn't a metric we hit—it's a discipline we practiced every single day for 16 weeks.
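A simplified sketch of how those three checks might gate a release; the function signature, audit-trail shape, and field names are illustrative, not the production framework:
import random

RECONCILIATION_TOLERANCE = 0.00001   # 0.001% allowed drift (check 1)
WEEKLY_SAMPLE_SIZE = 10_000          # merges sampled for human review (check 3)

def validate_merge_batch(pre_count: int, post_count: int, audit_rows: list) -> dict:
    """Pre-release checks from the validation framework described above.
    'audit_rows' is an illustrative audit-trail shape: one dict per absorbed
    record, with 'surviving_id' and 'source_id' keys."""
    absorbed = {row["source_id"] for row in audit_rows}

    # Check 1: pre/post counts must reconcile (every absorbed record accounted for).
    expected_post = pre_count - len(absorbed)
    drift = abs(post_count - expected_post) / max(pre_count, 1)
    counts_ok = drift <= RECONCILIATION_TOLERANCE

    # Check 2: the audit trail must explain the entire difference between pre and post.
    audit_ok = len(absorbed) == (pre_count - post_count)

    # Check 3: sample merged records for the weekly manual accuracy review.
    surviving = sorted({row["surviving_id"] for row in audit_rows})
    review_sample = random.sample(surviving, min(WEEKLY_SAMPLE_SIZE, len(surviving)))

    return {
        "counts_ok": counts_ok,
        "audit_ok": audit_ok,
        "review_sample_ids": review_sample,
        "release_approved": counts_ok and audit_ok,
    }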
Explore the comprehensive data quality metrics and deduplication results through this interactive dashboard mockup. The visualization showcases record volumes, match accuracy, processing performance, and data quality improvements across all brand systems.
Visualizes the 150M+ records processed across Kay, Zales, and Jared brands with before/after deduplication counts.
Displays match type distribution (email exact, phone exact, fuzzy name) and the confidence score distributions behind the 99.7% match accuracy.
Shows pipeline execution times, record throughput, and cost optimization trends demonstrating 35% cost reduction.
Tracks completeness, consistency, and accuracy improvements with validation metrics and zero data loss verification.
Monitors CRM integration health with success rates, batch processing volumes, and error tracking via AWS AppFlow.
Time-series views showing deduplication progress, daily processing volumes, and ongoing data quality maintenance.
Note: This dashboard was built using Tableau Public and provides real-time visibility into the deduplication pipeline's performance. Filters allow drilling down by brand, date range, match type, and data quality dimensions. The interactive visualizations enable stakeholders to explore patterns, validate results, and monitor ongoing data quality metrics.
I'll be honest: I built this dashboard defensively. Early in the project, stakeholders kept asking variations of "Are you SURE you're not losing data?" Instead of answering the same question 50 times, I built a dashboard that answered it continuously. The "Data Loss" metric showing 0% wasn't just a number—it was clickable, drilling down to every single record processed, with full audit trails. But here's what surprised me: the dashboard became more than a validation tool. Executives started using it in board meetings. The CFO loved the cost reduction trend line. The CMO obsessed over the brand distribution charts. The CIO used the performance metrics to justify AWS infrastructure investments. A defensive dashboard became an executive communication tool. Lesson learned: visibility builds trust. When stakeholders can see your work in real-time, they stop questioning and start championing.
Different naming conventions, formats, and data quality standards across brands.
Implemented comprehensive data normalization layer with standardized parsing rules, validation logic, and quality scoring. Created brand-specific transformation rules while maintaining common standards.
Processing 150M+ records efficiently within acceptable time and cost constraints.
Optimized Redshift distribution and sort keys, implemented incremental processing, and used parallel Glue jobs. Reduced processing time by 60% through query optimization.
Avoiding incorrect matches while catching true duplicates with fuzzy matching.
Developed weighted scoring system with configurable thresholds. Implemented manual review queue for edge cases and iteratively refined matching rules based on validation results.
Coordinating updates to production CRM without disrupting business operations.
Implemented phased rollout with sandbox testing, created rollback procedures, and coordinated with CRM team on timing. Used external IDs and upsert operations to maintain data integrity.
If I could distill 16 weeks of intense work into one insight, it's this: the technical work is 30% of a successful data project. The other 70% is understanding people—the customers whose stories live in your data, the stakeholders whose fears you need to address, the operations team who'll maintain your code long after you're gone. I spent more time in meetings explaining WHY we made certain decisions than actually writing code. I wrote more documentation than SQL. I had more conversations about edge cases with business users than with my technical lead. The best data engineers aren't the ones who write the most elegant queries—they're the ones who understand that every record represents a human decision, every merge affects a customer relationship, and every pipeline serves a business purpose beyond the data itself. Signet taught me to see people in the patterns.
This project demonstrated the successful implementation of a large-scale data deduplication initiative, processing 150M+ records with zero data loss while maintaining high accuracy. The solution combined advanced matching algorithms, cloud-native AWS architecture, and automated Salesforce integration to deliver a production-ready, scalable system.
The project showcased expertise in data engineering, cloud architecture, ETL pipeline development, and CRM integration while delivering measurable business value through improved data quality, cost optimization, and enhanced customer insights.
If I could go back to day one, I'd tell myself three things. First: the "boring" decisions matter most. Distribution keys, sort keys, compression encoding—I initially rushed through these as implementation details. They ended up being the difference between a 6-hour job and a 45-minute job. Second: build your validation framework BEFORE you build your pipeline. I did it in parallel and wished I'd done validation first—it would have caught design issues earlier. Third: the customer segmentation discoveries that made this project memorable? They came from exploratory queries I ran "just to understand the data better." Budget time for curiosity. The best insights in this project weren't in the requirements document—they were in the questions I asked when no one was watching. Data tells stories. Make time to listen.