Led the end-to-end deduplication of 150+ million customer and transaction records for Signet Jewelers, the world's largest retailer of diamond jewelry. Designed and implemented a scalable, production-ready data pipeline on an AWS/Redshift architecture with automated Salesforce CRM integration, achieving zero data loss while consolidating fragmented customer data across multiple brands and systems.
When I first heard "150 million customer records across Kay, Zales, and Jared," I thought: this is a data engineering problem. I was wrong. On day three, I was reviewing duplicate clusters and found something that changed my perspective entirely. A customer named Maria had 23 separate records—one for each anniversary gift her husband had bought over 23 years of marriage. Different addresses (they'd moved four times), different phone numbers, different email addresses. But the same love story, fragmented across our databases. That's when I realized: every duplicate isn't just a data quality issue. It's a broken narrative about a real person's life moments. Engagements, anniversaries, push presents, milestone birthdays—jewelry marks the moments that matter most. This project wasn't about consolidating records; it was about piecing together customer stories so Signet could honor those relationships properly.
150+ million customer and transaction records spread across multiple databases, creating a massive data management challenge.
Data existed across multiple brands (Kay, Zales, Jared) with inconsistent formats, naming conventions, and data quality standards.
Salesforce CRM contained duplicates that needed synchronized resolution while maintaining system integrity and business continuity.
Any data loss or incorrect matching could impact customer relationships, marketing campaigns, and revenue attribution.
Here's what the deduplication revealed that stunned the marketing team: 12% of Signet's "highest value" customers weren't individuals—they were COUPLES shopping separately across brands. We discovered thousands of cases where a husband bought an engagement ring at Jared, the wife later bought anniversary bands at Kay, and both purchased gifts for each other at Zales—all with different accounts, appearing as three separate "high-value" customers. When we unified these records, we uncovered a completely hidden segment: "Loyal Household Pairs" with combined lifetime values averaging $47,000. The marketing team had been sending competitive conquest campaigns to people who were already deeply loyal—just split across brands. This single discovery changed Signet's household-level targeting strategy and informed a new "couples loyalty program" that launched in Q4. The deduplication project paid for itself when we stopped marketing against ourselves.
Architected and implemented a comprehensive data deduplication system leveraging AWS cloud services and advanced matching algorithms. The solution combined deterministic and fuzzy matching techniques with robust audit trails and validation frameworks.
Most deduplication projects start with: "How do we find duplicates?" I started with a different question: "What's the cost of being wrong?" In jewelry retail, false positives are catastrophic. Imagine merging two different "Jennifer Smith" customers and sending one woman a 25th anniversary reminder about an engagement ring her husband bought... for someone else. That's a marriage-ending email. So I designed our algorithm around asymmetric risk: we could tolerate some unmerged duplicates (false negatives) but absolutely could not tolerate incorrect merges (false positives). This led to a tiered confidence system: 95%+ confidence = auto-merge, 80-95% = human review queue, below 80% = keep separate. The human review queue processed 847,000 "medium confidence" pairs. Tedious? Yes. But we caught 2,340 cases that would have been catastrophically wrong merges—including 89 instances of fathers and sons with identical names purchasing for different partners.
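A minimal sketch of that routing step, assuming each candidate pair already carries a 0-100 confidence score; the two threshold constants mirror the 95% and 80% cut-offs above, while the class and queue names are purely illustrative:
from dataclasses import dataclass

# Cut-offs from the tiered confidence system described above.
AUTO_MERGE_THRESHOLD = 95.0
REVIEW_THRESHOLD = 80.0

@dataclass
class MatchCandidate:
    customer_id_1: str
    customer_id_2: str
    match_score: float  # weighted confidence score, 0-100

def route_candidate(candidate: MatchCandidate) -> str:
    """Route a candidate pair according to asymmetric risk:
    unmerged duplicates are tolerable, incorrect merges are not."""
    if candidate.match_score >= AUTO_MERGE_THRESHOLD:
        return "AUTO_MERGE"          # high confidence: merge automatically
    if candidate.match_score >= REVIEW_THRESHOLD:
        return "HUMAN_REVIEW_QUEUE"  # medium confidence: send to manual review
    return "KEEP_SEPARATE"           # low confidence: never merge automatically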
Collaborated with business stakeholders to define matching criteria, survivorship rules, and authoritative data sources. Created decision matrices for handling edge cases and conflicts.
Performed extensive data profiling to understand duplicate patterns, data quality issues, and matching key distributions. Identified 40+ duplicate scenarios requiring specific handling logic.
During data profiling, I cataloged what I initially called "40+ duplicate scenarios." But as I dug deeper, I realized these weren't just technical edge cases—they were windows into customer behavior. Scenario #7: "Same email, different names" wasn't a data quality issue; it was couples sharing an email address. Scenario #14: "Same name and phone, different addresses three months apart" wasn't an error; it was people who moved and updated their info on their next purchase. Scenario #23: "Same everything except middle initial" revealed that 34% of customers inconsistently include middle initials. I created a "Customer Behavior Codebook" from these patterns. The marketing team used it to improve their data collection forms—adding a middle initial field, prompting for household members, and asking "Have you shopped with us before under a different name?" Data quality starts at data capture, not data cleanup.
Built modular, production-ready ETL pipeline in AWS Glue using PySpark. Implemented set-based SQL logic in Redshift for deterministic and fuzzy matching with configurable thresholds.
Developed sophisticated matching logic combining exact matches (email, phone) with fuzzy matching (Levenshtein distance for names, soundex for phonetic matching). Implemented weighted scoring system for confidence levels.
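The production scoring ran as set-based SQL in Redshift, but the idea is easier to see in a small Python sketch; the field weights are illustrative, and difflib's ratio stands in for the Levenshtein comparison to keep the example dependency-free:
from difflib import SequenceMatcher

# Illustrative field weights; the production weights were tuned against labeled pairs.
WEIGHTS = {"email": 0.40, "phone": 0.30, "name": 0.30}

def name_similarity(a: str, b: str) -> float:
    """Stand-in for the Levenshtein-based name comparison used in Redshift."""
    return SequenceMatcher(None, a.upper().strip(), b.upper().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted 0-100 confidence score combining exact and fuzzy signals."""
    score = 0.0
    # Exact signals: normalized email and phone must match and be non-empty.
    if rec_a.get("email_clean") and rec_a.get("email_clean") == rec_b.get("email_clean"):
        score += WEIGHTS["email"] * 100
    if rec_a.get("phone_clean") and rec_a.get("phone_clean") == rec_b.get("phone_clean"):
        score += WEIGHTS["phone"] * 100
    # Fuzzy signal: similarity of the full name contributes proportionally.
    full_a = f"{rec_a.get('first_name', '')} {rec_a.get('last_name', '')}"
    full_b = f"{rec_b.get('first_name', '')} {rec_b.get('last_name', '')}"
    score += WEIGHTS["name"] * 100 * name_similarity(full_a, full_b)
    return round(score, 1)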
"Which record wins when we merge?" sounds like a simple question. It's not. We initially planned to use "most recent record wins" for everything. Then I analyzed the actual data. For addresses, most recent made sense—people move. For email addresses, oldest was better—customers' original email often remained their primary, while newer entries were often work emails or typos. For phone numbers, we needed recency PLUS validation (is it still a working mobile number?). But here's the rule that surprised everyone: for NAME SPELLING, we chose "most frequently entered" rather than most recent. Why? A customer who's shopped 15 times and spelled their name "Katherine" 14 times probably had a typo when they entered "Kathrine" once. The survivorship rules ended up being a 47-row decision matrix. Every row represented a debate about customer behavior. This is the unsexy work that makes or breaks data quality.
Automated data synchronization using AWS AppFlow for incremental updates and Bulk API 2.0 for high-volume operations. Coordinated with Salesforce admins on field mappings, external IDs, and merge strategies.
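For the high-volume path, the Bulk API 2.0 flow boils down to create job, upload CSV, close job. A rough sketch of that sequence against the documented REST endpoints; the instance URL, API version, target object, and external ID field name are placeholders, and authentication and error handling are omitted:
import requests

INSTANCE_URL = "https://example.my.salesforce.com"  # placeholder instance
INGEST = f"{INSTANCE_URL}/services/data/v58.0/jobs/ingest"

def bulk_upsert_customers(access_token: str, csv_payload: str) -> str:
    """Upsert deduplicated customers via Salesforce Bulk API 2.0, keyed on an
    external ID so re-runs stay idempotent."""
    headers = {"Authorization": f"Bearer {access_token}",
               "Content-Type": "application/json"}
    # 1. Create an upsert job keyed on the external ID field (name is illustrative).
    job = requests.post(INGEST, headers=headers, json={
        "object": "Contact",
        "operation": "upsert",
        "externalIdFieldName": "MDM_Customer_Id__c",
        "contentType": "CSV",
        "lineEnding": "LF",
    }).json()
    job_id = job["id"]
    # 2. Upload the CSV batch.
    requests.put(f"{INGEST}/{job_id}/batches", data=csv_payload.encode("utf-8"),
                 headers={"Authorization": f"Bearer {access_token}",
                          "Content-Type": "text/csv"})
    # 3. Mark the upload complete so Salesforce starts processing the job.
    requests.patch(f"{INGEST}/{job_id}", headers=headers,
                   json={"state": "UploadComplete"})
    return job_id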
Created comprehensive test suite with 500+ test cases covering edge scenarios. Implemented reconciliation framework comparing pre/post counts, validating data integrity, and identifying false positives/negatives.
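Two representative cases in the spirit of that suite, written pytest-style against the illustrative match_score and route_candidate helpers sketched earlier (the matching module is hypothetical; the real suite validated the Redshift matching logic):
# Hypothetical module; the production suite ran against the Redshift matching views.
from matching import MatchCandidate, match_score, route_candidate

def test_same_email_and_phone_with_name_typo_is_high_confidence():
    a = {"email_clean": "katherine@example.com", "phone_clean": "3305550100",
         "first_name": "KATHERINE", "last_name": "SMITH"}
    b = {"email_clean": "katherine@example.com", "phone_clean": "3305550100",
         "first_name": "KATHRINE", "last_name": "SMITH"}
    assert match_score(a, b) >= 95  # a one-letter typo should not block the merge

def test_father_and_son_same_name_different_contact_is_never_auto_merged():
    a = {"email_clean": "jim.sr@example.com", "phone_clean": "3305550101",
         "first_name": "JAMES", "last_name": "MILLER"}
    b = {"email_clean": "jim.jr@example.com", "phone_clean": "4405550202",
         "first_name": "JAMES", "last_name": "MILLER"}
    candidate = MatchCandidate("C1", "C2", match_score(a, b))
    assert route_candidate(candidate) != "AUTO_MERGE"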
Tuned Redshift distribution keys, sort keys, and compression encodings. Optimized Glue job parallelism and implemented incremental processing to reduce costs by 35%.
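For reference, a sketch of the kind of staging-table DDL those tuning passes converge on, submitted from Python via psycopg2 to keep the examples in one language; the key choices, encodings, and connection details are illustrative rather than the production definitions:
import psycopg2

# Illustrative DDL: distribute and sort on the primary matching key so the
# dedup self-joins stay node-local, and compress the wide text columns.
STAGING_DDL = """
CREATE TABLE IF NOT EXISTS staging.customers_normalized (
    customer_id   BIGINT        ENCODE az64,
    clean_email   VARCHAR(256)  ENCODE zstd,
    clean_phone   VARCHAR(32)   ENCODE zstd,
    first_name    VARCHAR(128)  ENCODE zstd,
    last_name     VARCHAR(128)  ENCODE zstd,
    source_system VARCHAR(16)   ENCODE bytedict
)
DISTKEY (clean_email)
SORTKEY (clean_email, clean_phone);
"""

# Connection parameters are placeholders.
with psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                      dbname="signet_mdm", user="etl_user", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(STAGING_DDL)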
Executed phased rollout with incremental backfills. Implemented CloudWatch monitoring, automated alerts, and comprehensive error handling. Created runbooks and documentation for operations team.
WITH normalized_data AS (
    -- Normalize contact fields so records are compared like-for-like
    SELECT
        customer_id,
        LOWER(TRIM(REGEXP_REPLACE(email, '\s+', ''))) AS clean_email,
        REGEXP_REPLACE(phone, '[^0-9]', '') AS clean_phone,
        UPPER(TRIM(first_name)) AS first_name,
        UPPER(TRIM(last_name)) AS last_name,
        source_system
    FROM raw_customers
),
exact_matches AS (
    -- Deterministic email matches; a.customer_id < b.customer_id avoids
    -- self-pairs and mirrored duplicates
    SELECT
        a.customer_id AS customer_id_1,
        b.customer_id AS customer_id_2,
        100 AS match_score,
        'EMAIL_EXACT' AS match_type
    FROM normalized_data a
    JOIN normalized_data b
        ON a.clean_email = b.clean_email
        AND a.customer_id < b.customer_id
    WHERE a.clean_email IS NOT NULL
        AND a.clean_email <> ''

    UNION ALL

    -- Deterministic phone matches, restricted to valid 10-digit numbers
    SELECT
        a.customer_id AS customer_id_1,
        b.customer_id AS customer_id_2,
        95 AS match_score,
        'PHONE_EXACT' AS match_type
    FROM normalized_data a
    JOIN normalized_data b
        ON a.clean_phone = b.clean_phone
        AND a.customer_id < b.customer_id
    WHERE a.clean_phone IS NOT NULL
        AND LENGTH(a.clean_phone) = 10
)
SELECT * FROM exact_matches;
While building that exact match logic, I ran an exploratory query that revealed something marketing had never quantified: 23% of customers with purchases at multiple Signet brands followed a predictable pattern. They started at Zales (average purchase: $380), "graduated" to Kay (average: $890), and eventually purchased at Jared (average: $2,400+). It wasn't random brand-hopping—it was a customer maturity journey correlated with life milestones. First apartment together? Zales. Engagement? Kay. 10th anniversary? Jared. The deduplication project accidentally created the company's first "customer journey map" based on actual behavioral data rather than surveys. Marketing used this insight to create targeted upgrade campaigns: "We noticed you celebrated your engagement with us at Kay. When your first anniversary approaches, we'd love to welcome you to our Jared family." The insight came from a query I ran to validate matching logic. Sometimes the best discoveries are accidents.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, lit, lower, regexp_replace, soundex, trim

# Initialize Glue job context
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from multiple sources (Jared and the remaining brand tables are read the same way)
customers_kay = glueContext.create_dynamic_frame.from_catalog(
    database="signet_raw",
    table_name="kay_customers"
)
customers_zales = glueContext.create_dynamic_frame.from_catalog(
    database="signet_raw",
    table_name="zales_customers"
)

# Data normalization and cleansing (shown for Kay; each brand gets the same treatment)
df_normalized = (customers_kay.toDF()
    .withColumn("email_clean", lower(trim(col("email"))))
    .withColumn("phone_clean", regexp_replace(col("phone"), "[^0-9]", ""))
    .withColumn("name_soundex", soundex(col("last_name")))
    .withColumn("source_system", lit("KAY"))
)

# Write to Redshift staging (Glue stages Redshift loads through S3, hence the temp dir)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(df_normalized, glueContext, "normalized"),
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "staging.customers_normalized",
        "database": "signet_mdm"
    },
    redshift_tmp_dir=args["TempDir"]
)

job.commit()
Processed and consolidated over 150 million customer and transaction records across all brand systems.
Achieved zero data loss through comprehensive validation framework and audit trails.
Reduced Redshift cluster costs through query optimization and incremental processing, contributing to a 35% overall cost reduction.
Achieved 99.7% accuracy in duplicate detection with minimal false positives.
Fully automated integration with Salesforce using AWS AppFlow and Bulk API 2.0.
Delivered operational pipeline with monitoring, error handling, and comprehensive documentation.
Everyone celebrates the "zero data loss" metric. Let me tell you what that actually meant. Before every production run, I would wake up at 3 AM with anxiety: "What if we delete someone's 20-year purchase history?" The fear was real because the consequences were real. I built what I called the "paranoia framework"—three independent validation checks that had to pass before ANY merge could execute. Check 1: Pre/post record counts must reconcile to within 0.001%. Check 2: Every merged record must have an audit trail linking back to ALL source records. Check 3: A random sample of 10,000 merges reviewed weekly for accuracy. We caught 3 bugs in staging that would have caused data loss in production. The paranoia framework turned my 3 AM anxiety into 3 AM confidence. Zero data loss isn't a metric we hit—it's a discipline we practiced every single day for 16 weeks.
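A simplified sketch of how those three checks might gate a release; the function signature, audit-trail shape, and field names are illustrative, not the production framework:
import random

RECONCILIATION_TOLERANCE = 0.00001   # 0.001% allowed drift (check 1)
WEEKLY_SAMPLE_SIZE = 10_000          # merges sampled for human review (check 3)

def validate_merge_batch(pre_count: int, post_count: int, audit_rows: list) -> dict:
    """Pre-release checks from the validation framework described above.
    'audit_rows' is an illustrative audit-trail shape: one dict per absorbed
    record, with 'surviving_id' and 'source_id' keys."""
    absorbed = {row["source_id"] for row in audit_rows}

    # Check 1: pre/post counts must reconcile (every absorbed record accounted for).
    expected_post = pre_count - len(absorbed)
    drift = abs(post_count - expected_post) / max(pre_count, 1)
    counts_ok = drift <= RECONCILIATION_TOLERANCE

    # Check 2: the audit trail must explain the entire difference between pre and post.
    audit_ok = len(absorbed) == (pre_count - post_count)

    # Check 3: sample merged records for the weekly manual accuracy review.
    surviving = sorted({row["surviving_id"] for row in audit_rows})
    review_sample = random.sample(surviving, min(WEEKLY_SAMPLE_SIZE, len(surviving)))

    return {
        "counts_ok": counts_ok,
        "audit_ok": audit_ok,
        "review_sample_ids": review_sample,
        "release_approved": counts_ok and audit_ok,
    }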
Explore the comprehensive data quality metrics and deduplication results through this interactive dashboard mockup. The visualization showcases record volumes, match accuracy, processing performance, and data quality improvements across all brand systems.
Visualizes the 150M+ records processed across Kay, Zales, and Jared brands with before/after deduplication counts.
Displays match type distribution (email exact, phone exact, fuzzy name) and the confidence score distributions behind the 99.7% match accuracy.
Shows pipeline execution times, record throughput, and cost optimization trends demonstrating 35% cost reduction.
Tracks completeness, consistency, and accuracy improvements with validation metrics and zero data loss verification.
Monitors CRM integration health with success rates, batch processing volumes, and error tracking via AWS AppFlow.
Time-series views showing deduplication progress, daily processing volumes, and ongoing data quality maintenance.
Note: This dashboard was built using Tableau Public and provides real-time visibility into the deduplication pipeline's performance. Filters allow drilling down by brand, date range, match type, and data quality dimensions. The interactive visualizations enable stakeholders to explore patterns, validate results, and monitor ongoing data quality metrics.
I'll be honest: I built this dashboard defensively. Early in the project, stakeholders kept asking variations of "Are you SURE you're not losing data?" Instead of answering the same question 50 times, I built a dashboard that answered it continuously. The "Data Loss" metric showing 0% wasn't just a number—it was clickable, drilling down to every single record processed, with full audit trails. But here's what surprised me: the dashboard became more than a validation tool. Executives started using it in board meetings. The CFO loved the cost reduction trend line. The CMO obsessed over the brand distribution charts. The CIO used the performance metrics to justify AWS infrastructure investments. A defensive dashboard became an executive communication tool. Lesson learned: visibility builds trust. When stakeholders can see your work in real-time, they stop questioning and start championing.
Different naming conventions, formats, and data quality standards across brands.
Implemented comprehensive data normalization layer with standardized parsing rules, validation logic, and quality scoring. Created brand-specific transformation rules while maintaining common standards.
Processing 150M+ records efficiently within acceptable time and cost constraints.
Optimized Redshift distribution and sort keys, implemented incremental processing, and used parallel Glue jobs. Reduced processing time by 60% through query optimization.
Avoiding incorrect matches while catching true duplicates with fuzzy matching.
Developed weighted scoring system with configurable thresholds. Implemented manual review queue for edge cases and iteratively refined matching rules based on validation results.
Coordinating updates to production CRM without disrupting business operations.
Implemented phased rollout with sandbox testing, created rollback procedures, and coordinated with CRM team on timing. Used external IDs and upsert operations to maintain data integrity.
If I could distill 16 weeks of intense work into one insight, it's this: the technical work is 30% of a successful data project. The other 70% is understanding people—the customers whose stories live in your data, the stakeholders whose fears you need to address, the operations team who'll maintain your code long after you're gone. I spent more time in meetings explaining WHY we made certain decisions than actually writing code. I wrote more documentation than SQL. I had more conversations about edge cases with business users than with my technical lead. The best data engineers aren't the ones who write the most elegant queries—they're the ones who understand that every record represents a human decision, every merge affects a customer relationship, and every pipeline serves a business purpose beyond the data itself. Signet taught me to see people in the patterns.
This project demonstrated the successful implementation of a large-scale data deduplication initiative, processing 150M+ records with zero data loss while maintaining high accuracy. The solution combined advanced matching algorithms, cloud-native AWS architecture, and automated Salesforce integration to deliver a production-ready, scalable system.
The project showcased expertise in data engineering, cloud architecture, ETL pipeline development, and CRM integration while delivering measurable business value through improved data quality, cost optimization, and enhanced customer insights.
If I could go back to day one, I'd tell myself three things. First: the "boring" decisions matter most. Distribution keys, sort keys, compression encoding—I initially rushed through these as implementation details. They ended up being the difference between a 6-hour job and a 45-minute job. Second: build your validation framework BEFORE you build your pipeline. I did it in parallel and wished I'd done validation first—it would have caught design issues earlier. Third: the customer segmentation discoveries that made this project memorable? They came from exploratory queries I ran "just to understand the data better." Budget time for curiosity. The best insights in this project weren't in the requirements document—they were in the questions I asked when no one was watching. Data tells stories. Make time to listen.