Executive Summary
Built a repeatable experimentation and product analytics platform enabling rapid A/B testing, feature flag analytics, and funnel measurement across web and mobile products. Reduced time-to-insight from days to hours and increased win rate for launches by 42%.
Why A/B Testing Changed How I Think About Data
Before diving into experimentation platforms, I was a "trust the data" analyst. I assumed numbers told the whole story. My first week managing A/B tests shattered that illusion. We ran a checkout flow experiment that showed a 12% conversion lift with 99% statistical significance, so we shipped it. Two weeks later, revenue dropped. Why? The variant attracted more discount-seekers who converted once but never returned. I learned the hard way: A/B testing isn't about finding "winners"; it's about understanding user behavior holistically. Now I never declare a winner without analyzing downstream metrics: retention, LTV, support tickets, and segment-specific impacts. That painful lesson became the foundation for everything I built at Syilum.
Scope
Platform design, instrumentation, experimentation framework, dashboards, and automated reporting for product teams (2016-2022).
Users
100M+ events collected; 30+ concurrent experiments; 12 product teams onboarded.
Impact
42% higher launch win rate; 15% improvement in retention for tested features; 20% increase in experiment velocity.
Business Challenge
- Ad-hoc instrumentation causing inconsistent metrics across teams
- Lengthy experiment setup and analysis (weeks)
- No single source of truth for experiment events and cohort definitions
- Lack of integrated reporting for business and product stakeholders
The "Same Event, Different Meaning" Problem
The inconsistency issue was worse than leadership realized. When I audited our event tracking, I discovered that "purchase_complete" meant different things to different teams. For the mobile team, it fired when the order confirmation screen loaded. For web, it fired after payment processing succeeded. For the iOS team, it included in-app purchases but not subscriptions. We were comparing apples to oranges across every experiment. I spent two weeks just mapping out every event definition across 15 products, creating what I called the "Event Truth Document." It was 47 pages of discrepancies. That document became the basis for our standardized event schema, and I made it a rule: no new event gets added without a formal definition review. Boring? Yes. Essential? Absolutely.
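To show what a formal definition review actually pins down, here is a minimal sketch of a single catalog entry as a Python record. The field names (trigger, owner, platforms, required_properties) and the example values are illustrative assumptions, not the exact schema we standardized on.
Python: Event Catalog Entry (illustrative sketch)
# Minimal sketch of one standardized event definition; field names are
# illustrative, not the production schema.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EventDefinition:
    name: str                          # canonical event name
    trigger: str                       # the precise moment the event must fire
    owner: str                         # team accountable for the definition
    platforms: Tuple[str, ...]         # where the event is emitted
    required_properties: Tuple[str, ...] = ()


# One agreed definition replaces the per-team interpretations of "purchase_complete".
PURCHASE_COMPLETE = EventDefinition(
    name="purchase_complete",
    trigger="payment processor returns a successful charge",
    owner="commerce-platform",
    platforms=("web", "ios", "android"),
    required_properties=("order_id", "revenue", "currency"),
)
print(PURCHASE_COMPLETE)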
Product Analytics Dashboard
Real-time A/B testing performance, experiment health monitoring, funnel conversion analysis, and cohort retention tracking across web and mobile platforms.
Top Performing Experiments (Last 30 Days)
The Checkout Flow Surprise
That 8.2% checkout flow lift? It almost didn't happen. The original hypothesis was to simplify checkout by removing the "review order" step entirely. Initial tests showed AMAZING results: conversion jumped 15%! We were ready to ship. But I insisted on running the test for a full 4 weeks instead of the standard 2. Here's what happened: in weeks 1-2, conversion soared; in weeks 3-4, it dropped. Returns and support tickets spiked. Turns out, users were accidentally ordering the wrong quantities without the review step. The final winning design? We kept the review step but redesigned it as a visual confirmation with editable quantities inline. Sometimes "less friction" creates more problems downstream. Always measure the full customer journey.
Conversion Funnel: Control vs Treatment
The Funnel Numbers Lied (Until We Segmented)
Those beautiful funnel numbers you see? They're aggregates, and aggregates can deceive. When I first analyzed the treatment vs. control funnel, the lift looked uniform across all stages. But something felt off. The onboarding completion jump from 28.1% to 34.5% seemed too good to be true for a UI change. I decided to segment by user acquisition source. Desktop organic users? 18% lift. Mobile paid users? 4% lift. The "winning" variant worked great for users who found us naturally but barely moved the needle for paid acquisition. We ended up implementing the variant only for organic traffic while running additional tests for paid users. If I'd shipped globally based on aggregate data, we would have wasted ad spend targeting users who didn't respond to the change. Segmentation isn't optional; it's where the real insights hide.
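For readers who want the mechanics, here is a minimal sketch of that segment check in Python, using made-up user-level data. The column names (acquisition_source, variant, converted) and the rates in the example are illustrative, not the experiment's actual numbers.
Python: Segment-Level Lift (illustrative sketch)
# Minimal sketch of a per-segment lift check on user-level results.
# The data below is fabricated purely to make the example runnable.
import pandas as pd

results = pd.DataFrame({
    "acquisition_source": ["organic"] * 200 + ["paid"] * 200,
    "variant": (["control"] * 100 + ["treatment"] * 100) * 2,
    "converted": [0] * 70 + [1] * 30      # organic / control
               + [0] * 65 + [1] * 35      # organic / treatment
               + [0] * 80 + [1] * 20      # paid / control
               + [0] * 79 + [1] * 21,     # paid / treatment
})

# Conversion rate per segment and variant, then relative lift of treatment.
rates = (
    results.groupby(["acquisition_source", "variant"])["converted"]
           .mean()
           .unstack("variant")
)
rates["lift_pct"] = (rates["treatment"] / rates["control"] - 1) * 100
print(rates.round(3))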
30-Day Cohort Retention by Variant
Why Day 30 Retention Became My North Star
I obsess over Day 30 retention because of one experiment that taught me a painful lesson. Early in my Syilum days, we shipped a gamification feature that crushed Day 1 and Day 7 retention metrics. Leadership celebrated. I got praised. Three months later, we noticed something disturbing: users who experienced the gamification had 40% lower 90-day LTV. The dopamine-hit mechanics we'd added created short-term engagement but trained users to expect rewards instead of finding genuine value. They churned hard once the novelty wore off. Since then, I won't call any experiment a "win" until Day 30 data comes in. The 28.6% lift shown here? We tracked those cohorts for 90 days to ensure the retention curve held. It did, because the treatment focused on feature discovery, not manipulation tactics.
Experiment Velocity & Win Rate
Architecture & Data Flow
Why We Built From Scratch (Mostly)
"Why not just use Amplitude or Mixpanel?" I heard this question constantly. Here's the honest answer: for most companies, you absolutely should use off-the-shelf tools. But Syilum had unique constraints that made custom development worth the investment. First, data residencyâsome of our products served enterprise clients with strict data sovereignty requirements. Second, volumeâat 100M+ events daily across 12 products, SaaS pricing became prohibitive. Third, and most importantly, integration depthâwe needed experiment data to flow directly into our ML pipeline for personalization. Commercial tools treated experimentation as isolated, but our competitive advantage came from feeding experiment results into real-time recommendation systems. The custom platform took 8 months to build (I'll admit: 3 months longer than estimated), but ROI came within the first year when we could do things competitors couldn'tâlike personalizing user experiences based on their historical experiment responses.
Implementation Highlights
- Standardized event schema: defined core event model and common properties (user_id, session_id, experiment_id, variant, timestamp, platform, event_name, properties)
- SDK instrumentation: lightweight JS and mobile SDK wrappers to ensure consistent payloads
- Streaming pipeline: low-latency ETL for near-real-time experiment telemetry
- Experiment service: feature flags and randomization service ensuring stable cohort assignment
- Automated analysis: SQL and Python templates to calculate uplift, p-values, and power, plus automated Slack/email reports (a minimal sketch of the core uplift calculation follows this list)
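As referenced in the last bullet, here is a minimal sketch of the uplift and p-value calculation at the heart of the automated analysis templates, written as a standalone two-proportion z-test. The function name and the example counts are illustrative, not the production template.
Python: Uplift and p-value, Two-Proportion z-Test (illustrative sketch)
# Minimal sketch of the uplift / p-value calculation: a two-proportion z-test.
from math import sqrt
from statistics import NormalDist


def ab_test_summary(control_conv, control_n, treat_conv, treat_n):
    """Return absolute lift, relative lift, and a two-sided p-value."""
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (control_conv + treat_conv) / (control_n + treat_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "absolute_lift": p_t - p_c,
        "relative_lift": (p_t - p_c) / p_c,
        "p_value": p_value,
    }


# Example with made-up counts.
print(ab_test_summary(control_conv=1200, control_n=20000,
                      treat_conv=1320, treat_n=20000))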
The Randomization Bug That Haunted Me
Implementation highlight #4, "stable cohort assignment," sounds simple. It nearly cost us three months of experiment data. Here's what happened: our initial randomization used the current timestamp to seed assignment. We thought this was fine until a mobile user's experiment variant CHANGED when they crossed a timezone boundary during a flight. Same user, suddenly in a different cohort. Our data showed impossible patterns: users completing the onboarding flow in both control AND treatment simultaneously. After a week of debugging (and more than a few late nights), we traced it to 0.3% of users who experienced variant flipping. The fix? Deterministic hashing: SHA256(user_id + experiment_id) mod 100. A user's variant is now mathematically locked forever. I created what I call the "flight test": we deliberately test experiments across timezone changes before any launch. That 0.3% bug taught me: edge cases in experimentation contaminate entire datasets.
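The deterministic-hashing fix is small enough to sketch. The hash-and-mod scheme below follows the description above (SHA-256 of user_id + experiment_id, reduced mod 100); the helper name and the 50/50 bucket split are illustrative assumptions, not the production service.
Python: Deterministic Variant Assignment (illustrative sketch)
# Minimal sketch of deterministic variant assignment via hashing.
# The 50/50 split and helper name are assumptions for illustration.
import hashlib


def assign_variant(user_id: str, experiment_id: str) -> str:
    digest = hashlib.sha256(f"{user_id}{experiment_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100    # stable bucket in [0, 100)
    return "treatment" if bucket < 50 else "control"


# Same inputs always yield the same variant, regardless of time or timezone.
a = assign_variant("user_42", "exp_checkout_flow")
b = assign_variant("user_42", "exp_checkout_flow")
assert a == b
print(a)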
Sample A/B Analysis Code
SQL: Conversion Rate per Variant
-- Conversion rate per variant: share of exposed users with at least one purchase
WITH impressions AS (
    SELECT
        user_id,
        experiment_id,
        variant,
        COUNT(DISTINCT session_id) AS sessions
    FROM analytics.events
    WHERE experiment_id = 'exp_checkout_flow'
    GROUP BY 1, 2, 3
),
conversions AS (
    SELECT
        user_id,
        experiment_id,
        variant,
        SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases
    FROM analytics.events
    WHERE experiment_id = 'exp_checkout_flow'
    GROUP BY 1, 2, 3
)
SELECT
    i.variant,
    COUNT(DISTINCT i.user_id) AS users,
    -- count each converting user once, so repeat purchasers don't inflate the rate
    COUNT(DISTINCT CASE WHEN c.purchases > 0 THEN i.user_id END)::float
        / COUNT(DISTINCT i.user_id) AS conv_rate
FROM impressions i
LEFT JOIN conversions c
    ON  i.user_id = c.user_id
    AND i.experiment_id = c.experiment_id
    AND i.variant = c.variant
GROUP BY 1
ORDER BY 1;
The SQL Template That Saved 100+ Hours
That SQL query above? It looks simple now, but I originally wrote analysis queries from scratch for each experiment. After the 20th time debugging a JOIN condition at midnight before a launch decision, I realized we needed a better system. I built a query generator that PMs could use without writing SQL. They'd fill in: experiment_id, primary metric (event name), secondary metrics, and date range. The tool generated validated SQL with proper cohort logic, statistical significance calculations, and segment breakdowns. What used to take 4-6 hours of analyst time per experiment now took 5 minutes. More importantly, it eliminated human error: no more accidental INNER JOINs that excluded users who never converted, skewing our metrics. The template library grew to 23 common analysis patterns. I estimate it saved 100+ analyst hours per quarter across the team.
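The generator itself isn't reproduced here, but the idea fits in a few lines: PM-supplied parameters are validated and then slotted into a reviewed SQL template instead of being hand-written at midnight. The template, field names, and validation rules below are a minimal illustrative sketch, not the production tool.
Python: Query Template Generator (illustrative sketch)
# Minimal sketch of a parameterized, validated SQL template for a primary metric.
import re

TEMPLATE = """
SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    COUNT(DISTINCT CASE WHEN event_name = '{metric}' THEN user_id END)::float
        / COUNT(DISTINCT user_id) AS conv_rate
FROM analytics.events
WHERE experiment_id = '{experiment_id}'
  AND timestamp >= '{start_date}' AND timestamp < '{end_date}'
GROUP BY 1
ORDER BY 1;
"""

SAFE = re.compile(r"^[A-Za-z0-9_\-]+$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def build_query(experiment_id, metric, start_date, end_date):
    """Validate inputs, then render the reviewed template."""
    if not (SAFE.match(experiment_id) and SAFE.match(metric)):
        raise ValueError("identifiers may only contain letters, digits, _ or -")
    if not (DATE.match(start_date) and DATE.match(end_date)):
        raise ValueError("dates must be YYYY-MM-DD")
    return TEMPLATE.format(experiment_id=experiment_id, metric=metric,
                           start_date=start_date, end_date=end_date)


print(build_query("exp_checkout_flow", "purchase", "2021-03-01", "2021-03-29"))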
Results
Velocity
Experiment setup time reduced from ~10 days to < 2 days.
Wins
42% increase in successful feature launches after systematic experimentation.
Retention
15% lift in 30-day retention for tested cohorts.
The Counter-Intuitive Truth About Our "42% Win Rate"
That 42% improvement in launch win rate sounds impressive, but here's the surprising part: the biggest impact came from the experiments we DIDN'T ship. Before the platform, product teams shipped features based on gut instinct and executive opinions. 68% of those "gut" launches actually hurt metrics (we know this because we retroactively analyzed pre-platform releases). With systematic experimentation, we killed 34 features that tested poorly before they reached production. Each avoided bad launch saved an average of 2.3 weeks of engineering time that would have gone into iteration and fixes. The platform didn't just help us find winners faster; it gave us permission to fail fast and redirect resources to ideas with actual data-backed potential. The cultural shift was massive: "Let's test it" replaced "My VP thinks we should."
Challenges & Lessons
- Careful metric governance and a metrics catalog reduced ambiguity and rework.
- Stable randomization and tracking IDs are critical for longitudinal experiments.
- Automated quality checks (event validation, schema registry) cut false positives in analysis; a minimal validation sketch follows this list.
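As a companion to that last point, here is a minimal sketch of what an event-validation check against a schema registry entry can look like. The registry contents and required properties are illustrative assumptions, not our actual registry.
Python: Event Validation Against a Schema Registry (illustrative sketch)
# Minimal sketch of validating an incoming event payload before analysis.
REGISTRY = {
    "purchase_complete": {"user_id", "session_id", "experiment_id", "variant",
                          "timestamp", "platform", "order_id", "revenue"},
}


def validate_event(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload passes."""
    name = payload.get("event_name")
    required = REGISTRY.get(name)
    if required is None:
        return [f"unregistered event: {name!r}"]
    missing = required - payload.keys()
    return [f"missing properties: {sorted(missing)}"] if missing else []


event = {"event_name": "purchase_complete", "user_id": "u1", "session_id": "s1",
         "experiment_id": "exp_checkout_flow", "variant": "treatment",
         "timestamp": "2021-03-05T10:00:00Z", "platform": "web"}
print(validate_event(event))   # flags the missing order_id and revenue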
The Hardest Lesson: Statistical Significance ≠ Business Significance
If I could teach one thing to every new product analyst, it's this: p-values don't measure business impact. Early on, I ran experiments with 500,000+ users where tiny changes showed "statistical significance" with p<0.001. A button color change improved click-through by 0.04%: statistically real, practically meaningless. I learned to always ask: "If this result holds at scale, what's the revenue/retention impact?" I built a "Minimum Detectable Effect" calculator into our planning process. Before any experiment launches, PMs must answer: "What's the smallest lift that would justify engineering time to implement?" If the answer is 5%, we don't run tests capable of detecting 0.04% changes. We power experiments appropriately and stop chasing statistical ghosts. Sample size planning isn't sexy, but it's the difference between actionable insights and noise dressed up as data.
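The planning step is easy to sketch: given a baseline conversion rate, a minimum detectable relative lift, a significance level, and a target power, the standard two-proportion formula gives the users needed per variant. The function below is an illustrative approximation, not our internal calculator.
Python: Sample Size per Variant for a Target MDE (illustrative sketch)
# Minimal sketch of sample-size planning for a two-proportion test.
from statistics import NormalDist


def users_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # target power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1


# Detecting a 5% relative lift on a 6% baseline needs far fewer users than
# chasing a 0.04% change, which is the point of planning up front.
print(users_per_variant(0.06, 0.05))
print(users_per_variant(0.06, 0.0004))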
My Biggest Professional Growth Moment
Eight years building this platform taught me that data analytics isn't about finding answers; it's about asking better questions. Early in my career, stakeholders would ask "Which variant won?" and I'd give them a number. Now I ask back: "Won for whom? On what timeline? Against what baseline? With what downstream effects?" The shift from "reporting analyst" to "strategic partner" happened when I stopped delivering dashboards and started delivering frameworks for decision-making. My proudest moment wasn't a fancy visualization or a complex model. It was when a PM told me, "I think differently about our product now because of how you taught me to interpret experiments." That's the real impact of good analytics: changing how organizations think, not just what they see.