📧 AI / CRM

103,000 Emails, Zero Manual Triage:
AI-Powered Email Intelligence

An email archive of 103K+ messages needed to become a searchable, AI-classified intelligence system. The challenges weren't just AI — they were database architecture. A single query pattern was taking 2.2 seconds per lookup because of a missing index on a derived value. Here's how we built it right.

103K+Emails Indexed
2.2s→<1msDomain Lookup
CoveringIndex Dashboard
GeminiAuto-Classify

The Challenge

The client had a email_maildir_ingest table with 103K+ email records — messages ingested from multiple IMAP mailboxes over time. The table was being used for both storage and as the backbone of a customer intelligence system: which contacts were emailing us, what were they asking about, and which ones needed follow-up.

The CRM dashboard was loading slowly. Individual contact lookups were taking 2+ seconds. The AI classification pipeline was processing emails but the results weren't being surfaced efficiently. Each problem had a distinct technical root cause.

The Domain Lookup Problem: 2.2 Seconds

The contact discovery feature works by finding email senders from the same domain as known contacts. For example: if a contact at acme.com is in the CRM, find all emails from other acme.com addresses in the archive.

The initial query used LIKE '%@acme.com' — a leading wildcard pattern that prevents any index from being used. MySQL has to read every row and evaluate the pattern on each one:

Before — 2.2s, full scan of 722K rows
SELECT * FROM contacts
WHERE email
  LIKE '%@acme.com'

-- EXPLAIN:
-- type: ALL
-- rows: 722,431
-- Extra: Using where
After — <1ms, index lookup
SELECT * FROM contacts
WHERE email_domain
  = 'acme.com'

-- EXPLAIN:
-- type: ref
-- rows: 3
-- Extra: Using index

Stored Generated Column

The solution is a MySQL stored generated column. Instead of computing the domain from the email address at query time, MySQL computes it once when each row is inserted or updated — and stores the result. Combined with a regular index on the generated column, lookups become instant:

MySQL — Stored generated column for email domain
-- Add the generated column (STORED = computed once, saved to disk)
ALTER TABLE email_maildir_ingest
  ADD COLUMN email_domain VARCHAR(100)
  GENERATED ALWAYS AS (
    SUBSTRING_INDEX(`from_address`, '@', -1)
  ) STORED,
  ADD INDEX idx_email_domain (email_domain)
  ALGORITHM=INPLACE LOCK=NONE;

-- The same column on contacts table
ALTER TABLE contacts
  ADD COLUMN email_domain VARCHAR(100)
  GENERATED ALWAYS AS (
    SUBSTRING_INDEX(email, '@', -1)
  ) STORED,
  ADD INDEX idx_sender_domain (email_domain)
  ALGORITHM=INPLACE LOCK=NONE;

-- Domain lookup: 2.2s → 0.0003s (one index row vs 722K full scan)

STORED vs VIRTUAL: We chose STORED (persisted to disk) over VIRTUAL (computed on read) because this column is used in an index and in JOIN operations. Stored generated columns can be indexed; virtual ones generally cannot. The storage cost is negligible — VARCHAR(100) per row on 103K rows is ~10MB.

Covering Index for Dashboard Stats

The CRM dashboard shows per-employee email statistics: total emails, unread counts, emails by label, emails by urgency. These queries run on the email_labels table, which had 92K rows for a single employee (emp_id=3).

The original query was doing a full table scan — MySQL's EXPLAIN showed Using where with 92,000 rows examined. The fix was a covering index: an index that includes all the columns the query needs, so MySQL never has to touch the actual table rows at all. EXPLAIN then shows Using index — the entire query is answered from the index B-tree:

MySQL — Covering index for dashboard stats
-- Dashboard stats query
SELECT label, COUNT(*) as cnt
FROM email_labels
WHERE emp_id = 3
GROUP BY label;

-- Before: full scan, 92K rows examined
-- EXPLAIN: type=ALL, rows=92000, Extra=Using where

-- Covering index: includes all columns the query touches
CREATE INDEX idx_labels_cover
  ON email_labels (emp_id, label)
  ALGORITHM=INPLACE LOCK=NONE;

-- After: index-only scan
-- EXPLAIN: type=ref, rows=1, Extra=Using index
BEFORE:  type=ALL   key=NULL   rows=92,431   Extra=Using where
AFTER:   type=ref    key=idx_labels_cover   rows=1   Extra=Using index

Contact Discovery

With the domain index in place, we built the contact discovery feature: when viewing an email from an unknown sender, the system automatically checks if any other senders from the same domain are in the contacts table. If yes, it suggests adding this person to the existing company record.

The query is now a simple join on two indexed email_domain columns — millisecond response time even as both tables grow:

MySQL — Contact discovery from email domain
-- Find unknown senders from same domain as existing contacts
SELECT DISTINCT
  e.from_address,
  e.from_name,
  c.company_name
FROM email_maildir_ingest e
JOIN contacts c
  ON e.email_domain = c.email_domain -- both indexed
LEFT JOIN contacts known
  ON e.from_address = known.email
WHERE known.contact_id IS NULL -- not already in CRM
  AND e.email_domain = 'acme.com';

-- Returns unknown senders from known companies instantly

Gemini AI Classification

Every new email ingested into the system is sent to Gemini for classification. The AI extracts:

  • Category: Sales inquiry, support request, invoice, spam, partnership, complaint, etc.
  • Urgency: critical / high / medium / low
  • Sentiment: positive / neutral / negative
  • Action needed: Free-text description of what response is required
  • Key entities: Order numbers, product names, dollar amounts extracted automatically

Classification results are stored in a separate email_ai_labels table linked to the original email. This allows re-classification with updated models without modifying the source data.

Body Cleanup for Display

Emails stored in the archive are raw MIME — HTML bodies with full CSS stylesheets, tracking pixels, marketing layouts. Displaying raw HTML in the CRM would be a security risk (XSS) and would look terrible. Processing it for the AI with all the markup noise would waste tokens and reduce accuracy.

The cleanup pipeline strips all CSS and HTML, converting the email to clean plain text for both display and AI processing. A DOMParser-based approach handles well-formed HTML; a regex fallback handles malformed markup from older email clients.

Results

MetricBeforeAfter
Domain lookup query2.2s (722K scan)<1ms (1 row via index)
Dashboard stats queryTable scan, 92K rowsUsing index, 1 row
Contact discoveryNot possibleInstant, both tables indexed
Email classificationManualAI-automated (Gemini)
EXPLAIN type (labels)ALLref
EXPLAIN Extra (labels)Using whereUsing index
Email body in UIRaw HTML (broken)Clean plain text

Tech Stack

Gemini 2.0 Flash API • MySQL 8.0 (InnoDB generated columns) • PHP 8.1 • IMAP • MIME parser • DOMParser • JavaScript (vanilla) • EXPLAIN ANALYZE

Sitting on a Mountain of Unread Emails?

We can build an AI triage system on your existing email archive — without disrupting your current workflow. Let's talk.

Get AI Email Quote View All Case Studies