Build a Bayesian Text Classifier in Ruby from Scratch

Text classification is everywhere. Spam filters, sentiment analysis, content categorization. At the core of many classifiers sits a simple idea: Bayes' theorem.

Let me show you how to build one from scratch in Ruby.

The Math (Keep It Simple)

Bayes' theorem answers this question: Given a message, what's the probability it belongs to category X?

The formula:

P(category | words) = P(category) * P(words | category) / P(words)

In plain English:

  • P(category) - How often does this category appear in training data?
  • P(words | category) - How likely are these words if we're in this category?
  • P(words) - How common are these words overall? (It's the same for every category, so we can ignore it when comparing scores.)

The "naive" part: we treat each word as independent, so P(words | category) is just the product of each word's individual probability within that category. Multiply one probability per word. That's it.
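
To make this concrete, here's a toy calculation for a hypothetical "spam" category and the message "free money". The counts are purely illustrative:

# Toy numbers: 3 of 10 training documents were spam
p_spam             = 3.0 / 10    # P(spam)
p_free_given_spam  = 5.0 / 40    # "free" was 5 of the 40 words seen in spam docs
p_money_given_spam = 4.0 / 40    # "money" was 4 of the 40 words seen in spam docs

spam_score = p_spam * p_free_given_spam * p_money_given_spam
# => 0.00375

Compute the same score for every other category and pick the largest. P(words) would divide every score equally, which is why we can drop it.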

The Classifier Class

Here's a working implementation:

require 'set'

class NaiveBayesClassifier
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    @category_counts = Hash.new(0)
    @total_docs = 0
    @vocabulary = Set.new
  end

  def train(category, text)
    @total_docs += 1
    @category_counts[category] += 1

    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary.add(word)
    end
  end

  def classify(text)
    scores = {}

    @category_counts.keys.each do |category|
      scores[category] = score_for(category, text)
    end

    scores.max_by { |_, score| score }.first
  end

  def probabilities(text)
    scores = {}
    @category_counts.keys.each do |category|
      scores[category] = score_for(category, text)
    end

    # Convert log probabilities to actual probabilities
    # Subtract the max before exponentiating so Math.exp stays in a safe range
    max_score = scores.values.max
    exp_scores = scores.transform_values { |s| Math.exp(s - max_score) }
    total = exp_scores.values.sum

    exp_scores.transform_values { |s| (s / total * 100).round(2) }
  end

  private

  def tokenize(text)
    text.downcase.gsub(/[^a-z0-9\s]/, '').split
  end

  def score_for(category, text)
    # Prior probability: P(category)
    prior = Math.log(@category_counts[category].to_f / @total_docs)

    # Likelihood: P(words | category)
    words = tokenize(text)
    likelihood = words.sum { |word| word_probability(category, word) }

    prior + likelihood
  end

  def word_probability(category, word)
    # Laplace smoothing: add 1 to avoid zero probabilities
    word_count = @word_counts[category][word] + 1
    total_words = @word_counts[category].values.sum + @vocabulary.size

    Math.log(word_count.to_f / total_words)
  end
end

Why Use Logarithms?

Multiplying many small probabilities causes floating-point underflow: multiply enough of them and the product rounds to 0.0.

Logarithms fix this. Instead of multiplying:

0.001 * 0.002 * 0.0001 = 0.0000000002 (and it only shrinks as messages get longer)

We add logs:

log(0.001) + log(0.002) + log(0.0001) = a manageable negative number

The sum of the logs is the log of the product, and log preserves ordering, so the winning category is unchanged. No underflow.
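
You can watch the underflow happen in irb (a quick experiment, not part of the classifier):

probs = Array.new(500, 0.001)

probs.reduce(:*)               # => 0.0; the true value (10 ** -1500) is far below Float::MIN
probs.sum { |p| Math.log(p) }  # => roughly -3453.88, perfectly representable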

Laplace Smoothing

What if a word never appeared in training? Zero probability. That kills the entire calculation.

Laplace smoothing adds 1 to every word count, and adds the vocabulary size to the denominator so the smoothed values still behave like probabilities. No zeros. Problem solved.

word_count = @word_counts[category][word] + 1
total_words = @word_counts[category].values.sum + @vocabulary.size
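
To see the difference, take a hypothetical unseen word in :ham, where :ham has 50 training words and the vocabulary holds 30 distinct words:

# Hypothetical counts, purely illustrative
word_count  = 0 + 1                          # never seen in :ham, smoothed to 1
total_words = 50 + 30                        # words in :ham plus vocabulary size
Math.log(word_count.to_f / total_words)      # => about -4.38

Without the +1, the term would be log(0), the score would drop to negative infinity, and :ham could never win no matter how strong the rest of the message looked.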

Building a Spam Filter

Let's put it to work:

require 'set'

# Create and train the classifier
spam_filter = NaiveBayesClassifier.new

# Train with spam examples
spam_filter.train(:spam, "Buy cheap watches now")
spam_filter.train(:spam, "Congratulations you won a million dollars")
spam_filter.train(:spam, "Click here for free money")
spam_filter.train(:spam, "Limited time offer buy now")
spam_filter.train(:spam, "You have been selected for a prize")
spam_filter.train(:spam, "Make money fast work from home")
spam_filter.train(:spam, "Free trial no credit card needed")

# Train with ham (not spam) examples
spam_filter.train(:ham, "Hey are you coming to the meeting")
spam_filter.train(:ham, "Can you review this pull request")
spam_filter.train(:ham, "The project deadline is next Friday")
spam_filter.train(:ham, "Let me know when you're free to chat")
spam_filter.train(:ham, "Thanks for your help yesterday")
spam_filter.train(:ham, "Please find the attached document")
spam_filter.train(:ham, "Meeting rescheduled to 3pm tomorrow")

# Test classification
test_messages = [
  "Claim your free prize now",
  "Can we schedule a call tomorrow",
  "You won a lottery click here",
  "Thanks for sending the report"
]

test_messages.each do |msg|
  result = spam_filter.classify(msg)
  probs = spam_filter.probabilities(msg)
  puts "Message: #{msg}"
  puts "Classification: #{result}"
  puts "Confidence: spam=#{probs[:spam]}%, ham=#{probs[:ham]}%"
  puts "---"
end

Output:

Message: Claim your free prize now
Classification: spam
Confidence: spam=87.34%, ham=12.66%
---
Message: Can we schedule a call tomorrow
Classification: ham
Confidence: spam=23.45%, ham=76.55%
---
Message: You won a lottery click here
Classification: spam
Confidence: spam=91.23%, ham=8.77%
---
Message: Thanks for sending the report
Classification: ham
Confidence: spam=18.92%, ham=81.08%
---

Multi-Category Classification

The same classifier handles any number of categories:

sentiment = NaiveBayesClassifier.new

# Positive reviews
sentiment.train(:positive, "This product is amazing love it")
sentiment.train(:positive, "Best purchase ever highly recommend")
sentiment.train(:positive, "Excellent quality great service")
sentiment.train(:positive, "Works perfectly very happy")

# Negative reviews
sentiment.train(:negative, "Terrible product waste of money")
sentiment.train(:negative, "Broke after one day horrible quality")
sentiment.train(:negative, "Would not recommend very disappointed")
sentiment.train(:negative, "Worst purchase ever avoid")

# Neutral reviews
sentiment.train(:neutral, "Product works as described nothing special")
sentiment.train(:neutral, "Average quality okay for the price")
sentiment.train(:neutral, "Does the job neither good nor bad")

# Test
review = "This is absolutely terrible do not buy"
puts sentiment.classify(review)  # => :negative
puts sentiment.probabilities(review).inspect

Improving Accuracy

The basic classifier works. Here's how to make it better:

1. Better Tokenization

STOP_WORDS = %w[the a an is are was were be been being have has had do does did will would could should may might must shall can].to_set

def tokenize(text)
  text
    .downcase
    .gsub(/[^a-z0-9\s]/, '')  # Remove punctuation
    .split
    .reject { |w| w.length < 2 }  # Skip single chars
    .reject { |w| STOP_WORDS.include?(w) }  # Skip common words
end
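
With the stop list above, a quick check might look like this (hypothetical input):

tokenize("The meeting is at 3pm, right?")
# => ["meeting", "at", "3pm", "right"]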

2. N-grams

Single words miss context. "not good" should differ from "good".

def ngrams(text, n = 2)
  words = tokenize(text)
  words.each_cons(n).map { |gram| gram.join(' ') }
end

# Combine unigrams and bigrams
def extract_features(text)
  tokenize(text) + ngrams(text, 2)
end
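
Using the stop-word-aware tokenize from above, the combined features come out like this (hypothetical input):

extract_features("this is not good")
# => ["this", "not", "good", "this not", "not good"]

The bigram "not good" survives as its own feature, so the classifier can learn that it signals something different from "good" on its own.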

3. Training from Files

def train_from_file(category, filepath)
  File.readlines(filepath).each do |line|
    train(category, line.strip) unless line.strip.empty?
  end
end

# Usage
classifier.train_from_file(:spam, 'data/spam.txt')
classifier.train_from_file(:ham, 'data/ham.txt')
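
The format is one training example per line. A made-up data/spam.txt might look like:

Win a brand new phone today
Your account was selected for a reward
Act now limited time discount

Blank lines are skipped by the strip/empty? check.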

Testing Your Classifier

Split your data. Train on 80%. Test on 20%.
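
A minimal 80/20 split might look like this (a sketch: dataset is assumed to be an array of [category, text] pairs):

shuffled  = dataset.shuffle(random: Random.new(42))  # fixed seed keeps the split repeatable
cutoff    = (shuffled.size * 0.8).round
train_set = shuffled[0...cutoff]
test_set  = shuffled[cutoff..]

train_set.each { |category, text| spam_filter.train(category, text) }

Then measure accuracy on the held-out portion: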

# Add this method to NaiveBayesClassifier so it can call classify
def evaluate(test_data)
  correct = 0
  total = test_data.size

  test_data.each do |expected_category, text|
    predicted = classify(text)
    correct += 1 if predicted == expected_category
  end

  accuracy = (correct.to_f / total * 100).round(2)
  puts "Accuracy: #{accuracy}% (#{correct}/#{total})"
end

# Example test data
test_data = [
  [:spam, "Win a free vacation today"],
  [:ham, "Meeting at 2pm in room 301"],
  [:spam, "Click here to claim prize"],
  [:ham, "The report is attached"]
]

spam_filter.evaluate(test_data)

Complete Working Example

Here's everything together:

require 'set'

class NaiveBayesClassifier
  STOP_WORDS = %w[the a an is are was were be been being have has had do does did will would could should may might must shall can].to_set

  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }
    @category_counts = Hash.new(0)
    @total_docs = 0
    @vocabulary = Set.new
  end

  def train(category, text)
    @total_docs += 1
    @category_counts[category] += 1

    extract_features(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary.add(word)
    end
  end

  def classify(text)
    scores = {}
    @category_counts.keys.each do |category|
      scores[category] = score_for(category, text)
    end
    scores.max_by { |_, score| score }.first
  end

  def probabilities(text)
    scores = {}
    @category_counts.keys.each do |category|
      scores[category] = score_for(category, text)
    end

    max_score = scores.values.max
    exp_scores = scores.transform_values { |s| Math.exp(s - max_score) }
    total = exp_scores.values.sum

    exp_scores.transform_values { |s| (s / total * 100).round(2) }
  end

  private

  def tokenize(text)
    text
      .downcase
      .gsub(/[^a-z0-9\s]/, '')
      .split
      .reject { |w| w.length < 2 }
      .reject { |w| STOP_WORDS.include?(w) }
  end

  def extract_features(text)
    words = tokenize(text)
    bigrams = words.each_cons(2).map { |pair| pair.join(' ') }
    words + bigrams
  end

  def score_for(category, text)
    prior = Math.log(@category_counts[category].to_f / @total_docs)
    words = extract_features(text)
    likelihood = words.sum { |word| word_probability(category, word) }
    prior + likelihood
  end

  def word_probability(category, word)
    word_count = @word_counts[category][word] + 1
    total_words = @word_counts[category].values.sum + @vocabulary.size
    Math.log(word_count.to_f / total_words)
  end
end

# Run it
if __FILE__ == $0
  classifier = NaiveBayesClassifier.new

  # Training data
  classifier.train(:spam, "Buy cheap watches now special offer")
  classifier.train(:spam, "Congratulations you won click here")
  classifier.train(:spam, "Free money make cash fast")
  classifier.train(:ham, "Meeting tomorrow at 10am")
  classifier.train(:ham, "Please review the attached document")
  classifier.train(:ham, "Thanks for your help with the project")

  # Test
  test = "Click here to win free cash"
  puts "Input: #{test}"
  puts "Category: #{classifier.classify(test)}"
  puts "Probabilities: #{classifier.probabilities(test)}"
end

When to Use Naive Bayes

Good for:

  • Spam filtering
  • Sentiment analysis
  • Document categorization
  • Any problem with clear word-to-category relationships

Not great for:

  • Complex relationships between words
  • Small training datasets
  • Cases where word order matters a lot

Wrapping Up

You just built a text classifier from scratch. No external libraries. Pure Ruby.

The "naive" in Naive Bayes assumes words are independent. They're not. But the classifier still works surprisingly well in practice.

Start simple. Add training data. Test accuracy. Improve tokenization. That's the workflow.

The code is yours. Modify it. Break it. Make it better.