I told my professor I would win the competition. I didn't. I came in 2nd out of 211 students. But the near-victory taught me something far more valuable than a first-place finish: what truly separates good machine learning solutions from great ones.
A 3D visualization of the language classifier model I developed, showing French (blue) and Spanish (gold) word clusters separated by a decision boundary. A detailed interactive version appears later in this article.
In our Machine Learning course at UCSD, we faced a deceptively simple challenge: distinguish Spanish words from French words based solely on spelling patterns. For example, would your algorithm recognize that "meilleur" is French and "mejor" is Spanish? Both mean "best" in English, but contain linguistic patterns that distinguish their origins.
What made this competition particularly interesting were the constraints:
Scoring was straightforward: 75% accuracy on the unseen test dataset earned full credit, and the highest accuracy among the 211 students won the competition.
What makes language classification fascinating is how it mirrors human intuition. Most people can guess that "meilleur" is French while "mejor" is Spanish based on subtle patterns. But how do you quantify these instincts for a computer?
When I first analyzed the data, I thought, "The challenge isn't building a complex model. It's helping the model 'see' what human eyes naturally detect."
Consider this: French often uses letter combinations like "eau," "eux," and "ou," while Spanish features "ll," "rr," and "ñ." These patterns aren't random—they're linguistic fingerprints shaped by centuries of language evolution.
This was my breakthrough realization: We often chase sophisticated algorithms when the real magic lies in how we represent the problem to the algorithm. Feature engineering—the art of transforming raw data into meaningful signals—is what separates good machine learning solutions from great ones.
Like teaching someone to distinguish Monet from Picasso without art history knowledge, I needed to highlight the brushstrokes and color palettes that make each artist unique.
Can you tell which language a word belongs to just by looking at it? Test your intuition:
Humans can identify languages without knowing what the words mean, which suggests Spanish and French leave distinctive fingerprints in their spelling. The following patterns were derived from the dataset: Spanish loves vowel endings (154 words ending in 'o', 123 in 'a'), while French frequently ends with consonants (117 words ending in 't'). Spanish uses characteristic double letters ('rr' in "cierra," 'll' in "belleza"), while French employs unique vowel combinations ('eau' in "beaux," 'ou' in "amour").
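A simple tally is enough to surface these ending statistics. The sketch below is not the competition code, and the word lists are tiny hypothetical samples, but it shows the idea:

```python
from collections import Counter

def ending_letter_counts(words):
    """Tally the final letter of each word to surface ending patterns."""
    return Counter(word[-1].lower() for word in words if word)

# Hypothetical toy samples; the real dataset is much larger.
spanish = ["mejor", "cierra", "belleza", "trabajo", "libro"]
french = ["meilleur", "beaux", "amour", "petit", "chat"]

print(ending_letter_counts(spanish))  # vowel endings dominate
print(ending_letter_counts(french))   # consonant endings dominate
```

Run over the full training set, counts like these become the raw material for scoring each pattern's usefulness.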
My initial approach used simple character frequencies, but accuracy plateaued at 75%. The breakthrough came when I realized word endings were disproportionately informative. The pattern 'o$' (words ending in 'o') appeared in 154 Spanish words but zero French words—making it the most predictive feature with an information gain of 0.0736.
Information gain: This metric measures how much a feature helps distinguish between classes (in this case, languages). It ranges from 0 to 1, where higher values indicate stronger predictive power. Though 0.0736 might seem modest on this scale, in language classification even values of 0.05-0.1 are considered quite significant for a single feature.
To quantify these patterns, I approached the problem like a lawyer gathering evidence:
def calculate_gini_impurity(french_count, spanish_count):
    # Gini impurity of a subset: 1 minus the sum of squared class proportions.
    total = french_count + spanish_count
    if total == 0:
        return 0
    p_french = french_count / total
    p_spanish = spanish_count / total
    return 1 - (p_french**2 + p_spanish**2)

def calculate_information_gain(total_french, total_spanish, pattern_french, pattern_spanish):
    total = total_french + total_spanish
    pattern_total = pattern_french + pattern_spanish
    non_pattern_french = total_french - pattern_french
    non_pattern_spanish = total_spanish - pattern_spanish
    non_pattern_total = total - pattern_total
    if total == 0 or pattern_total == 0 or non_pattern_total == 0:
        return 0
    p_french = total_french / total
    p_spanish = total_spanish / total
    parent_impurity = 1 - (p_french**2 + p_spanish**2)
    pattern_impurity = calculate_gini_impurity(pattern_french, pattern_spanish)
    non_pattern_impurity = calculate_gini_impurity(non_pattern_french, non_pattern_spanish)
    weighted_impurity = (pattern_total / total) * pattern_impurity + (non_pattern_total / total) * non_pattern_impurity
    return parent_impurity - weighted_impurity
This code simply measures how much our uncertainty about a word's language decreases after knowing whether it contains a specific pattern. Higher values mean the pattern is more helpful for distinguishing between languages.
The Spanish word ending 'o$' has a perfect 0.0000 impurity—it's pure Spanish signal.
My classification approach went far beyond simple pattern detection. I engineered rich linguistic features capturing the essence of language structure – calculating vowel-to-consonant ratios, measuring character entropy (a measurement of randomness or unpredictability in a set of characters), and detecting sequence patterns like diphthongs ("ia", "ie") that are more common in Spanish. For example, calculating entropy helped identify the character diversity typical in each language:
from collections import Counter
import numpy as np

# Excerpt from extract_linguistic_features function in classify.py
def extract_linguistic_features(word):
    features = []
    word = word.lower()
    # Add various features including word length, vowel ratios, etc.
    # Calculate entropy of character distribution
    char_counts = Counter(word)
    total_chars = len(word)
    entropy = -sum((count / total_chars) * np.log2(count / total_chars)
                   for count in char_counts.values())
    features.append(entropy)
    # Add more linguistic features
    return features
This code calculates how unpredictable the distribution of characters is in a text sample. Languages with more varied character usage patterns will have higher entropy values. French tends to have more diverse character combinations than Spanish, making entropy a useful discriminating feature.
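As a quick sanity check on the entropy formula, here is a standalone version (using math.log2 instead of NumPy) applied to a couple of words. A word with repeated letters scores lower than one where every letter is distinct:

```python
import math
from collections import Counter

def char_entropy(word):
    """Shannon entropy (in bits) of a word's character distribution."""
    counts = Counter(word.lower())
    total = len(word)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(char_entropy("belleza"))  # repeated 'l' and 'e' pull the entropy down
print(char_entropy("amour"))    # five distinct letters: log2(5), about 2.32 bits
```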
The system incorporated rule-based corrections for high-confidence patterns with definitive language markers ("ll" for Spanish, "eau" for French). This hybrid approach – statistical learning enriched with linguistic rules – addressed edge cases that pure statistical models missed.
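The override logic can be sketched as follows. This is a simplified illustration, not the rule set from classify.py, and the marker lists here are deliberately short:

```python
def apply_rule_overrides(word, model_prediction):
    """Override a statistical prediction when a definitive marker is present.

    A sketch of the hybrid idea; the marker lists are illustrative,
    not the full rule set used in the competition.
    """
    word = word.lower()
    spanish_markers = ["ll", "rr", "ñ"]
    french_markers = ["eau", "eux"]
    if any(m in word for m in spanish_markers):
        return "spanish"
    if any(m in word for m in french_markers):
        return "french"
    return model_prediction  # no definitive marker: trust the model

print(apply_rule_overrides("belleza", "french"))  # 'll' forces "spanish"
print(apply_rule_overrides("beaux", "spanish"))   # 'eau' forces "french"
print(apply_rule_overrides("chat", "french"))     # no marker: keeps model output
```

In practice such rules need care, since a few "definitive" markers do occasionally cross languages; the real system applied them only for high-confidence patterns.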
My initial extraction generated 7,369 potential patterns—far too many for an efficient model and likely to overfit just to my dataset. I needed to separate signal from noise through systematic filtering:
The final model used 700 carefully selected features: 455 character n-grams, 64 prefix patterns, 169 suffix patterns, and 12 distinctive language-specific patterns. This approach maintained predictive power while reducing dimensionality, creating a model that was both accurate and somewhat computationally efficient.
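A filtering pass like the one below shows the idea behind this selection: score every candidate pattern by information gain, then keep only the top k. This is a simplified sketch, and the `pattern_counts` dictionary and example numbers are illustrative rather than taken from the dataset:

```python
def select_top_features(pattern_counts, total_french, total_spanish, k=700):
    """Rank candidate patterns by information gain and keep the top k.

    pattern_counts: dict mapping pattern -> (french_count, spanish_count).
    A simplified sketch of the filtering step, not the exact classify.py code.
    """
    def gini(f, s):
        total = f + s
        if total == 0:
            return 0.0
        return 1 - (f / total) ** 2 - (s / total) ** 2

    def info_gain(f, s):
        total = total_french + total_spanish
        in_total = f + s
        out_f, out_s = total_french - f, total_spanish - s
        out_total = out_f + out_s
        if in_total == 0 or out_total == 0:
            return 0.0
        parent = gini(total_french, total_spanish)
        weighted = (in_total / total) * gini(f, s) + (out_total / total) * gini(out_f, out_s)
        return parent - weighted

    ranked = sorted(pattern_counts.items(), key=lambda kv: info_gain(*kv[1]), reverse=True)
    return [pattern for pattern, _ in ranked[:k]]

# Toy example: 'o$' is a pure Spanish signal, 'e' appears in both languages.
counts = {"o$": (0, 154), "t$": (117, 3), "e": (80, 75)}
print(select_top_features(counts, total_french=250, total_spanish=250, k=2))
```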
My "aha moment" came when examining top patterns by language. Spanish-indicative patterns were dominated by vowel endings ('o$', 'a$', 'do$'), while French-indicative patterns included consonant endings ('t$', 'nt$') and internal vowel combinations ('ai', 'ou').
The success of this approach demonstrates the power of letting the data reveal its structure. Rather than forcing complex algorithms onto simple features, I extracted complex features that allowed even a basic ridge regression model to achieve exceptional performance.
I realized that model complexity wasn't the deciding factor. "Why use a sledgehammer when a scalpel will do?" I thought. Ridge regression became my tool of choice for three reasons:
Ridge regression's mathematical beauty lies in its modified objective function: it minimizes not just the error but also the sum of squared weights. This single regularization parameter (alpha) provided all the complexity control I needed. The closed-form solution made it computationally efficient:
# This is part of the train_model function in classify.py
def train_model(X, y, alphas=None):
    if alphas is None:
        alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    y_numeric = np.array([0 if label == 'french' else 1 for label in y])
    # Perform cross-validation to find best alpha
    # ...
    # Final model training with best alpha
    XTX = X.T @ X
    reg_matrix = XTX + best_alpha * np.eye(XTX.shape[0])
    XTy = X.T @ y_numeric
    # Ridge regression closed-form solution
    beta = np.linalg.solve(reg_matrix, XTy)
    return beta, best_alpha
This code solves a system of equations to find the optimal weights while keeping them from growing too large. The alpha parameter controls how much we penalize large weights - higher values of alpha create simpler models that are less likely to overfit. The solve function efficiently calculates the weights without requiring iterative gradient descent.
The trickiest challenge was preventing overfitting on such a small dataset. "How do I ensure my model will generalize to the unseen test set?" This question led me to implement 5-fold cross-validation to select the optimal regularization strength.
Cross-validation: A technique where I divided the training data into 5 parts, trained on 4 parts, and tested on the remaining part – repeating this process 5 times with different test parts to get a reliable performance estimate.
I systematically evaluated alpha values from 0.01 to 1000, measuring how each affected out-of-sample performance. This wasn't just tuning; it was learning the data's natural complexity.
"How much should I trust these patterns versus treating them as noise?" This was the essential question my algorithm needed to answer through cross-validation.
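The alpha search described above can be sketched as a standalone 5-fold loop. This is not the classify.py implementation, and the demo at the bottom runs on synthetic data rather than the word features:

```python
import numpy as np

def cross_validate_alpha(X, y, alphas, k=5, seed=0):
    """Pick the ridge alpha with the best mean out-of-fold accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        accs = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            Xtr, ytr = X[train_idx], y[train_idx]
            Xval, yval = X[val_idx], y[val_idx]
            # Closed-form ridge solution on the training folds
            beta = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]), Xtr.T @ ytr)
            preds = (Xval @ beta > 0.5).astype(int)
            accs.append((preds == yval).mean())
        mean_acc = float(np.mean(accs))
        if mean_acc > best_acc:
            best_alpha, best_acc = alpha, mean_acc
    return best_alpha, best_acc

# Synthetic sanity check: one feature fully determines the class.
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 2))
y = (features[:, 0] > 0).astype(int)
X = np.hstack([np.ones((100, 1)), features])  # bias column plus features
alpha, acc = cross_validate_alpha(X, y, [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
print(alpha, acc)
```

On this easy synthetic task a small alpha wins; on the real 240-word dataset the same search favored much stronger regularization.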
As the competition deadline approached, I realized a single model—even well-tuned—couldn't capture all linguistic nuances. My solution? Create an ensemble of specialists.
I built distinct models for different word types: one for short words (≤4 characters), another for longer words (>7 characters), and a specialized model for words with distinctive endings. Each model "voted" on predictions with different weights based on its expertise area.
Bootstrap sampling added further stability—training eleven models on randomly sampled subsets of data and features. Each model captured slightly different linguistic signals, and their combined wisdom outperformed any individual predictor.
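The bootstrap idea can be sketched like this: eleven ridge models, each trained on a resampled set of rows and a random subset of features, combined by majority vote. This is a simplified standalone version with synthetic data, not the competition code:

```python
import numpy as np

def train_bootstrap_ensemble(X, y, n_models=11, alpha=10.0, seed=0):
    """Train ridge models on bootstrap row samples and random feature subsets."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    models = []
    for _ in range(n_models):
        rows = rng.choice(n_rows, size=n_rows, replace=True)  # bootstrap rows
        cols = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
        Xb = X[np.ix_(rows, cols)]
        # Closed-form ridge fit on this model's sample
        beta = np.linalg.solve(Xb.T @ Xb + alpha * np.eye(len(cols)), Xb.T @ y[rows])
        models.append((cols, beta))
    return models

def predict_ensemble(models, X):
    """Majority vote over each model's thresholded prediction."""
    votes = np.mean([X[:, cols] @ beta > 0.5 for cols, beta in models], axis=0)
    return (votes > 0.5).astype(int)

# Toy demo with synthetic features standing in for the word features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
ensemble = train_bootstrap_ensemble(X_demo, y_demo)
print(predict_ensemble(ensemble, X_demo)[:10])
```

Because each model sees different rows and columns, their errors are partly independent, which is what makes the averaged vote more stable than any single fit.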
Here's where things got interesting. Two days before the deadline, I realized I'd been using scikit-learn's implementation of ridge regression—violating the "build from scratch" rule! This was a big "oh no" moment.
I spent the final 48 hours reimplementing everything from scratch—ridge regression, cross-validation, ensemble methods—the entire architecture. The time pressure was intense, but it forced me to deeply understand every algorithm component. I submitted with a late pass, making me even more grateful for the second-place finish.
The final architecture wasn't particularly fancy, but it was thoughtfully constructed at every level—from feature generation to model ensembling. It demonstrated that understanding your problem domain and applying careful feature engineering with appropriate model selection can outperform blind application of more complex algorithms.
My journey developing a Spanish/French language classifier led me to explore various architectures, with my final classify.py model achieving the highest accuracy: 87.5% on train.csv, measured with an 80-20 train/validation split. This model outperformed several alternatives:
I organized my experiments into two folders: the 7 best models and the 9 worst models.
This approach significantly outperformed other models like perfect_classifier.py (81.25%), ensemble_classifier.py (81.67%), and improved_classifier.py (74.58%). While models like best_model.py and final_language_classifier.py achieved decent results (84.17%), they used off-the-shelf sklearn implementations that couldn't match my custom approach's performance and flexibility.
The blend of machine learning with linguistic rules and specialized sub-models proved most effective for language identification, demonstrating that thoughtful feature engineering and model architecture can outperform generic solutions.
To better understand how the classifier operates in a high-dimensional space, I created a 3D projection of the feature space that illustrates how Spanish and French words form distinct clusters. Below, you can explore this visualization interactively:
This visualization uses dimensionality reduction techniques to project the high-dimensional feature space into 3D, showing how the classifier separates words by language. Features are represented as vectors, with their length indicating importance. The decision boundary shows where the model transitions from classifying words as French to Spanish.
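The article doesn't specify which dimensionality reduction technique was used, but a PCA projection via SVD is one standard way to get such 3D coordinates. A minimal sketch, with a random matrix standing in for the real feature matrix:

```python
import numpy as np

def project_to_3d(X):
    """Project a high-dimensional feature matrix to 3D via PCA (SVD on centered data)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T  # coordinates along the top 3 principal components

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 700))  # 40 words, 700 features (synthetic stand-in)
coords = project_to_3d(features)
print(coords.shape)  # (40, 3)
```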
Looking at my Spanish-French classifier's errors was like solving a linguistic detective case. "Why is my model confusing 'parte' (Spanish) with French words?" I wondered.
The trickiest words fell into clear patterns: short words like "sol" (Spanish) lacked enough distinctive features, while words ending in "e" like "parte" (Spanish) and "arene" (French) confused the model because this ending is common in both languages. Cognates like "hotel" and words without strong language markers like "papel" (Spanish) or "grace" (French) were troublemakers.
I discovered these patterns by tracking words misclassified across multiple cross-validation folds and looking for common characteristics. "What if these aren't random errors but systematic weaknesses?" This hunch led me to group misclassifications by word length, endings, and character patterns.
This analysis sparked major improvements. I created position-sensitive features ("h" at beginning vs. middle), specialized sub-models for problematic word categories, and rule-based overrides for distinctive patterns. "If a human instantly recognizes 'ñ' as Spanish, my model should too!" These changes boosted overall test accuracy from 87.3% to 91.2%.
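Position-sensitive features can be encoded by tagging each pattern with where it occurs in the word. The sketch below uses illustrative feature names and a tiny pattern list, not the actual classify.py feature set:

```python
def position_sensitive_features(word, patterns=("h", "ll", "eau")):
    """Emit pattern features tagged by position: start, middle, or end."""
    word = word.lower()
    features = {}
    for p in patterns:
        features[f"{p}@start"] = int(word.startswith(p))
        features[f"{p}@end"] = int(word.endswith(p))
        # "middle" means the pattern occurs strictly inside the word
        features[f"{p}@middle"] = int(p in word[1:-1]) if len(word) > 2 else 0
    return features

print(position_sensitive_features("hotel"))    # 'h' at start only
print(position_sensitive_features("belleza"))  # 'll' in the middle
```

Splitting one "contains h" feature into three positional ones lets the model weight an initial 'h' differently from a medial one, which is exactly the distinction the error analysis exposed.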
Overfitting was unavoidable with 900+ features but only 240 training words. The solution? Strong regularization (α=10.0) and an ensemble approach that reduced overfitting by 76%, proving that sometimes the best answer isn't a single model but a carefully weighted committee.
Coming in second place in a language classification competition taught me lessons that extend far beyond Spanish and French words. Looking back, I realize the journey matters more than the grade or finish line.
Feature engineering trumps model complexity. The most sophisticated algorithm can't compensate for poor features. In retrospect, my biggest wins came from creating signals that highlighted linguistic patterns, not from tweaking model parameters. This principle applies universally: make your data speak before making your model listen.
"What if the secret to machine learning isn't in the learning at all, but in how we represent the problem?" This question transformed my approach to ML competitions. Domain knowledge is the hidden multiplier that separates good models from great ones. In every competition I've seen the same pattern: domain experts consistently outperform algorithm experts.
If I could do it again, I'd spend less time coding and more time researching how linguists encode language differences. Spending 80% of my time on feature engineering and 20% on modeling (not vice versa) would likely yield better results.
The razor-thin margins between top performers (90.3% vs. 88.9% for 5th place) highlight a profound truth about ML competitions: randomness plays a significant role in determining winners. As my professor wisely noted, "The top model isn't necessarily better than fifth place—it just happened to score higher on this test set," adding that regenerating the test set would likely crown a different champion. This isn't discouraging—it's liberating! Don't obsess over decimal places on the leaderboard. Instead, focus on robust feature engineering, domain knowledge integration, and efficient cross-validation. These principles will serve you well in any data science challenge.
Invest your time where the highest returns are. In most competitions, that's:
The best competitors don't just build good models—they reframe the problem to make it easier to solve.
The full code for this project is available on GitHub.
Special thanks to Professor Justin Eldridge for being the best ML professor one could ask for.
Thanks also to Angel for his invaluable guidance and insights from having taken the class previously.