I told my professor I would win the competition. I didn't. I came in 2nd out of 211 students. But the near-victory taught me something far more valuable than a first-place finish: what truly separates good machine learning solutions from great ones.
A 3D visualization of the language classifier model I developed, showing French (blue) and Spanish (gold) word clusters separated by a decision boundary. A detailed interactive version appears later in this article.
In our Machine Learning course at UCSD, we faced a deceptively simple challenge: distinguish Spanish words from French words based solely on spelling patterns. For example, would your algorithm recognize that "meilleur" is French and "mejor" is Spanish? Both mean "best" in English, but contain linguistic patterns that distinguish their origins.
What made this competition particularly interesting were the constraints:
Scoring was straightforward: 75% accuracy on the unseen test dataset earned full credit, and the highest accuracy among the 211 students won the competition.
What makes language classification fascinating is how it mirrors human intuition. Most people can guess that "meilleur" is French while "mejor" is Spanish based on subtle patterns. But how do you quantify these instincts for a computer?
When I first analyzed the data, I thought, "The challenge isn't building a complex model. It's helping the model 'see' what human eyes naturally detect."
Consider this: French often uses letter combinations like "eau," "eux," and "ou," while Spanish features "ll," "rr," and "ñ." These patterns aren't random—they're linguistic fingerprints shaped by centuries of language evolution.
This was my breakthrough realization: We often chase sophisticated algorithms when the real magic lies in how we represent the problem to the algorithm. Feature engineering—the art of transforming raw data into meaningful signals—is what separates good machine learning solutions from great ones.
Like teaching someone to distinguish Monet from Picasso without art history knowledge, I needed to highlight the brushstrokes and color palettes that make each artist unique.
Can you tell which language a word belongs to just by looking at it? Test your intuition:
Humans can identify languages without knowing what the words mean, which suggests Spanish and French leave distinctive fingerprints in their spelling. The following patterns were derived from the dataset: Spanish loves vowel endings (154 words ending in 'o', 123 in 'a'), while French frequently ends with consonants (117 words ending in 't'). Spanish uses characteristic double letters ('rr' in "cierra," 'll' in "belleza"), while French employs unique vowel combinations ('eau' in "beaux," 'ou' in "amour").
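A simple tally is enough to surface these ending statistics. The sketch below is not the competition code, and the word lists are tiny hypothetical samples, but it shows the idea:

```python
from collections import Counter

def ending_letter_counts(words):
    """Tally the final letter of each word to surface ending patterns."""
    return Counter(word[-1].lower() for word in words if word)

# Hypothetical toy samples; the real dataset is much larger.
spanish = ["mejor", "cierra", "belleza", "trabajo", "libro"]
french = ["meilleur", "beaux", "amour", "petit", "chat"]

print(ending_letter_counts(spanish))  # vowel endings dominate
print(ending_letter_counts(french))   # consonant endings dominate
```

Run over the full training set, counts like these become the raw material for scoring each pattern's usefulness.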
My initial approach used simple character frequencies, but accuracy plateaued at 75%. The breakthrough came when I realized word endings were disproportionately informative. The pattern 'o$' (words ending in 'o') appeared in 154 Spanish words but zero French words—making it the most predictive feature with an information gain of 0.0736.
Information gain: This metric measures how much a feature helps distinguish between classes (in this case, languages). It ranges from 0 to 1, where higher values indicate stronger predictive power. Though 0.0736 might seem modest on this scale, in language classification even values of 0.05-0.1 are considered quite significant for a single feature.
To quantify these patterns, I approached the problem like a lawyer gathering evidence:
def calculate_gini_impurity(french_count, spanish_count):
    # Gini impurity of a subset: 1 minus the sum of squared class proportions.
    total = french_count + spanish_count
    if total == 0:
        return 0
    p_french = french_count / total
    p_spanish = spanish_count / total
    return 1 - (p_french**2 + p_spanish**2)

def calculate_information_gain(total_french, total_spanish, pattern_french, pattern_spanish):
    total = total_french + total_spanish
    pattern_total = pattern_french + pattern_spanish
    non_pattern_french = total_french - pattern_french
    non_pattern_spanish = total_spanish - pattern_spanish
    non_pattern_total = total - pattern_total
    if total == 0 or pattern_total == 0 or non_pattern_total == 0:
        return 0
    p_french = total_french / total
    p_spanish = total_spanish / total
    parent_impurity = 1 - (p_french**2 + p_spanish**2)
    pattern_impurity = calculate_gini_impurity(pattern_french, pattern_spanish)
    non_pattern_impurity = calculate_gini_impurity(non_pattern_french, non_pattern_spanish)
    weighted_impurity = (pattern_total / total) * pattern_impurity + (non_pattern_total / total) * non_pattern_impurity
    return parent_impurity - weighted_impurity
This code simply measures how much our uncertainty about a word's language decreases after knowing whether it contains a specific pattern. Higher values mean the pattern is more helpful for distinguishing between languages.
The Spanish word ending 'o$' has a perfect 0.0000 impurity—it's pure Spanish signal.
My classification approach went far beyond simple pattern detection. I engineered rich linguistic features capturing the essence of language structure – calculating vowel-to-consonant ratios, measuring character entropy (a measurement of randomness or unpredictability in a set of characters), and detecting sequence patterns like diphthongs ("ia", "ie") that are more common in Spanish. For example, calculating entropy helped identify the character diversity typical in each language:
from collections import Counter
import numpy as np

# Excerpt from extract_linguistic_features function in classify.py
def extract_linguistic_features(word):
    features = []
    word = word.lower()
    # Add various features including word length, vowel ratios, etc.
    # Calculate entropy of character distribution
    char_counts = Counter(word)
    total_chars = len(word)
    entropy = -sum((count / total_chars) * np.log2(count / total_chars)
                   for count in char_counts.values())
    features.append(entropy)
    # Add more linguistic features
    return features
This code calculates how unpredictable the distribution of characters is in a text sample. Languages with more varied character usage patterns will have higher entropy values. French tends to have more diverse character combinations than Spanish, making entropy a useful discriminating feature.
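As a quick sanity check on the entropy formula, here is a standalone version (using math.log2 instead of NumPy) applied to a couple of words. A word with repeated letters scores lower than one where every letter is distinct:

```python
import math
from collections import Counter

def char_entropy(word):
    """Shannon entropy (in bits) of a word's character distribution."""
    counts = Counter(word.lower())
    total = len(word)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(char_entropy("belleza"))  # repeated 'l' and 'e' pull the entropy down
print(char_entropy("amour"))    # five distinct letters: log2(5), about 2.32 bits
```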
The system incorporated rule-based corrections for high-confidence patterns with definitive language markers ("ll" for Spanish, "eau" for French). This hybrid approach – statistical learning enriched with linguistic rules – addressed edge cases that pure statistical models missed.
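The override logic can be sketched as follows. This is a simplified illustration, not the rule set from classify.py, and the marker lists here are deliberately short:

```python
def apply_rule_overrides(word, model_prediction):
    """Override a statistical prediction when a definitive marker is present.

    A sketch of the hybrid idea; the marker lists are illustrative,
    not the full rule set used in the competition.
    """
    word = word.lower()
    spanish_markers = ["ll", "rr", "ñ"]
    french_markers = ["eau", "eux"]
    if any(m in word for m in spanish_markers):
        return "spanish"
    if any(m in word for m in french_markers):
        return "french"
    return model_prediction  # no definitive marker: trust the model

print(apply_rule_overrides("belleza", "french"))  # 'll' forces "spanish"
print(apply_rule_overrides("beaux", "spanish"))   # 'eau' forces "french"
print(apply_rule_overrides("chat", "french"))     # no marker: keeps model output
```

In practice such rules need care, since a few "definitive" markers do occasionally cross languages; the real system applied them only for high-confidence patterns.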
My initial extraction generated 7,369 potential patterns—far too many for an efficient model and likely to overfit just to my dataset. I needed to separate signal from noise through systematic filtering:
The final model used 700 carefully selected features: 455 character n-grams, 64 prefix patterns, 169 suffix patterns, and 12 distinctive language-specific patterns. This approach maintained predictive power while reducing dimensionality, creating a model that was both accurate and somewhat computationally efficient.
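A filtering pass like the one below shows the idea behind this selection: score every candidate pattern by information gain, then keep only the top k. This is a simplified sketch, and the `pattern_counts` dictionary and example numbers are illustrative rather than taken from the dataset:

```python
def select_top_features(pattern_counts, total_french, total_spanish, k=700):
    """Rank candidate patterns by information gain and keep the top k.

    pattern_counts: dict mapping pattern -> (french_count, spanish_count).
    A simplified sketch of the filtering step, not the exact classify.py code.
    """
    def gini(f, s):
        total = f + s
        if total == 0:
            return 0.0
        return 1 - (f / total) ** 2 - (s / total) ** 2

    def info_gain(f, s):
        total = total_french + total_spanish
        in_total = f + s
        out_f, out_s = total_french - f, total_spanish - s
        out_total = out_f + out_s
        if in_total == 0 or out_total == 0:
            return 0.0
        parent = gini(total_french, total_spanish)
        weighted = (in_total / total) * gini(f, s) + (out_total / total) * gini(out_f, out_s)
        return parent - weighted

    ranked = sorted(pattern_counts.items(), key=lambda kv: info_gain(*kv[1]), reverse=True)
    return [pattern for pattern, _ in ranked[:k]]

# Toy example: 'o$' is a pure Spanish signal, 'e' appears in both languages.
counts = {"o$": (0, 154), "t$": (117, 3), "e": (80, 75)}
print(select_top_features(counts, total_french=250, total_spanish=250, k=2))
```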
My "aha moment" came when examining top patterns by language. Spanish-indicative patterns were dominated by vowel endings ('o$', 'a$', 'do$'), while French-indicative patterns included consonant endings ('t$', 'nt$') and internal vowel combinations ('ai', 'ou').
The success of this approach demonstrates the power of letting the data reveal its structure. Rather than forcing complex algorithms onto simple features, I extracted complex features that allowed even a basic ridge regression model to achieve exceptional performance.
I realized that model complexity wasn't the deciding factor. "Why use a sledgehammer when a scalpel will do?" I thought. Ridge regression became my tool of choice for three reasons:
Ridge regression's mathematical beauty lies in its modified objective function: it minimizes not just the error but also the sum of squared weights. This single regularization parameter (alpha) provided all the complexity control I needed. The closed-form solution made it computationally efficient:
# This is part of the train_model function in classify.py
def train_model(X, y, alphas=None):
    if alphas is None:
        alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
    y_numeric = np.array([0 if label == 'french' else 1 for label in y])
    # Perform cross-validation to find best alpha
    # ...
    # Final model training with best alpha
    XTX = X.T @ X
    reg_matrix = XTX + best_alpha * np.eye(XTX.shape[0])
    XTy = X.T @ y_numeric
    # Ridge regression closed-form solution
    beta = np.linalg.solve(reg_matrix, XTy)
    return beta, best_alpha
This code solves a system of equations to find the optimal weights while keeping them from growing too large. The alpha parameter controls how much we penalize large weights - higher values of alpha create simpler models that are less likely to overfit. The solve function efficiently calculates the weights without requiring iterative gradient descent.
The trickiest challenge was preventing overfitting on such a small dataset. "How do I ensure my model will generalize to the unseen test set?" This question led me to implement 5-fold cross-validation to select the optimal regularization strength.
Cross-validation: A technique where I divided the training data into 5 parts, trained on 4 parts, and tested on the remaining part – repeating this process 5 times with different test parts to get a reliable performance estimate.
I systematically evaluated alpha values from 0.01 to 1000, measuring how each affected out-of-sample performance. This wasn't just tuning; it was learning the data's natural complexity.
"How much should I trust these patterns versus treating them as noise?" This was the essential question my algorithm needed to answer through cross-validation.
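The alpha search described above can be sketched as a standalone 5-fold loop. This is not the classify.py implementation, and the demo at the bottom runs on synthetic data rather than the word features:

```python
import numpy as np

def cross_validate_alpha(X, y, alphas, k=5, seed=0):
    """Pick the ridge alpha with the best mean out-of-fold accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        accs = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            Xtr, ytr = X[train_idx], y[train_idx]
            Xval, yval = X[val_idx], y[val_idx]
            # Closed-form ridge solution on the training folds
            beta = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]), Xtr.T @ ytr)
            preds = (Xval @ beta > 0.5).astype(int)
            accs.append((preds == yval).mean())
        mean_acc = float(np.mean(accs))
        if mean_acc > best_acc:
            best_alpha, best_acc = alpha, mean_acc
    return best_alpha, best_acc

# Synthetic sanity check: one feature fully determines the class.
rng = np.random.default_rng(42)
features = rng.normal(size=(100, 2))
y = (features[:, 0] > 0).astype(int)
X = np.hstack([np.ones((100, 1)), features])  # bias column plus features
alpha, acc = cross_validate_alpha(X, y, [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
print(alpha, acc)
```

On this easy synthetic task a small alpha wins; on the real 240-word dataset the same search favored much stronger regularization.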
As the competition deadline approached, I realized a single model—even well-tuned—couldn't capture all linguistic nuances. My solution? Create an ensemble of specialists.
I built distinct models for different word types: one for short words (≤4 characters), another for longer words (>7 characters), and a specialized model for words with distinctive endings. Each model "voted" on predictions with different weights based on its expertise area.
Bootstrap sampling added further stability—training eleven models on randomly sampled subsets of data and features. Each model captured slightly different linguistic signals, and their combined wisdom outperformed any individual predictor.
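The bootstrap idea can be sketched like this: eleven ridge models, each trained on a resampled set of rows and a random subset of features, combined by majority vote. This is a simplified standalone version with synthetic data, not the competition code:

```python
import numpy as np

def train_bootstrap_ensemble(X, y, n_models=11, alpha=10.0, seed=0):
    """Train ridge models on bootstrap row samples and random feature subsets."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    models = []
    for _ in range(n_models):
        rows = rng.choice(n_rows, size=n_rows, replace=True)  # bootstrap rows
        cols = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
        Xb = X[np.ix_(rows, cols)]
        # Closed-form ridge fit on this model's sample
        beta = np.linalg.solve(Xb.T @ Xb + alpha * np.eye(len(cols)), Xb.T @ y[rows])
        models.append((cols, beta))
    return models

def predict_ensemble(models, X):
    """Majority vote over each model's thresholded prediction."""
    votes = np.mean([X[:, cols] @ beta > 0.5 for cols, beta in models], axis=0)
    return (votes > 0.5).astype(int)

# Toy demo with synthetic features standing in for the word features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
ensemble = train_bootstrap_ensemble(X_demo, y_demo)
print(predict_ensemble(ensemble, X_demo)[:10])
```

Because each model sees different rows and columns, their errors are partly independent, which is what makes the averaged vote more stable than any single fit.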
Here's where things got interesting. Two days before the deadline, I realized I'd been using scikit-learn's implementation of ridge regression—violating the "build from scratch" rule! This was a big "oh no" moment.
I spent the final 48 hours reimplementing everything from scratch—ridge regression, cross-validation, ensemble methods—the entire architecture. The time pressure was intense, but it forced me to deeply understand every algorithm component. I submitted with a late pass, making me even more grateful for the second-place finish.
The final architecture wasn't particularly fancy, but it was thoughtfully constructed at every level—from feature generation to model ensembling. It demonstrated that understanding your problem domain and applying careful feature engineering with appropriate model selection can outperform blind application of more complex algorithms.
My journey developing a Spanish/French language classifier led me to explore various architectures, with my final classify.py model achieving the highest accuracy: 87.5% on train.csv, measured with an 80-20 train/validation split. This model outperformed several alternatives:
I organized my experiments into two folders: the 7 best models and the 9 worst models.
This approach significantly outperformed other models like perfect_classifier.py (81.25%), ensemble_classifier.py (81.67%), and improved_classifier.py (74.58%). While models like best_model.py and final_language_classifier.py achieved decent results (84.17%), they used off-the-shelf sklearn implementations that couldn't match my custom approach's performance and flexibility.
The blend of machine learning with linguistic rules and specialized sub-models proved most effective for language identification, demonstrating that thoughtful feature engineering and model architecture can outperform generic solutions.
To better understand how the classifier operates in a high-dimensional space, I created a 3D projection of the feature space that illustrates how Spanish and French words form distinct clusters. Below, you can explore this visualization interactively:
This visualization uses dimensionality reduction techniques to project the high-dimensional feature space into 3D, showing how the classifier separates words by language. Features are represented as vectors, with their length indicating importance. The decision boundary shows where the model transitions from classifying words as French to Spanish.
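The article doesn't specify which dimensionality reduction technique was used, but a PCA projection via SVD is one standard way to get such 3D coordinates. A minimal sketch, with a random matrix standing in for the real feature matrix:

```python
import numpy as np

def project_to_3d(X):
    """Project a high-dimensional feature matrix to 3D via PCA (SVD on centered data)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T  # coordinates along the top 3 principal components

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 700))  # 40 words, 700 features (synthetic stand-in)
coords = project_to_3d(features)
print(coords.shape)  # (40, 3)
```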
Looking at my Spanish-French classifier's errors was like solving a linguistic detective case. "Why is my model confusing 'parte' (Spanish) with French words?" I wondered.
The trickiest words fell into clear patterns: short words like "sol" (Spanish) lacked enough distinctive features, while words ending in "e" like "parte" (Spanish) and "arene" (French) confused the model because this ending is common in both languages. Cognates like "hotel" and words without strong language markers like "papel" (Spanish) or "grace" (French) were troublemakers.
I discovered these patterns by tracking words misclassified across multiple cross-validation folds and looking for common characteristics. "What if these aren't random errors but systematic weaknesses?" This hunch led me to group misclassifications by word length, endings, and character patterns.
This analysis sparked major improvements. I created position-sensitive features ("h" at beginning vs. middle), specialized sub-models for problematic word categories, and rule-based overrides for distinctive patterns. "If a human instantly recognizes 'ñ' as Spanish, my model should too!" These changes boosted overall test accuracy from 87.3% to 91.2%.
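Position-sensitive features can be encoded by tagging each pattern with where it occurs in the word. The sketch below uses illustrative feature names and a tiny pattern list, not the actual classify.py feature set:

```python
def position_sensitive_features(word, patterns=("h", "ll", "eau")):
    """Emit pattern features tagged by position: start, middle, or end."""
    word = word.lower()
    features = {}
    for p in patterns:
        features[f"{p}@start"] = int(word.startswith(p))
        features[f"{p}@end"] = int(word.endswith(p))
        # "middle" means the pattern occurs strictly inside the word
        features[f"{p}@middle"] = int(p in word[1:-1]) if len(word) > 2 else 0
    return features

print(position_sensitive_features("hotel"))    # 'h' at start only
print(position_sensitive_features("belleza"))  # 'll' in the middle
```

Splitting one "contains h" feature into three positional ones lets the model weight an initial 'h' differently from a medial one, which is exactly the distinction the error analysis exposed.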
Overfitting was unavoidable with 900+ features but only 240 training words. The solution? Strong regularization (α=10.0) and an ensemble approach that reduced overfitting by 76%, proving that sometimes the best answer isn't a single model but a carefully weighted committee.
Coming in second place in a language classification competition taught me lessons that extend far beyond Spanish and French words. Looking back, I realize the journey matters more than the grade or finish line.
Feature engineering trumps model complexity. The most sophisticated algorithm can't compensate for poor features. In retrospect, my biggest wins came from creating signals that highlighted linguistic patterns, not from tweaking model parameters. This principle applies universally: make your data speak before making your model listen.
"What if the secret to machine learning isn't in the learning at all, but in how we represent the problem?" This question transformed my approach to ML competitions. Domain knowledge is the hidden multiplier that separates good models from great ones. In every competition I've seen the same pattern: domain experts consistently outperform algorithm experts.
If I could do it again, I'd spend less time coding and more time researching how linguists encode language differences. Spending 80% of my time on feature engineering and 20% on modeling (not vice versa) would likely yield better results.
The razor-thin margins between top performers (90.3% vs. 88.9% for 5th place) highlight a profound truth about ML competitions: randomness plays a significant role in determining winners. As my professor wisely noted, "The top model isn't necessarily better than fifth place—it just happened to score higher on this test set," adding that regenerating the test set would likely crown a different champion. This isn't discouraging—it's liberating! Don't obsess over decimal places on the leaderboard. Instead, focus on robust feature engineering, domain knowledge integration, and efficient cross-validation. These principles will serve you well in any data science challenge.
Invest your time where the highest returns are. In most competitions, that's:
The best competitors don't just build good models—they reframe the problem to make it easier to solve.
The full code for this project is available on GitHub.
Special thanks to Professor Justin Eldridge for being the best ML professor one could ask for.
Thanks also to Angel for his invaluable guidance and insights from having taken the class previously.