CamelCase tokenization and the BM25F ceiling

11 min read~200 wpm

The previous article ended with one failure I couldn't tune around. The query RunInstances returned API_RunInstances.md at rank 4 instead of rank 1, even with the title field weighted 3× higher than body. I tried bumping w_title to 5.0. Two queries moved a grade. The rest didn't budge. The math is what the math is — a single-token title hit can't out-vote twenty body hits no matter how heavy you weight the title. The problem was upstream: RunInstances was one token in the index where it should have been three.

The math under the failure

Take API_RunInstances.md. Its title is the literal string RunInstances. After tokenization (lowercase, no splitting), the title field of that doc contains exactly one token: runinstances. dl_title = 1. The corpus avgdl_title is around 4.8.

For a user query RunInstances, the query becomes the term runinstances. The title contribution for this doc is:

norm_tf_title = 1 / (1 − 0.3 + 0.3 · 1.0/4.8)
              = 1 / (0.7 + 0.0625)
              = 1.31

w_title · norm_tf_title = 5.0 · 1.31 = 6.55

Now take ExamplePolicies_EC2.md — a doc that mentions RunInstances twenty times in a long body explaining IAM policy examples. Its body has tf_body ≈ 20 and dl_body ≈ 600, with corpus avgdl_body ≈ 590:

norm_tf_body = 20 / (1 − 0.75 + 0.75 · 600/590)
             = 20 / (0.25 + 0.762)
             = 19.76

w_body · norm_tf_body = 1.0 · 19.76 = 19.76

Body contribution 19.76 vs title contribution 6.55. The body wins by roughly 3×. To make the title contribution catch up, w_title would need to climb past 15, and at that point any doc with incidental title overlap on common terms would dominate every query. Cranking weights to fix one query breaks ten others.

The token runinstances can only match the exact string RunInstances as written. A user typing "run instances" with a space, or any natural phrasing like "how to run an instance", hits nothing — neither run nor instances exists as a separate term in any title, because they're locked inside the compound. The tokenizer was throwing away the structural signal the title field was supposed to carry.

What the fix looks like

The change is to split tokens on CamelCase boundaries and underscores, then emit the original token plus each split piece. Original preserved so exact-match queries still work. Pieces added so natural-language queries can hit them.

tokenization · before → after
RunInstances runinstances · run · instances
API_RunInstances api_runinstances · api · runinstances · run · instances
getXMLData getxmldata · get · xml · data
bucket bucket
API api

All pieces are emitted at the same position. RunInstances sitting at position 0 of a title becomes three tokens all sharing position 0 in the inverted index. This matters for any future proximity scoring — they represent the same actual word in the document, so they should occupy the same coordinate, not three sequential ones.

The CamelCase rules

A CamelCase boundary in plain English is "where a new word starts inside a compound." Two rules catch the cases that matter:

Rule A — lowercase or digit to uppercase. RunInstances splits at the n→I boundary. v2Update splits at the 2→U boundary. This is the standard CamelCase split.

Rule B — uppercase to uppercase-followed-by-lowercase. XMLData splits as XML · Data, not X · M · L · Data. The signal is "the current uppercase letter is the start of a new word because the next letter is lowercase — the previous uppercase letters were the tail of an acronym." Without this rule, all-caps prefixes like API and HTTP would get shredded into single letters, which would destroy them as searchable tokens.

The two rules combine cleanly. getXMLData walks through Rule A at t→X and Rule B at the L→D boundary, producing get · XML · Data. API by itself hits neither rule (Rule B needs a lowercase letter after the uppercase run, which doesn't exist) and stays as a single token. RunInstances hits Rule A once and emits two pieces.

Underscores are simpler — split on every underscore, then run CamelCase splitting on each part. API_RunInstances splits into API and RunInstances on the underscore, then RunInstances splits via Rule A, then everything plus the original gets lowercased and pushed. Five tokens emitted from one input.

Implementation

The change is two small functions in cleanup.rs. The tokenizer's main loop was already character-walking the input — I just had to stop lowercasing inside the walk (so CamelCase signal survives), trim and split the case-preserved token, then lowercase each emitted piece.

fn camelcase_split(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    let mut result = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        let is_boundary = i > 0 && (
            // rule A: lowercase/digit -> uppercase
            (chars[i-1].is_ascii_lowercase() || chars[i-1].is_ascii_digit())
                && c.is_ascii_uppercase()
            // rule B: uppercase -> uppercase-lowercase
            || (i + 1 < chars.len()
                && chars[i-1].is_ascii_uppercase()
                && c.is_ascii_uppercase()
                && chars[i+1].is_ascii_lowercase())
        );
        if is_boundary && !current.is_empty() {
            result.push(current.clone());
            current.clear();
        }
        current.push(c);
    }
    if !current.is_empty() { result.push(current); }
    if result.len() > 1 { result } else { Vec::new() }
}

It returns an empty vec when there were no boundaries to split on — that way the caller can keep the original token cleanly without checking for duplicates. The outer split_compound wraps this with underscore handling and always prepends the original lowercased token to the result.

Thirty-four unit tests covering cleanup.rs caught the edge cases. The ones that matter:

RunInstances     → [runinstances, run, instances]
API_RunInstances → [api_runinstances, api, runinstances, run, instances]
getXMLData       → [getxmldata, get, xml, data]
API              → [api]                  (no boundary)
bucket           → [bucket]               (no boundary)
s3:GetObject     → [s3:getobject, get, object]
arn:aws:s3:::my-bucket → [arn:aws:s3:::my-bucket]   (no compound to split)

The ARN case is the one I had to be careful about — the existing tokenizer keeps : as a token-internal character, so the entire ARN survives as one token. CamelCase splitting only fires inside that one token, and there's no boundary inside arn:aws:s3:::my-bucket because the colons don't trigger Rule A or B. The original tokens that the corpus depends on stay intact.

Reindexing was required. The old final_index.bin was built with the un-split tokenizer, so its postings didn't have entries for run or instances separately. A clean reindex of the 14,266-doc corpus took about 2 minutes. Vocabulary went up because every compound that splits adds new terms to the index — the original token, plus each piece — but the actual count and final file size are something I'd need to measure cleanly, not eyeball.

Results

The same thirteen-query golden set from the previous article, regraded after reindexing. The changes were not uniformly positive.

querybeforeafterchange
s3 versioningAA=
iam policy syntaxCC=
lambda cold startC+C+=
cloudformation stack updateAA=
dynamodb partition keyA+Bregressed
RunInstancesDC+improved
vpc peeringAA+improved
bucket policy permissionsB−B−=
ec2 instance typesD+C+improved
route53 dns recordCC=
arn:aws:s3AA=
permssion (typo)ACregressed

Three improvements, two regressions, the rest flat. Net positive by one query on this set, but the set itself is too small to draw a conclusion from that number.

The regressions

permssion spell-corrects to permission. Before CamelCase splitting, the top results were prose docs about permission policies. After CamelCase, the top two results are API_CreateNetworkInterfacePermission.md and API_DeleteNetworkInterfacePermission.md — API reference docs whose titles end in Permission. Splitting NetworkInterfacePermission into network · interface · permission means those titles now have a title-field hit on the query term where they didn't before. They climbed.

Whether this is actually a regression depends on intent. A user typing "permission" looking for the policy-language overview wanted the old results. A user looking for any API that touches permissions is better served by the new ones. I don't have an intent classifier, and a thirteen-query golden set isn't enough to call it.

dynamodb partition key is the harder regression. HowItWorks.Partitions.md stays at rank 1, but bp-partition-key-design.md — the best-practices doc most users probably want here — fell out of the top 10 entirely. The CamelCase splits emitted from doc titles across the corpus inflated body-field tf for tangentially-related docs (the ones with Partition or PartitionKey in their titles), pushing them up and the best-practices doc down. I don't have a good defense for that one.

What the regressions cost vs what the change unlocked

A thirteen-query golden set is too small to settle the question. The regressions show up in queries I already had — the wins show up in queries I'd never have run before, because the engine literally couldn't answer them.

Eight new queries, all using CamelCase API names that have no natural-language phrasing in the corpus body:

querytop resultgrade
DescribeInstancesAPI_DescribeInstances.mdA+
ListBucketsBucketRestrictions.md (API at #3)B
PutObjectreplication-what-is-isnot-replicated.md (API at #5)C+
CreateFunctionexample_lambda_CreateFunction_section.mdA
StartQueryExecutions3_example_athena_GettingStarted_*.mdA+
UpdateStackAPI_UpdateStack.mdA+
GetItemDAX.consistency.md (API at #6)B
DeleteUseriam_example_iam_DeleteUser_section.md (API at #3)A

Three A+'s. Two A's. Two B's. One C+. Without CamelCase splitting, none of these queries would have returned anything useful — the compound token barely intersects with the corpus. A user typing "DescribeInstances" gets API_DescribeInstances.md with score 11.89, clearly differentiated from rank 2 at 10.67. That's the behavior you'd expect.

So the change cost one solid regression and one debatable one, and bought eight previously-impossible queries plus marginal wins on three of the original thirteen. I kept it.

What the failures tell me about BM25F

Look at the queries that still don't return what they should after CamelCase: RunInstances at C+ instead of A+, PutObject at C+, iam policy syntax at C, route53 dns record at C.

For RunInstances specifically, even with title weight already at 5.0 from the previous round of tuning, CamelCase didn't move the canonical doc to rank 1. It stayed at rank 3, behind two docs that mention RunInstances heavily in body. CamelCase added run and instances as separate title tokens to API_RunInstances.md — but those same tokens were added to the titles of every other doc whose name contains the compound. The relative position shifted only slightly because the tokenization change helped everyone with the compound, not just the canonical doc.

So the remaining failure for RunInstances isn't really a tokenization problem. There's no mechanism in BM25F to encode "this is the doc whose entire purpose is to describe RunInstances." The canonical-vs-example distinction is something a human notices immediately and bag-of-words scoring has no path to. BM25F counts tokens weighted by where they appear — what a doc is "about" isn't visible to it.

That's where the algorithm runs out of headroom. Field weighting gives a coarse signal about where a term appears, and per-field length normalization gives a coarse signal about how concentrated the term is. Neither is the same as understanding that API_RunInstances.md is the canonical reference and ExamplePolicies_EC2.md is an example doc that happens to mention RunInstances in passing. To capture that distinction the engine would need something like:

None of those is small work. What I do have, that I didn't before, is failures I can describe precisely — the exact queries that hit the ceiling and the exact reason they hit it.

What I have now

A BM25F search engine with field-aware scoring, CamelCase + underscore compound splitting, and trigram spell correction. A thirteen-query golden set with regraded results and eight CamelCase-specific test queries. Indexing in ~150s. Queries in 4–200ms. Four A's on the golden set, three A+'s on the expanded set. A short list of failure cases I can describe specifically — what the math says, why the algorithm runs out of signal there, and what kind of work would push past it.

The next thing for this engine is either a re-ranker on top of BM25F's top 50 or page-level authority signals from the doc cross-link graph. Both are bigger than anything in this article.


Code: github.com/sreenish27/Search_engine_rs