Fields: extracting structure from AWS markdown for BM25F

The previous article killed tiers, rewrote the on-disk posting format to be field-keyed, and threaded the new type through seven files. The scaffold for BM25F was in place — what wasn't in place was the actual scoring math, or any clarity on what those four fields would even be. The new posting format expected {Title, Headers, Code, Body} but I hadn't extracted any of those from a markdown file yet.

This article is about three things woven together: deciding what the field set should be by analyzing the corpus, implementing the BM25F math that turns per-field term frequencies into a score, and one small bug along the way where the standard Rust regex crate had a different idea than Python about what counts as a regex.

Figure: one AWS doc, split into four fields

title (the first H1, one per doc)
RunInstances

headers (all H2/H3/H4, concatenated)
Request Parameters · Response Elements · Errors · Examples · Example: Launch one instance · See Also

code (fenced blocks + inline spans)
POST / HTTP/1.1 Host: ec2.amazonaws.com Action=RunInstances & Version=... & ImageId=ami-...

body (everything else)
Launches the specified number of instances using an AMI for which you have permissions. You can specify a number of options, or leave the default options...

Why fields, not flat bag of words

Flat BM25 treats a document as a multiset of tokens with one length. S3 in the title of a 200-word doc and S3 sprinkled 15 times through the body of a 2000-word doc both turn into the same tf number adjusted by the same dl/avgdl. The structural difference — that one is a navigational signal and the other is incidental usage — gets averaged away.

BM25F treats a document as a fixed set of typed buckets, each with its own length and its own length-normalization tuning. Every (term, doc) pair has four tf values, one per field; every doc has four dl values; and the field weights decide how much each bucket contributes to the final score. The IDF stays global, because the rarity of a term across the corpus is a property of the term, not the field, but everything else splits per field.

The shift in data structure is small — one extra layer of nesting. What it preserves is the structural signal a flat bag of words throws away.
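To make the "one extra layer of nesting" concrete, here is a minimal Python sketch. The type names (FlatPosting, FieldedPosting) and the sample counts are illustrative assumptions, not the engine's actual on-disk layout.

```python
from dataclasses import dataclass

FIELDS = ("title", "headers", "code", "body")

@dataclass
class FlatPosting:
    doc_id: int
    tf: int  # one count for the whole document

@dataclass
class FieldedPosting:
    doc_id: int
    tf: dict  # field name -> count of the term in that field

    def flatten(self) -> FlatPosting:
        # Collapsing to flat BM25 input discards where the hits occurred.
        return FlatPosting(self.doc_id, sum(self.tf.values()))

# Hypothetical docs: one hit in the title vs. one hit somewhere in the body.
nav = FieldedPosting(1, {"title": 1, "headers": 0, "code": 0, "body": 0})
incidental = FieldedPosting(2, {"title": 0, "headers": 0, "code": 0, "body": 1})
```

Flattened, the two postings are indistinguishable (both have tf = 1); fielded, the scorer can still tell a title hit from a body hit.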

Choosing the field set

The question I had to answer first: what fields exist in AWS markdown, and which ones are worth distinguishing? Markdown has plenty of structural elements — # headers, **bold** emphasis, fenced code blocks, inline backticks, list items, links, anchor tags. Every one of them is a candidate. But more fields means more weights to tune, more bytes per posting, and more potential for overfitting to the corpus.

I wrote a Python script (analyze_fields.py) that walks the entire 14,266-doc corpus and extracts candidate field text using the same regex patterns the Rust extractor would eventually use. For each doc it computes, per candidate field: the raw character length, the token count after running through a Python port of the Rust tokenizer, and whether the field is even present. Across the corpus it aggregates: total tokens per field, doc coverage (what fraction of docs have a non-empty version of this field), avg/median/min/max field length. The output lands in analysis_summary.md, field_stats.csv, and a folder of 8 sample-doc dumps for sanity checking.
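The per-field aggregation can be sketched roughly as follows. This is not the contents of analyze_fields.py; the function name field_stats and the sample counts are assumptions made for illustration.

```python
from statistics import mean, median

def field_stats(token_counts):
    """token_counts: one token count per doc for a single field (0 = field absent)."""
    present = [n for n in token_counts if n > 0]
    return {
        "total_tokens": sum(token_counts),
        "coverage": len(present) / len(token_counts),
        # Length statistics are taken over docs where the field is non-empty.
        "avg": mean(present) if present else 0.0,
        "median": median(present) if present else 0,
        "min": min(present, default=0),
        "max": max(present, default=0),
    }

# Hypothetical header token counts for four docs; one doc has no headers.
stats = field_stats([12, 0, 8, 40])
```

For the four sample docs this gives 75% coverage and an average of 20 tokens over the three docs that actually have headers.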

The starting candidate set was six: title (first H1), headers (all H2/H3/H4), code (fenced + inline), bold_labels (the **Cause:** / **Type:** pattern that fills API reference docs), links (anchor text from [text](url)), and body (everything else).

I cut bold_labels and links within the first analysis pass. Bold labels are inconsistent: in API reference docs they're structural markers like **Cause:**, but in narrative prose they're just emphasis on a random word. Splitting them out would give them weight that isn't earned uniformly. They lump into body. Links are anchor text from [ListDevices](...) patterns, and the anchor text is almost always either the same as a token already in the body or a URL fragment that adds nothing. Lump into body.
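With those two folded into body, the remaining four-way split can be sketched in Python. The regexes and the extract_fields name below are assumptions for illustration, not the patterns the analysis script or the Rust extractor actually use.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fence
RE_FENCED = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
RE_INLINE = re.compile(r"`([^`\n]+)`")
RE_HEADING = re.compile(r"^(#{1,4})[ \t]+(.+)$", re.MULTILINE)

def extract_fields(md):
    code, headers = [], []
    title = ""

    def take_code(m):
        code.append(m.group(1).strip())
        return " "

    def take_heading(m):
        nonlocal title
        level, text = len(m.group(1)), m.group(2).strip()
        if level == 1 and not title:
            title = text          # first H1 only
        elif level >= 2:
            headers.append(text)  # H2/H3/H4
        else:
            return text           # later H1s stay in body (a choice made for this sketch)
        return " "

    # Code comes out first so '#' comments inside code are never read as headings.
    rest = RE_FENCED.sub(take_code, md)
    rest = RE_INLINE.sub(take_code, rest)
    body = RE_HEADING.sub(take_heading, rest)
    return {
        "title": title,
        "headers": " ".join(headers),
        "code": " ".join(code),
        "body": " ".join(body.split()),
    }

sample = "\n".join([
    "# RunInstances",
    "Launches instances using an `AMI`.",
    "## Request Parameters",
    FENCE + "http",
    "# not a heading",
    "Action=RunInstances",
    FENCE,
    "See the docs.",
])
fields = extract_fields(sample)
```

The ordering is the one design choice worth noting: pulling code out before headings keeps a commented line inside a code block from being mistaken for an H1.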

The final four are what survived: title, headers, code, body. Here's what the analysis showed for the 14,266-doc corpus:

field	total tokens	doc coverage	avg len (non-empty docs)	median	max
`title`	68,452	100%	4.8	5	21
`headers`	188,365	75%	17.6	11	913
`code`	1,718,329	76%	158	36	19,528
`body`	8,411,992	99.98%	590	338	33,162

Three things came out of this table that I didn't expect.

Title is short and uniform. Average 4.8 tokens, max 21. Short and predictable is what makes title a strong field — a hit there is information-dense because there's almost no other content competing with it.

Code has wild variance. Mean 158 but median 36 and max 19,528. There's some doc with twenty thousand tokens of code in it. Either a legitimate huge API reference with dozens of large examples, or a doc that's mostly a code dump with a thin prose wrapper. Either way, length normalization on the code field is going to need to be aggressive, because otherwise the long-code docs will dominate every query that touches code-shaped tokens.

Headers maxes out at 913. I spot-checked this one because it looked suspicious: 913 tokens of section headers in one doc is a lot. The doc turned out to be IAM/.../access-analyzer-reference-policy-checks.md, which has fifty-plus ## Error – <name> sections, one per policy check. Legitimate. No bug, just a doc whose actual structure is dozens of similarly named subsections.

The sample dumps in the output folder were what gave me confidence the regex extraction was clean. Every dump shows the raw extracted text per field and the first 40 tokens. I read through eight of them. Code didn't bleed into body. Body didn't bleed into headers. The token streams looked like what I'd produce by hand if I were highlighting the fields with a marker.

Except for one thing.

The bold marker problem

Every API reference body field had garbage tokens like ** showing up. Look at the body extraction for API_AcceleratorCount.md:

['**', 'max', '**', 'request', '**', 'max', '**', 'response',
 'the', 'maximum', 'number', 'of', 'accelerators', ...]

Four of the first eight body tokens for this one doc are **. Across the corpus, every API reference page that uses the ** FieldName ** labeling pattern was producing these. They carry no meaning as search terms: nobody types ** into a search box. They appear in every API doc, so their IDF will be low and they'll be useless for ranking, while still taking up bytes in the index and adding noise to the term vocabulary.

The earlier tokenizer intentionally kept * in the character keep-list, because IAM wildcards like s3:Get* need the trailing star preserved. So when raw markdown like ** Max ** (request) hit the tokenizer, the ** survived as its own token rather than being treated as splitter punctuation. The tokenizer comment had even called this out explicitly:

// The tokenizer is NOT responsible for stripping markdown syntax.
// That's the adapter's job (next layer up). Here we just verify
// that the tokenizer behaves predictably when fed raw markdown.

The "adapter" was the field-extraction layer I was now writing. The tokenizer was right to punt — markdown stripping is a structural-text concern, not a tokenization concern. The fix was to strip emphasis markers from the field text before handing it to the tokenizer, and only from the non-code fields. Code field stays raw because * has semantic meaning there (IAM wildcards in JSON examples, glob patterns in CLI snippets, even C pointers in code samples). Strip ** from a code block and you corrupt the data.

The Python version of the fix was three lines:

import re

RE_EMPHASIS_DOUBLE = re.compile(r"\*\*|__")
RE_EMPHASIS_SINGLE = re.compile(r"(?<![A-Za-z0-9_])[*_](?![A-Za-z0-9_])")

# applied to title, headers, body — NOT code
def strip_emphasis(s):
    s = RE_EMPHASIS_DOUBLE.sub(" ", s)
    return RE_EMPHASIS_SINGLE.sub(" ", s)

The single-emphasis regex uses lookbehind and lookahead to only strip * or _ when neither side is alphanumeric — so my_var and s3:Get* survive untouched, but **Note** and * italic * get stripped. The double-emphasis regex catches ** and __ directly. Order matters: doubles get stripped first so the single-emphasis pass doesn't see them as two adjacent singles.
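A quick way to check those claims is to run the two patterns over the cases just mentioned. The patterns are copied from the fix above; the words helper and its whitespace split are a stand-in for the real tokenizer, used only for this illustration.

```python
import re

# Patterns copied from the article's Python fix.
RE_EMPHASIS_DOUBLE = re.compile(r"\*\*|__")
RE_EMPHASIS_SINGLE = re.compile(r"(?<![A-Za-z0-9_])[*_](?![A-Za-z0-9_])")

def strip_emphasis(s):
    s = RE_EMPHASIS_DOUBLE.sub(" ", s)
    return RE_EMPHASIS_SINGLE.sub(" ", s)

def words(s):
    # Whitespace split stands in for the real tokenizer here.
    return strip_emphasis(s).split()

cases = {
    "** Max ** (request)": ["Max", "(request)"],
    "**Note** read this": ["Note", "read", "this"],
    "* italic *": ["italic"],
    "my_var": ["my_var"],
    "s3:Get*": ["s3:Get*"],
    "*italic*": ["*italic*"],  # tightly wrapped emphasis is left alone
}

for text, expected in cases.items():
    assert words(text) == expected, (text, words(text))
```

The last case shows a limit of the single-marker pattern: because it requires a non-word character on both sides, emphasis written without spaces passes through unchanged.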

Reran the analysis. The ** tokens were gone. The token stream for the same API doc now read:

['max', 'request', 'max', 'response',
 'the', 'maximum', 'number', 'of', 'accelerators', ...]

Clean.

Then I ported the same logic to Rust, ran the indexer, and got this panic:

thread 'main' panicked at src/field_extract.rs:40:83:
regex parse error:
    (?<![A-Za-z0-9_])[*_](?![A-Za-z0-9_])
    ^^^^
error: look-around, including look-ahead and look-behind,
       is not supported

A digression about regex engines

Python's re module supports lookbehind and lookahead. Rust's regex crate, the de facto standard, doesn't, and not by accident.

Rust's regex crate guarantees matching time linear in the input length by restricting itself to patterns that compile into a finite automaton. Lookaround doesn't fit that model, and engines that support it are typically backtracking engines, whose worst-case matching time can be exponential in input length. Python's re accepts that tradeoff. Rust's regex doesn't. The corollary is that the crate refuses to compile any pattern containing lookaround, and you have to express your intent without it.

The fix is to capture the boundary characters explicitly instead of asserting them with lookaround:

// before: lookbehind/lookahead (Python-style)
r"(?<![A-Za-z0-9_])[*_](?![A-Za-z0-9_])"

// after: capture groups (linear-time-safe)
r"(^|[^A-Za-z0-9_])[*_]($|[^A-Za-z0-9_])"

And then in the replacement, reference the captured boundaries:

r.emphasis_single.replace_all(s, "$1 $2")

Same intent, different mechanism: instead of saying "match a * that isn't surrounded by word chars," it says "match a * plus its non-word-char boundaries on either side, then preserve those boundaries and replace the * with a space." The captured boundary chars ($1 and $2) stay in the output. The * doesn't.

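This works in Rust's regex because there's no lookaround: the regex is a pure left-to-right scan with one capture per boundary. Linear time. Strict guarantee preserved. One difference from the lookaround version is that the boundary characters are now consumed by the match, so two markers separated by a single character (as in "* *") can't both match in one pass; the second one survives unless the replacement runs again.

I could have switched to the fancy-regex crate, which does support lookaround. I didn't. Adding a dependency to work around one regex wasn't worth it when a small rewrite of the pattern handles the case.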
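Since Python's re accepts both forms, the two patterns can be compared side by side there. This is a sketch for checking equivalence, not the Rust code; note that Python spells the replacement as \1 \2 where the Rust crate uses $1 $2, and the function names are made up for the comparison.

```python
import re

LOOKAROUND = re.compile(r"(?<![A-Za-z0-9_])[*_](?![A-Za-z0-9_])")
CAPTURING = re.compile(r"(^|[^A-Za-z0-9_])[*_]($|[^A-Za-z0-9_])")

def strip_lookaround(s):
    return LOOKAROUND.sub(" ", s)

def strip_capturing(s):
    # \1 and \2 put the consumed boundary characters back; only the marker is replaced.
    return CAPTURING.sub(r"\1 \2", s)

# Inputs where the two versions produce identical output.
same = ["* italic *", "my_var", "s3:Get*", "see * here"]
agree = all(strip_lookaround(s) == strip_capturing(s) for s in same)

# Boundaries are consumed, so markers one character apart are not all caught in one pass.
edge = "* *"
one_pass = strip_capturing(edge)
two_passes = strip_capturing(one_pass)
```

On the ordinary cases the outputs agree. On the edge input, one pass of the capturing version leaves the second marker in place and a second pass removes it, which is the tradeoff for giving up lookaround.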

BM25F: the math

With the fields extracted and the bold problem gone, the scoring math could be written. BM25F is a small set of compositions on top of BM25, and the easiest way to lay it out is as the formulas themselves.

For a single (term, doc) pair across all fields:

bm25f score per term, per doc

IDF(t) = ln(1 + (N − df + 0.5) / (df + 0.5))

norm_tf_f = tf_f / (1 − b_f + b_f · dl_f / avgdl_f)

tilde_tf = Σ_f w_f · norm_tf_f

score(t, d) = IDF(t) · tilde_tf · (k₁ + 1) / (tilde_tf + k₁)
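Read as code, the four formulas compose as in the Python sketch below. The function names are assumptions; the average lengths and weights are the ones quoted elsewhere in this article, while the b values for headers and code, the sample tf counts, and df = 100 are invented for the example.

```python
import math

FIELDS = ("title", "headers", "code", "body")

def idf(n_docs, df):
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def pseudo_tf(tf, dl, avgdl, w, b):
    """Weighted, length-normalized term frequency summed over fields (tilde_tf)."""
    total = 0.0
    for f in FIELDS:
        if tf[f] == 0:
            continue
        norm = 1 - b[f] + b[f] * dl[f] / avgdl[f]
        total += w[f] * tf[f] / norm
    return total

def bm25f_term_score(tf, dl, avgdl, w, b, n_docs, df, k1=1.2):
    t = pseudo_tf(tf, dl, avgdl, w, b)
    return idf(n_docs, df) * t * (k1 + 1) / (t + k1)

AVGDL = {"title": 4.8, "headers": 13.2, "code": 120.4, "body": 589.6}
W = {"title": 3.0, "headers": 2.0, "code": 1.5, "body": 1.0}
B = {"title": 0.3, "headers": 0.5, "code": 0.5, "body": 0.75}  # headers/code b are assumed

# Hypothetical doc whose field lengths equal the corpus averages, so every
# length-normalization denominator is 1 and the pseudo-tf is easy to check by hand.
tf = {"title": 1, "headers": 0, "code": 0, "body": 2}
t = pseudo_tf(tf, AVGDL, AVGDL, W, B)  # 3.0 * 1 + 1.0 * 2, i.e. 5 up to float rounding
score = bm25f_term_score(tf, AVGDL, AVGDL, W, B, n_docs=14266, df=100)
```

Whatever the inputs, the score for one term is bounded by IDF(t) · (k₁ + 1), because the saturation factor approaches k₁ + 1 as tilde_tf grows.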

Three things distinguish this from flat BM25.

IDF stays global. A term's rarity is a property of the term, not of where the term appears. cloudfront is rare across all 14,266 docs whether it shows up in titles or bodies. One df per term, one IDF.

Length normalization is per-field. Each field has its own b_f parameter and its own avgdl_f. The whole reason for splitting into fields is that they have different length distributions and should be normalized differently. Title with mean length 4.8 and body with mean length 590 want completely different b settings — short uniform titles need almost no length adjustment, long variable bodies need a lot.

Field weighting happens inside the pseudo-tf, not outside. You don't compute BM25 per field and then sum the per-field scores. Instead you compute a single weighted-and-normalized pseudo-term-frequency across all fields, and only then plug that one number into the BM25 saturation curve. The reason matters: BM25's saturation (the tilde_tf · (k₁ + 1) / (tilde_tf + k₁) piece) only makes sense applied once. If you applied it per-field-then-summed, you'd be saturating each field independently, which would let a doc with the term spread thinly across all fields outscore a doc with the term concentrated in the best field. The pseudo-tf formulation prevents that: concentration of weighted hits in one field still produces a high tilde_tf, which the saturation then processes once.
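The difference between the two compositions shows up with a small numeric sketch. It assumes length normalization is a no-op (every field at its average length) and drops the IDF factor, which is the same for both docs; the two sample docs are invented, and the weights are the 3:2:1.5:1 ratio used later in this article.

```python
K1 = 1.2
W = {"title": 3.0, "headers": 2.0, "code": 1.5, "body": 1.0}

def saturate(x, k1=K1):
    return x * (k1 + 1) / (x + k1)

def saturate_once(tf):
    # BM25F: combine the weighted field tfs first, then saturate the single total.
    return saturate(sum(W[f] * tf[f] for f in W))

def saturate_per_field(tf):
    # The alternative: run the curve on each field, then add the weighted results.
    return sum(W[f] * saturate(tf[f]) for f in W)

# Four hits each: one per field vs. all four in the title.
spread = {"title": 1, "headers": 1, "code": 1, "body": 1}
concentrated = {"title": 4, "headers": 0, "code": 0, "body": 0}

once = (saturate_once(spread), saturate_once(concentrated))
per_field = (saturate_per_field(spread), saturate_per_field(concentrated))
```

Saturating once ranks the title-heavy doc higher; saturating per field flips the order, because each of the four fields gets its own fresh, unsaturated first hit.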

What's computed when

The math splits cleanly into two phases.

Computed once, at index time, held in RAM: term_index (the term → offset/length/doc_freq dictionary), doc_stats (per-doc per-field lengths), avg_lengths (per-field corpus averages, computed by walking doc_stats once after traversal completes), total doc count N, and the BM25FParams struct that holds k₁ and the eight tunable weights and b values. None of these depend on what the user types. Compute once, reuse for every query.

Computed at query time, per query: the spell-corrected query term list, posting lists per query term (read from disk via read_postings), the candidate doc-id set (from intersect_all), and the BM25F scores for each candidate doc. These exist only during one query, discarded after.

avg_lengths was the new variable this refactor introduced. It's a 16-byte struct (four f32s), computed once after traversal by summing per-field token counts across all docs and dividing by total doc count. Worth noting: it divides by total doc count, not by the number of docs where the field is non-empty. The BM25F semantic justification is that a doc with no headers is genuinely a doc with a zero-token headers field, and counting it in the average reflects the actual distribution the corpus has. If you divided only by non-empty docs, the avg for headers would be artificially inflated, and length normalization on docs without headers would behave inconsistently.
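The two denominators can be put side by side in a short Python sketch. The function names and the four-doc corpus are assumptions for illustration; the real computation lives in the Rust indexer.

```python
FIELDS = ("title", "headers", "code", "body")

def avg_lengths(doc_stats):
    """doc_stats: list of per-doc dicts mapping field name -> token count."""
    n = len(doc_stats)
    # Divide by every doc, including those where the field is empty.
    return {f: sum(d[f] for d in doc_stats) / n for f in FIELDS}

def avg_lengths_nonempty(doc_stats):
    # The alternative denominator, shown only for contrast.
    out = {}
    for f in FIELDS:
        present = [d[f] for d in doc_stats if d[f] > 0]
        out[f] = sum(present) / len(present) if present else 0.0
    return out

# Hypothetical corpus: one of four docs has no headers.
docs = [
    {"title": 5, "headers": 12, "code": 0, "body": 300},
    {"title": 4, "headers": 0, "code": 40, "body": 500},
    {"title": 6, "headers": 8, "code": 10, "body": 700},
    {"title": 5, "headers": 40, "code": 0, "body": 900},
]
all_docs = avg_lengths(docs)
nonempty = avg_lengths_nonempty(docs)
```

With 75% header coverage, the all-docs average is exactly 0.75 times the non-empty average (15 vs. 20 tokens here).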

One small implementation detail: the sums are accumulated as f64 and only cast down to f32 at the end. An f32 represents integers exactly only up to 2²⁴ ≈ 16.7M, so a running sum of token counts kept in f32 starts losing precision past that point. The body field hasn't reached it yet: total body tokens is ~8.4M, which an f32 can still hold exactly, but at 10× corpus scale the sum would drift. f64 sums plus a final f32 cast handle all reasonable scales at essentially no cost.
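The f32 limit is easy to reproduce. The sketch below emulates an f32 accumulator in Python by round-tripping through struct; the helper names are assumptions, and adding ones is a simplification of adding per-doc token counts.

```python
import struct

def to_f32(x):
    # Round a Python float (f64) to the nearest f32 and back.
    return struct.unpack("f", struct.pack("f", x))[0]

def add_ones_f32(start, n):
    acc = to_f32(start)
    for _ in range(n):
        acc = to_f32(acc + 1.0)  # every partial sum is rounded to f32
    return acc

LIMIT = 2 ** 24  # 16,777,216: above this, f32 cannot represent every integer

stuck = add_ones_f32(LIMIT, 10)  # f32 accumulator
exact = float(LIMIT) + 10        # f64 accumulator
final = to_f32(exact)            # single cast at the end
```

The f32 accumulator never moves off 2²⁴, because each +1 rounds straight back down, while the f64 sum followed by one cast lands on the right value.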

Picking the starting weights

BM25F adds nine tunable parameters: k₁, four field weights (w_title, w_headers, w_code, w_body), and four field-specific b values. The honest answer to "what should they be?" is that you tune them against an evaluation set — a corpus of (query, relevance judgment) pairs you can score against. I don't have that yet. Without one, weight tuning is guesswork.

What you can do is start with defensible defaults from the structure of the corpus. k₁ = 1.2 is the BM25 standard. For weights, the ordering should reflect signal-density per field: a title hit is more meaningful than a header hit, which is more meaningful than a code hit, which is more meaningful than a body hit. The ratios are a guess — 3:2:1.5:1 felt reasonable. For b values, the ordering follows length variance: title b low (around 0.3) because titles are short and uniform and don't need much length adjustment, body b at the BM25 default of 0.75 because bodies are long and varied, code somewhere between because code has high variance but its long instances are usually legitimate.
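Written down as a parameter object, the starting point looks something like this Python sketch. The layout is an assumption (the real BM25FParams is a Rust struct), and the b values for headers and code are placeholders, since the text only pins down title and body.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BM25FParams:
    k1: float = 1.2  # the usual BM25 default
    # Field weights in the 3 : 2 : 1.5 : 1 ratio from the text.
    w: dict = field(default_factory=lambda: {
        "title": 3.0, "headers": 2.0, "code": 1.5, "body": 1.0,
    })
    b: dict = field(default_factory=lambda: {
        "title": 0.3,    # short, uniform field: little length adjustment
        "headers": 0.5,  # ASSUMED: the text gives no value for headers
        "code": 0.5,     # ASSUMED: "somewhere between" title and body
        "body": 0.75,    # BM25 default
    })

    def tunable_count(self):
        return 1 + len(self.w) + len(self.b)

params = BM25FParams()
```

Keeping all nine values in one place makes it straightforward to swap in tuned numbers later without touching the scoring code.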

These are starting values. The first real test was running the engine against a small set of test queries and seeing what came back.

Results: thirteen queries

I picked thirteen queries that span different intents — navigational ("RunInstances", "vpc peering"), conceptual ("lambda cold start"), high-frequency-term ("s3", "bucket policy permissions"), the typo case ("permssion"), and one full-ARN tokenizer regression test ("arn:aws:s3"). For each query I noted the top result and graded it A/B/C/D against my own knowledge of what doc the user probably wanted.

The engine indexed 14,266 docs in 132 seconds and merged blocks in 17 seconds, ending with 166,578 unique terms in term_index and a 25.9 MB final_index.bin. Average field lengths for title and body matched the Python analysis numbers within rounding; headers and code came out lower:

Avg lengths — title: 4.8  headers: 13.2  code: 120.4  body: 589.6

(The headers and code averages are lower in Rust than Python because Rust divides by total doc count and Python had divided by non-empty doc count; 17.6 × 0.75 ≈ 13.2 and 158 × 0.76 ≈ 120, which is exactly the coverage scaling at work.)

query	top result	grade
`s3 versioning`	manage-versioning-examples.md	A
`iam policy syntax`	access_policies_policy-validator.md	C
`lambda cold start`	java-customization.md	C+
`cloudformation stack update`	using-cfn-updating-stacks-monitor-stack.md	A
`dynamodb partition key`	HowItWorks.Partitions.md	A+
`RunInstances`	ExamplePolicies_EC2.md	D
`vpc peering`	API_CreateVpcPeeringConnection.md	A
`bucket policy permissions`	object-ownership-migrating-acls-prerequisites.md	B−
`ec2 instance types`	instance-discovery.md	D+
`route53 dns record`	hosted-zones-migrating.md	C
`s3`	lifecycle-and-other-bucket-config.md	—
`arn:aws:s3`	batch-ops-iam-role-policies.md	A
`permssion` (typo)	spell-corrected to `permission`	A

Where it falls short

The failures worth looking at closely are the ones BM25F architecturally should have caught and didn't.

RunInstances returned API_RunInstances.md at rank 4, not rank 1. The doc has the term in its title, which is exactly the case BM25F is supposed to handle by boosting title-field hits. But RunInstances in the title is one token (lowercased to runinstances), competing against twenty body-field tokens of the same term in other docs. With title weight 3.0 and one title hit, the title contribution is around 6.5. With body weight 1.0 and twenty body hits saturating through BM25's k₁ curve, the body contribution is around 19. Body wins.

I tried bumping w_title from 3.0 to 5.0. Two queries moved D → C+. The rest didn't move. Cranking weights wasn't the right lever — the problem was that single-token CamelCase API names compete with twenty body mentions, and you'd need w_title close to 20 to consistently win, at which point unrelated docs with any title overlap would start dominating.

The actual fix is upstream: split RunInstances into run, instances, and the original runinstances at tokenize time, so the title field matches both words of the natural-language query "run instances" instead of neither. That's the next article.
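As a preview of the idea only, here is a rough Python sketch of that split. It is not the tokenizer change itself: the regex and the camel_tokens name are assumptions, it expects a single alphanumeric identifier as input, and it makes no attempt to handle acronyms well.

```python
import re

RE_CAMEL_PART = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+")

def camel_tokens(word):
    parts = [p.lower() for p in RE_CAMEL_PART.findall(word)]
    whole = word.lower()
    # Keep the original token too, so exact queries like "runinstances" still match.
    return parts + [whole] if len(parts) > 1 else [whole]

run = camel_tokens("RunInstances")
acc = camel_tokens("AcceleratorCount")
plain = camel_tokens("bucket")
```

RunInstances yields run, instances, and runinstances, while an ordinary lowercase word passes through as a single token.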

What I noticed from this round of failures: with per-field tf exposed, I can actually do the multiplication and watch which contribution wins. The earlier spell-corrector failure had the same shape — the bug wasn't visible until I looked at the right number. The math being explicit, per-field, is what made the RunInstances failure something I could diagnose instead of just stare at.

What I have now

A working BM25F search engine over 14,266 AWS docs. Four fields (title, headers, code, body) extracted from raw markdown by a regex-driven Rust function. Per-field length normalization and global IDF, with nine tunable parameters that have defensible defaults pending a real evaluation set. Indexing in 132 seconds, merging in 17 seconds, queries in 4–200ms depending on candidate count. A thirteen-query test set with explainable wins and explainable failures.

The failures are all variations of the same underlying issue — single-token CamelCase identifiers like RunInstances and AcceleratorCount can't be reached by natural-language queries even when they sit in the title field. That isn't a weight problem. That's a tokenizer problem, which is what the next piece is about.

Code: github.com/sreenish27/Search_engine_rs