Lexi

Documentation

Evaluation logic, scoring criteria, product assumptions, limitations, and open product risks.

1

How the tool works

Lexi evaluates content for retrieval fitness, not general writing quality. Its purpose is to estimate how reliably a modern language-model-driven retrieval system can identify, interpret, and reuse information from a page or section.

The tool works in two stages.

Evaluation pipeline
  Stage 1 · Intent Analysis (unscored)
    • Parent topic
    • Subtopic angle
    • Schema types
    • Expected entities
    → reference frame for scoring

  Stage 2 · Scored Evaluation (weighted)
    • Entity Coverage (0.30)
    • Relationship Clarity (0.30)
    • Messaging Clarity (0.25)
    • Neutrality (0.15)
    → overall_score
1.1

Stage 1: Intent analysis

Stage 1 is not scored. It exists to anchor the rest of the evaluation to what the content is actually trying to cover.

For each input, Lexi determines:

  • Parent topic: the broad subject domain.
  • Subtopic angle: the specific semantic angle or use case the page is addressing.
  • Schema types: the most specific likely Schema.org types that fit the page.
  • Expected entities: the core concepts, actors, objects, or terms that a strong page on this topic should address.

This stage matters because retrieval quality depends heavily on whether the content is judged against the correct semantic frame. A page about “beta invite systems” and a page about “API authentication” may overlap in terminology, but they should not be evaluated against the same entity expectations.

1.1.1

What Stage 1 is doing conceptually

Lexi first tries to answer:

  • What is this page really about?
  • What kind of thing is it trying to explain?
  • What named or implied concepts should appear if the page is complete?
  • What semantic commitments does the page appear to make?

The output of this stage becomes the reference frame for scoring. In practice, that means strong writing can still score poorly if it is judged to omit key entities for its inferred topic.

1.1.2

Why users should care

If the inferred topic is correct, the downstream scoring is usually useful and targeted.

If the inferred topic is wrong, the rest of the analysis can be directionally misleading. For that reason, Stage 1 should be treated as the first validity check on any evaluation. Users should verify that Lexi has identified the page correctly before acting on the score.

1.2

Stage 2: Scored evaluation

Once intent is established, Lexi evaluates the content in semantic chunks.

A semantic chunk is defined as:

  • one heading section
  • plus the paragraphs that belong to that section

Each chunk is scored independently, then rolled up into a page-level result.

This design reflects how retrieval systems often operate in practice: they do not always preserve full-document context. They frequently work with partial passages, chunks, or extracted spans. Because of that, Lexi emphasizes whether a section can stand on its own.
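The heading-plus-paragraphs chunking described above can be sketched in a few lines. This is an illustrative approximation only; Lexi's actual chunker is not published, and the markdown-heading heuristic here is an assumption.

```python
import re

def chunk_by_headings(markdown_text):
    """Split a markdown document into (heading, body) chunks.

    Sketch of heading-based chunking: one heading section plus the
    paragraphs that belong to it. Details are assumptions, not Lexi's
    actual implementation.
    """
    chunks = []
    heading = None
    body = []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):  # any markdown heading level
            if heading is not None or body:
                chunks.append((heading, "\n".join(body).strip()))
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    if heading is not None or body:
        chunks.append((heading, "\n".join(body).strip()))
    return chunks

doc = "# Invites\nCodes expire after 7 days.\n\n# Sessions\nTokens are signed."
print(chunk_by_headings(doc))
```

Each tuple in the result is one candidate "retrievable unit" in the sense used above: it is scored as if no other chunk were available.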

1.2.1

What Stage 2 is measuring

For each chunk, Lexi asks questions like:

  • Does this section contain the entities it should contain?
  • Are the relationships between those entities explicit?
  • Can the chunk be understood without relying on surrounding sections?
  • Does it read as factual and extractable, rather than promotional or evasive?

The key principle is local extractability. If an important claim is only understandable after reading three earlier sections, Lexi will usually treat that as a weakness.

1.2.2

Why this chunk-based approach exists

Many documents are coherent for human readers because humans naturally carry forward context. Retrieval systems are less reliable in that respect. A model may encounter only one chunk, or a chunk may be extracted without the context that made it meaningful in the full page.

Lexi therefore favors writing that makes complete claims within a bounded section.

Semantic chunk — the unit of evaluation: a single unit extracted from the page structure that must stand alone, with no surrounding context.

1.3

Output format

Lexi produces two output layers:

  • a streaming narrative explanation
  • a structured JSON block

The narrative explains what the tool sees and why it is assigning scores. The JSON provides machine-readable structure for rendering, storage, QA, or downstream processing.

This dual-output model serves two audiences at once: humans, who need diagnostic interpretation, and systems, which need normalized structured fields.

2

Scoring system

Lexi uses a weighted scoring framework designed for retrieval-oriented analysis.

2.1

Score scale and formula

Each sub-signal is scored on a 0–10 scale. Each criterion score is calculated as a weighted average of its sub-signals. The final page score is calculated as:

overall_score = sum(criterion_score × weight) × 10

The score ceiling is 97. Scores above 97 are never produced.

This ceiling is intentional. It prevents the interface from implying that a page is “perfect,” and it preserves headroom for exceptional performance without normalizing inflated scores.
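The formula and cap above can be sketched directly. The weights and the 97 ceiling come from this documentation; the rounding behavior is an assumption for illustration.

```python
# Criterion weights as documented (sum to 1.0).
WEIGHTS = {
    "entity_coverage": 0.30,
    "relationship_clarity": 0.30,
    "messaging_clarity": 0.25,
    "neutrality": 0.15,
}

def overall_score(criterion_scores):
    """Weighted sum of 0-10 criterion scores, scaled to 0-100, capped at 97.

    Sketch of the documented formula; rounding to an integer is an
    assumption, not a documented detail.
    """
    raw = sum(criterion_scores[name] * w for name, w in WEIGHTS.items()) * 10
    return min(round(raw), 97)

# Hypothetical criterion scores for a strong but not flawless page.
scores = {
    "entity_coverage": 8.5,
    "relationship_clarity": 8.0,
    "messaging_clarity": 9.0,
    "neutrality": 7.5,
}
print(overall_score(scores))  # → 83
```

Note that a page scoring 10 on every criterion still reports 97, reflecting the intentional ceiling.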

2.2

Scoring philosophy

Lexi is intentionally conservative at the high end.

A 9.0–9.5 sub-score is reserved for genuinely exceptional execution. It should not be awarded for merely competent or even very good content. This matters because many scoring systems become uninformative when they cluster too much content in the 9+ range.

Lexi also recognizes that some sub-signals have natural ceilings depending on topic type. For example, a topic that does not require multi-hop reasoning should not be expected to earn a perfect score on chain-based reasoning sub-signals. In those cases, a 7–8 may represent fully adequate performance rather than a deficiency.

2.3

What the score is for

The score is meant to be:

  • a compressed summary of retrieval-oriented structural quality
  • a way to compare drafts, sections, or revisions
  • a prioritization aid for editing

The score is not meant to be:

  • a measure of factual correctness
  • a universal measure of writing quality
  • a direct predictor of search ranking or AI answer inclusion

Users should treat the score as a diagnostic index, not a guarantee.

3

Scoring criteria

Lexi scores content across four criteria.

Criterion weights

  • Entity Coverage: 0.30
  • Relationship Clarity: 0.30
  • Messaging Clarity: 0.25
  • Neutrality: 0.15
  • Total: 1.00
3.1

Entity Coverage

Weight: 0.30 · Primary criterion

Entity Coverage measures how completely the content addresses the entities that should appear for the inferred topic and subtopic.

This criterion reflects a simple retrieval reality: if the relevant concepts are absent, a retrieval system has less evidence to connect the content to the right questions.

3.1.1

Sub-signals

Core entity presence
Checks whether the primary entities for the subtopic are present at all.
Entity prominence
Checks whether those entities are substantively addressed, rather than merely mentioned in passing.
Entity completeness
Checks whether secondary and supporting entities are also present.
3.1.2

Interpretation

A page can fail Entity Coverage even if it sounds polished. Lexi is not satisfied by elegant prose that leaves out the core terms, actors, concepts, or objects needed to establish semantic completeness.

A single mention does not count as strong coverage. The entity has to be developed enough that a retrieval system could plausibly connect the page to related questions about that entity.
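The distinction between a passing mention and a developed entity can be illustrated with a crude mention counter. This is a lexical sketch only; Lexi's actual sub-signal is model-judged, and the threshold and example entities here are invented.

```python
def entity_prominence(chunk_text, expected_entities, min_mentions=2):
    """Rough proxy for entity prominence: entities mentioned fewer than
    `min_mentions` times are flagged as mentioned in passing.

    A sketch only; the real signal weighs substance, not raw counts.
    """
    text = chunk_text.lower()
    report = {}
    for entity in expected_entities:
        count = text.count(entity.lower())
        if count == 0:
            report[entity] = "missing"
        elif count < min_mentions:
            report[entity] = "mentioned in passing"
        else:
            report[entity] = "developed"
    return report

# Hypothetical chunk and expected-entity list for illustration.
chunk = ("Invite codes gate beta access. Each invite code maps to a cohort, "
         "and cohorts control feature rollout.")
print(entity_prominence(chunk, ["invite code", "cohort", "session token"]))
```

In this toy example, "session token" would surface as a missing expected entity even though the prose itself reads cleanly.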

3.1.3

Typical failure patterns

  • Important entities never named explicitly
  • Core concepts mentioned once but never explained
  • Secondary concepts omitted, making the page too narrow
  • The page assumes prior knowledge and skips basic entity grounding
3.1.4

What to do when this score is weak

Inspect whether the content actually names and explains the entities a reader or model would expect. Improvements usually involve:

  • explicitly naming missing entities
  • expanding treatment of thinly mentioned concepts
  • adding supporting concepts that clarify the core topic
3.2

Relationship Clarity

Weight: 0.30 · Primary criterion

Relationship Clarity measures whether entities are connected through explicit, extractable relationships within a single chunk.

Lexi is looking for clear subject-verb-object structures and qualified claims that survive extraction. It is not enough that a human reader could infer the relationship by combining scattered clues.

3.2.1

Sub-signals

Predicate completeness
Checks whether relationships are expressed as complete propositions.
Condition specificity
Checks whether claims are qualified with conditions, thresholds, baselines, or measurable context.
Relationship directionality
Checks whether it is clear which entity acts on which.
Chain completeness
Checks whether multi-step reasoning is traceable when the topic requires it.
3.2.2

Interpretation

This criterion is central to Lexi because retrieval systems often work best when claims are explicit and local. A chunk that says “Invite codes create a signed session token valid for seven days” is stronger than a chunk that separately mentions invite codes, session tokens, and validity period without tying them together.

Directionality matters. “A depends on B” is not the same as “B depends on A.” If the sentence leaves agency or dependency unclear, extraction reliability falls.

Condition specificity also matters. Relative claims such as “much faster,” “better,” or “significantly lower” are weak unless they specify compared to what.

3.2.3

Typical failure patterns

  • Relationships implied but not stated
  • Ambiguous verbs or missing agents
  • Conditions omitted
  • Cause and effect blurred
  • Important logic split across multiple chunks
3.2.4

What to do when this score is weak

Rewrite key claims so the section itself states what is acting, what it acts on, under what conditions, and with what consequence. Explicitness usually helps more than stylistic elegance here.

3.3

Messaging Clarity

Weight: 0.25 · Primary criterion

Messaging Clarity measures whether each chunk is self-contained, intelligible, and unambiguous.

This criterion reflects the fact that a chunk may be extracted out of context. The opening sentence, key referents, conditions, and definitions therefore need to function locally.

3.3.1

Sub-signals

Chunk opening quality
Checks whether the first sentence of the chunk stands alone as a useful claim.
Cross-chunk resolution
Checks whether the section begins with unresolved references like “this,” “it,” or “they” without naming the subject locally.
Condition completeness
Checks whether relative or conditional language is properly anchored.
Definition adequacy
Checks whether the entities are sufficiently defined for the expected audience and angle.
3.3.2

Interpretation

A section opener like “This allows more flexibility” is weak unless the same chunk names what “this” refers to. Human readers may recover the referent from previous context, but extraction systems may not.

Lexi allows normal within-paragraph pronoun use if the subject is named in the first sentence of that paragraph. The concern is not pronouns themselves; it is unresolved reference at chunk boundaries.

Definition adequacy is audience-relative. A highly technical page can assume more prior knowledge than a general explainer, but it still has to define terms enough for the intended semantic task.
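The unresolved-opener concern can be approximated with a simple heuristic check. This is a sketch of the idea, not Lexi's cross-chunk-resolution sub-signal, which is model-judged and more nuanced (for instance, "This section" may be acceptable); the flag structure mirrors the flagged-item fields documented later but the values are invented.

```python
import re

# Referents that are ambiguous when a chunk is read in isolation.
UNRESOLVED_OPENERS = re.compile(r"^(this|it|they|these|those)\b", re.IGNORECASE)

def flag_unresolved_opening(chunk_heading, chunk_text):
    """Flag a chunk whose first sentence opens with an unresolved referent.

    Returns a flagged-item-style dict, or None if the opener stands alone.
    """
    first_sentence = chunk_text.strip().split(".")[0].strip()
    if UNRESOLVED_OPENERS.match(first_sentence):
        return {
            "chunk_heading": chunk_heading,
            "text": first_sentence + ".",
            "failure_type": "unresolved_opening_referent",
        }
    return None

print(flag_unresolved_opening("Flexibility", "This allows more flexibility."))
print(flag_unresolved_opening("Invite codes", "Invite codes gate beta access."))  # → None
```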

3.3.3

Typical failure patterns

  • Section opens with “this,” “it,” or “they”
  • Important condition words lack baselines
  • Definitions are too thin for the topic
  • The chunk depends heavily on the preceding section to make sense
3.3.4

What to do when this score is weak

Strengthen chunk openings, restate the named subject earlier, define key terms locally, and anchor relative phrases with concrete reference points.

3.4

Neutrality

Weight: 0.15 · Overlay criterion

Neutrality measures whether the content reads as extractable explanatory material rather than promotional or hedged copy.

This criterion is an overlay because it modifies trust and reuse potential rather than defining the topic itself.

3.4.1

Sub-signals

Superlative usage
Penalizes unsupported superlatives such as “best,” “ultimate,” or “unmatched.”
Brand / first-person voice
Penalizes explanatory passages framed around “we,” “our approach,” or similar brand-centered voice.
Hedge language
Penalizes phrasing that weakens extractability, such as “may,” “might,” or “could potentially,” when stronger claims would be appropriate.
Embedded CTAs
Penalizes calls to action inside explanatory sections.
Attribution quality
Provides a lift when claims are specifically attributed to named sources.
3.4.2

Interpretation

Retrieval systems are often more confident reusing neutral, factual material than overtly promotional copy. Lexi therefore treats heavy persuasion, self-positioning, or unsupported marketing language as a structural drawback for citability.

Attribution can improve this score. Clean but unsourced material may still perform reasonably, but specific attribution can increase perceived reliability and claim portability.
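A crude lexical pass can surface the superlatives and hedges these sub-signals penalize. This is a sketch only; the word lists are illustrative, and Lexi's real sub-signals also weigh context, attribution, and whether a hedge is actually appropriate.

```python
# Illustrative word lists; not Lexi's actual lexicons.
SUPERLATIVES = {"best", "ultimate", "unmatched", "world-class", "leading"}
HEDGES = {"may", "might", "could", "potentially"}

def neutrality_flags(chunk_text):
    """Collect superlative and hedge tokens appearing in a chunk."""
    words = [w.strip(".,;:!?\"'()").lower() for w in chunk_text.split()]
    return {
        "superlatives": [w for w in words if w in SUPERLATIVES],
        "hedges": [w for w in words if w in HEDGES],
    }

copy = "Our platform is the best, unmatched solution that may potentially help."
print(neutrality_flags(copy))
```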

3.4.3

Typical failure patterns

  • Marketing superlatives with no evidence
  • Explanatory copy dominated by first-person brand framing
  • Excessive hedging that prevents confident extraction
  • CTA language embedded in informational paragraphs
3.4.4

What to do when this score is weak

Separate persuasive copy from explanatory copy, remove unsupported superlatives, reduce first-person brand framing, and strengthen attribution where possible.

4

Score bands

Lexi maps final scores into qualitative bands.

  • 88–97 · Exceptionally well-optimised: Content is already highly optimized for retrieval. Remaining issues are usually narrow or marginal. Strong structural performance — not perfection.
  • 70–87 · Well-structured, minor gaps: Solid foundations with room for targeted improvement. Most content in this band is already usable for retrieval. A strong outcome in Lexi’s scoring philosophy.
  • 50–69 · Retrievable but significant gaps: Workable but structurally inconsistent. Retrieval may succeed in some contexts and fail in others. Often the range where targeted revision has the biggest payoff.
  • 0–49 · Systematic issues affecting citability: Fundamental structural problems. The page may be difficult to interpret reliably at the chunk level, or may omit too much of the expected semantic content.
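The band thresholds can be expressed as a simple mapping. The thresholds and labels come from this documentation; the function itself is an illustrative sketch.

```python
def score_band(overall_score):
    """Map a 0-97 overall score to Lexi's qualitative bands."""
    if overall_score >= 88:
        return "Exceptionally well-optimised"
    if overall_score >= 70:
        return "Well-structured, minor gaps"
    if overall_score >= 50:
        return "Retrievable but significant gaps"
    return "Systematic issues affecting citability"

print(score_band(83))  # → Well-structured, minor gaps
```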
4.5

How to interpret bands

Bands are not judgments of moral or literary quality. They are statements about retrieval reliability.

A page may convert well, sound good, and still underperform in Lexi because it is optimized for persuasion rather than extractable explanation.

5

Output structure

Lexi produces a narrative and a structured JSON payload. The structured output includes the following fields.

Evaluation output
├── Stage 1 fields
│   ├── parent_topic
│   ├── subtopic_angle
│   ├── schema_types
│   └── expected_entities
├── overall_score
└── criteria (×4)
    ├── id · name · weight · score · weighted_score · diagnostic · is_overlay
    ├── sub_signals (×n)
    │   └── name · weight · score · type · diagnostic
    └── flagged_items (×n)
        └── chunk_heading · text · failure_type · subsignal
5.1

Stage 1 fields

parent_topic
subtopic_angle
schema_types
expected_entities

These fields document the semantic frame Lexi used for the evaluation.

5.2

Page-level scoring fields

overall_score: the final weighted score, scaled to the product’s range and capped at 97.
5.3

Criterion-level fields

For each criterion, Lexi returns:

id
name
weight
score
weighted_score
diagnostic
is_overlay

This allows both human review and front-end rendering of detailed score components.

5.4

Sub-signal fields

For each sub-signal under a criterion, Lexi returns:

name
weight
score
type
diagnostic

This provides fine-grained explainability. It makes it possible to distinguish, for example, between weak entity presence and weak entity prominence, which may require different edits.
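A hypothetical criterion entry makes the field structure concrete. The field names follow the schema documented above; every value, diagnostic string, and sub-signal weight below is invented for illustration.

```python
import json

# Hypothetical example of one criterion entry in the structured output.
# Field names match the documented schema; all values are invented.
criterion = {
    "id": "entity_coverage",
    "name": "Entity Coverage",
    "weight": 0.30,
    "score": 7.8,
    "weighted_score": 2.34,  # score × weight
    "diagnostic": "Core entities present; two supporting entities underdeveloped.",
    "is_overlay": False,
    "sub_signals": [
        {"name": "Core entity presence", "weight": 0.4, "score": 9.0,
         "type": "presence", "diagnostic": "All primary entities named."},
        {"name": "Entity prominence", "weight": 0.3, "score": 7.0,
         "type": "depth", "diagnostic": "'session token' mentioned once, never developed."},
        {"name": "Entity completeness", "weight": 0.3, "score": 7.0,
         "type": "coverage", "diagnostic": "Secondary entities partially covered."},
    ],
}
print(json.dumps(criterion, indent=2))
```

Distinguishing weak presence from weak prominence at this level is what lets an editor decide whether to add an entity or develop one that is already named.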

5.5

Flagged items

Lexi can also return flagged problem instances containing:

chunk_heading
text
failure_type
subsignal

These tie diagnostics back to concrete parts of the source content.

5.6

Why the output is structured this way

The structure supports several use cases:

  • UI scorecards and diagnostics
  • future reporting or export workflows
  • quality assurance and regression testing
  • model behavior inspection
  • revision workflows at section level

The JSON is not just an implementation detail. It is part of the product’s explainability model.

6

Product assumptions

Lexi relies on a number of assumptions. These assumptions are not incidental; they shape what the score means and where it can mislead users.

6.1

The model can infer the page's true intent

Lexi assumes the model can correctly identify the page's parent topic, subtopic angle, likely schema types, and expected entities. This assumption is necessary because the scoring framework depends on topic-relative expectations.

6.1.1

What this means for users

If the page is mixed-purpose, poorly signposted, or semantically ambiguous, Lexi may anchor to the wrong interpretation. In that case, the scoring may be internally consistent but externally misaligned with the author's intention. Users should always sanity-check Stage 1 before trusting Stage 2.

6.2

Heading-based chunking is a valid unit of evaluation

Lexi assumes a heading section plus its associated paragraphs is a useful approximation of a retrievable unit. This is a practical design choice, not a universal truth about how all content is consumed.

6.2.1

What this means for users

Some content formats depend on layout, tables, captions, sidebars, footnotes, or visual sequencing. Those formats may be underspecified when reduced to heading-plus-paragraph chunks. Lexi may therefore under-credit content whose meaning emerges from structure outside the chunk.

6.3

Retrieval systems prefer explicit local relationships

Lexi assumes that retrieval systems are more reliable when claims are explicit within a single chunk than when they must be inferred across distant passages. This assumption drives the strong weighting of Relationship Clarity and Messaging Clarity.

6.3.1

What this means for users

Lexi will often prefer explicit, slightly repetitive writing over elegant, cumulative longform writing. Users optimizing for Lexi may need to repeat named subjects or restate relationships more often than a traditional editor would prefer.

6.4

Retrieval quality is separable from general writing quality

Lexi assumes that a page can be strong for retrieval while being ordinary for style, and vice versa. This is a foundational product assumption. Without it, the scoring framework would collapse into a generic writing evaluator.

6.4.1

What this means for users

A low score does not necessarily mean the page is bad for humans. It means the page is less structurally dependable for retrieval, extraction, citation, and reuse.

6.5

Neutrality improves extractability

Lexi assumes that neutral, factual phrasing is more reusable than overtly promotional or heavily hedged phrasing. This is why Neutrality appears as an overlay criterion.

6.5.1

What this means for users

Marketing content may be penalized even when it is doing its job well from a conversion perspective. Lexi's idea of "better" is specific to retrieval-oriented explanatory usefulness.

6.6

The selected model can judge consistently enough for scoring

The scoring system is currently configured around claude-sonnet-4-20250514. Lexi assumes the model is capable of making sufficiently stable judgments about topic framing, entity expectations, chunk extractability, and sub-signal diagnostics.

6.6.1

What this means for users

Scores are model-mediated. A different model, or even future changes to the same model family, may change the character of the output. Users should not treat the score as if it came from a deterministic rules engine.

6.7

The weights represent a useful theory of retrieval fitness

Lexi assumes that the chosen weights meaningfully reflect what matters most for retrieval optimization: Entity Coverage (0.30), Relationship Clarity (0.30), Messaging Clarity (0.25), Neutrality (0.15).

6.7.1

What this means for users

The output reflects a framework, not a universal law. It is an opinionated scoring system designed around a specific view of what makes content citable and extractable.

7

Limitations

Lexi’s limitations should be explicit because many users will otherwise over-interpret the result.

7.1

It does not verify factual accuracy

Lexi evaluates structure, clarity, and extractability. It does not check whether the claims are true.

7.1.1

User impact

A factually wrong page can still score well if it is well structured. Users should not treat Lexi as a fact-checker.

7.2

It is not a ranking predictor

Lexi does not prove that a page will rank better in search or appear more often in AI-generated answers.

7.2.1

User impact

The tool gives structural guidance, not guaranteed visibility outcomes. Retrieval fitness may help, but the score is not a promise of distribution.

7.3

Scores are model-mediated rather than fully deterministic

The same framework can still produce some variation across runs, versions, or edge cases.

7.3.1

User impact

Users should expect directional consistency, not exact lab-grade repeatability. Small score changes should be interpreted carefully, especially near thresholds.

7.4

Topic identification can fail on ambiguous pages

Stage 1 may anchor to the wrong parent topic or subtopic angle when the input is broad, blended, or weakly signposted.

7.4.1

User impact

Downstream scoring may be well reasoned for the wrong frame. This is one of the most important failure modes to watch.

7.5

Chunk-based scoring can penalize strong longform writing

Some writing relies intentionally on buildup, transitions, and cumulative context.

7.5.1

User impact

Content optimized for human narrative flow may underperform in Lexi compared with content optimized for local extractability.

7.6

The framework prefers explicitness and may over-penalize brevity

A concise statement can be perfectly adequate for a human reader while still being under-specified for retrieval.

7.6.1

User impact

Writers may need to be more explicit, repetitive, or definitional than they would otherwise prefer.

7.7

Promotional or brand-led pages are at a structural disadvantage

Neutrality penalizes unsupported superlatives, first-person brand voice, and embedded CTAs in explanatory sections.

7.7.1

User impact

Commercially effective marketing pages may score lower because the tool values citation-like extractability over persuasion.

7.8

The score ceiling can be misunderstood

Because the maximum reported score is 97, some users may incorrectly assume that even excellent pages are somehow incomplete.

7.8.1

User impact

97 is an intentional cap, not a system failure or hidden penalty.

7.9

Not all topics support all sub-signals equally

Some topics naturally involve dense chains and conditionality; others do not.

7.9.1

User impact

Lower sub-scores in certain areas do not always imply a defect. Sometimes they reflect the actual structure of the topic.

7.10

Theme preferences are local to the browser

Theme state is stored in localStorage.

7.10.1

User impact

A user's preferred theme will not automatically carry across devices or browsers.

7.11

BYO mode is session-scoped

In BYO mode, the API key is stored in sessionStorage.

7.11.1

User impact

The key is tied to the current browser session. Users may need to re-enter it in future sessions or on other devices.

7.12

Browser-direct API usage has trust implications

BYO mode involves browser-direct API interaction.

7.12.1

User impact

Even if the key is not sent to Lexi servers and is only stored in session storage, some users may still be uncomfortable entering an API key into a browser-based interface. This is a product trust consideration, not only a technical one.

7.13

Sponsored mode is rate-limited

Sponsored evaluation and recommendation endpoints are capped per session, and invite redemption is capped per IP.

7.13.1

User impact

Heavy users may hit temporary usage limits and need to wait before continuing.

7.14

Invite handling is simple, but not infinitely scalable

Invite codes are managed through environment configuration.

7.14.1

User impact

This works well for beta access control but is not the same as a full entitlement or account-management system.

7.15

No database-backed session record is assumed

Sponsored sessions are self-contained and HMAC-signed rather than backed by a database record.

7.15.1

User impact

This simplifies operations but may limit advanced account features such as fine-grained revocation, session history, auditability, or detailed entitlement logic unless further systems are added.
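The self-contained, HMAC-signed session pattern described above can be sketched as follows. This is an illustration of the general technique, not Lexi's actual token format, which is not published; the secret, payload fields, and encoding choices are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"  # illustrative; a real deployment keeps this in config

def sign_session(payload: dict) -> str:
    """Mint a self-contained session token: base64 payload plus HMAC signature.

    No database record is needed; the signature alone proves the server
    issued the token.
    """
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify_session(token: str):
    """Return the payload if the signature checks out, else None."""
    body, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body.encode()))

token = sign_session({"plan": "sponsored", "exp": int(time.time()) + 3600})
assert verify_session(token)["plan"] == "sponsored"
assert verify_session(token + "0") is None  # tampered tokens fail verification
```

The trade-off the section describes is visible here: verification is stateless and cheap, but there is no server-side record to revoke or audit individual sessions.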

7.16

Lexi is an opinionated scoring framework, not a universal arbiter

The framework reflects a specific theory of retrieval optimization.

7.16.1

User impact

Users should use Lexi as a decision-support tool, not a final authority on content quality.

8

Open product risks

The following risks are not just theoretical; they follow directly from how Lexi is designed.

8.1

Risk: score overinterpretation

Users may treat the score as more objective, predictive, or comprehensive than it really is. This is especially likely when a numeric output appears alongside detailed diagnostics, because numbers create a false sense of precision.

8.1.1

Why this risk exists

  • The system looks rigorous
  • The score is normalized and weighted
  • Users naturally want a single answer
  • Retrieval performance in the real world is multifactorial, but the product surfaces a compact number
8.1.2

What could go wrong

  • Users optimize for the number instead of the diagnostic substance
  • Teams compare pages too simplistically
  • Stakeholders treat the score as a KPI without understanding its boundaries
8.1.3

Mitigation

  • Explain the framework clearly
  • Position the score as a summary, not a guarantee
  • Keep diagnostics prominent
  • Explain the 97 ceiling and strict top-end philosophy
8.2

Risk: misalignment between retrieval quality and brand goals

Lexi may encourage changes that improve extractability but reduce narrative quality, persuasion, or brand distinctiveness.

8.2.1

Why this risk exists

The framework explicitly favors explicitness, local self-containment, and neutral explanatory tone. Many brand and marketing systems favor elegance, momentum across sections, emotional language, and differentiation through voice.

8.2.2

What could go wrong

  • Copy becomes repetitive or flat
  • Explanatory accuracy improves while conversion performance drops
  • Teams over-optimize informational sections without protecting commercial goals
8.2.3

Mitigation

  • Position Lexi as one optimization layer
  • Encourage selective use on explanatory content
  • Distinguish between pages meant to convert and pages meant to be cited
8.3

Risk: model drift

Changes in the underlying model may change how Lexi interprets topics, entities, or chunk quality.

8.3.1

Why this risk exists

The evaluation depends on model judgment at multiple stages: intent inference, expected entity generation, sub-signal scoring, and diagnostics. Any model change can alter these judgments.

8.3.2

What could go wrong

  • Scores shift over time for similar content
  • Historical comparisons become noisier
  • Users lose trust if outputs feel unstable
8.3.3

Mitigation

  • Version the scoring system clearly
  • Display the model version used
  • Avoid direct comparison across major model or rubric changes without caveat
8.4

Risk: ambiguous input handling

Mixed-intent or weakly structured pages may be framed incorrectly at Stage 1.

8.4.1

Why this risk exists

The product requires a coherent semantic frame before scoring. Real-world pages do not always provide one.

8.4.2

What could go wrong

  • The wrong entities are expected
  • The wrong relationship standards are applied
  • The resulting score is coherent but mis-targeted
8.4.3

Mitigation

  • Surface Stage 1 outputs prominently
  • Encourage users to validate them before acting on the rest of the result
  • Potentially provide mechanisms to constrain or correct topic framing in future versions
9

Practical interpretation guidance

Taken together, Sections 1–8 imply a clear operating principle for users.

Lexi is most useful when it is treated as a retrieval-structure diagnostic system. It is least useful when treated as a universal content judge, a truth verifier, or a guaranteed predictor of downstream visibility.

Strongest use cases

  • Comparing drafts before and after revision
  • Identifying chunk-level weaknesses in explanatory content
  • Improving citability and extractability for documentation, explainers, and knowledge pages

Weakest use cases

  • Judging factual correctness
  • Predicting traffic or ranking outcomes
  • Optimizing highly branded persuasive copy as if retrieval fitness were the only objective

The right way to use Lexi is to read the score as a summary, the Stage 1 framing as a validity check, and the diagnostics as the main source of editing value.