Methodology · v0.3.0

How TweetSim scores a tweet

A 4-layer scorer with explicit assumptions, an open-source engine, and a calibration loop that tells you when to stop trusting the score.

The honest framing

Predicting tweet performance is a measurement problem with bad data: Twitter's actual ranking algorithm is not fully documented, the public "public_metrics" field excludes impressions on most tiers, and engagement is power-law distributed so averages mislead. We don't pretend to fully model the algorithm — we model the parts that have leaked publicly (the phoenix scoring weights from Twitter's 2023 open-source release), publish the formulas, and let the calibration scorecard catch us when we drift.

Every tweet you score returns four sub-reports. Each is independently interpretable. Each is a number you can disagree with.

Layer 1 — Phoenix-style action scorer

For each of 18 X user actions, the scorer predicts the probability that a viewer will take that action when they see your tweet:

  • Positive: favorite, reply, retweet, quote, profile_click, follow_author, dwell, click, share, share_via_dm, share_via_copy_link, photo_expand, video_view
  • Negative: not_interested, mute_author, block_author, report
  • Soft: dwell_time (seconds, not a probability)

Probabilities come from text features (word count, hook punch, reply-trigger detection, sales markers, AI-tell phrases, link presence). Each probability is multiplied by the action's phoenix weight — the coefficients from Twitter's 2023 open-source release, optionally recalibrated against your own historical posts via the calibration loop.

phoenix_score = Σ (P[action] × weight[action]), normalized to 0-100

A high phoenix score (70+) means the actions Twitter boosts (replies, profile clicks, dwell) outweigh the actions it punishes (not_interested, mute). It's a proxy for "would the For-You ranker push this tweet?" — not for raw like count.
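To make the layer concrete, here is a minimal Python sketch of the weighted sum. The weight values are illustrative placeholders (loosely modeled on the widely reported 2023 coefficients), not TweetSim's shipped table:

# Sketch of the phoenix-style scorer. Weights are illustrative
# placeholders, not TweetSim's shipped coefficient table.
WEIGHTS = {
    "reply": 13.5, "profile_click": 12.0, "retweet": 1.0, "favorite": 0.5,
    "not_interested": -74.0, "report": -369.0,
}

def phoenix_score(action_probs: dict[str, float]) -> float:
    """Weighted sum of per-action probabilities, normalized to 0-100."""
    raw = sum(p * WEIGHTS[a] for a, p in action_probs.items() if a in WEIGHTS)
    best = sum(w for w in WEIGHTS.values() if w > 0)   # every positive P = 1
    worst = sum(w for w in WEIGHTS.values() if w < 0)  # every negative P = 1
    return 100 * (raw - worst) / (best - worst)

Under this normalization, a tweet that draws replies and profile clicks while avoiding mutes and reports lands high even when its favorite probability is modest, which is exactly the For-You proxy described above.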

Layer 2 — Composite quality scorer

Phoenix predicts algorithm behavior. The composite scorer predicts human behavior across 10 dimensions, each on a 1-10 scale:

  • niche_consistency: How tightly the post sits in its lane vs. drifting off-topic
  • reply_trigger_score: How likely the post invites a reply (questions, contrarian claims)
  • dwell_time_potential: Stop-scroll opening + line-by-line pull
  • profile_click_potential: Does the post make readers wonder who wrote it
  • bookmark_potential: Saveable framework / numbered list / reusable insight
  • clarity: Plain language, one idea, no jargon stacking
  • sales_relevance: Does the post serve the offer without pitch-spamming
  • annoy_risk: Engagement bait, hype words, AI tells (this is inverted — high = bad)
  • hook_strength: Punch of the first 8 words
  • contrarian_edge: Does it challenge a common belief vs. restate consensus

The composite score is a weighted blend of these dimensions — virality factors × conversion factors − annoy_risk. Tunable per-account in v1; defaults are calibrated against high-engagement public X posts.
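A minimal sketch of that blend, assuming an even split of the dimensions into the two factors; the shipped defaults weight each dimension individually:

# Sketch of the composite blend: virality × conversion − annoy_risk.
# The grouping below is an assumption; shipped weights are per-dimension.
VIRALITY = ("reply_trigger_score", "dwell_time_potential",
            "profile_click_potential", "bookmark_potential",
            "hook_strength", "contrarian_edge")
CONVERSION = ("niche_consistency", "clarity", "sales_relevance")

def composite_score(dims: dict[str, float]) -> float:
    """dims holds the 10 dimensions, each on a 1-10 scale. Returns 0-100."""
    virality = sum(dims[d] for d in VIRALITY) / len(VIRALITY)
    conversion = sum(dims[d] for d in CONVERSION) / len(CONVERSION)
    raw = virality * conversion - dims["annoy_risk"]  # inverted: high = bad
    return max(0.0, min(100.0, raw))                  # clamp into 0-100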

Layer 3 — Hard-block gate

Some patterns are dead-on-arrival regardless of the score above. The gate hard-rejects them with concrete blocker reasons:

  • emoji: Emoji presence flags AI-generated content to most algorithm-aware readers
  • em/en dash: U+2014 / U+2013 — strong AI-generation signal in 2024+
  • AI-tell phrases (25): "in today's fast-paced", "harness the power", "delve into", "as an AI", etc.
  • engagement bait: "RT if you agree", "comment YES for", "like if you" — these throttle reach
  • multi-hashtag: >1 hashtag triggers the algorithm's spam classifier
  • over 280 chars: Twitter's hard limit — the post would truncate or fail
  • link-only: A tweet that's essentially a URL with no body — low intrinsic engagement

For long-form posts (60+ words with body), the gate also fails on missing reply-trigger, weak hook, or 1/10 clarity — these are the composite scorer's veto rights. For single-line thread items, the composite scorer runs informationally only — its dimensions show in the UI for context but don't gate.
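In code, the gate is a short list of cheap checks that return blocker reasons instead of a score. A sketch with truncated pattern lists; the shipped gate carries all 25 AI-tell phrases and a fuller emoji range:

import re

AI_TELLS = ("in today's fast-paced", "harness the power", "delve into")  # 3 of 25
BAIT = ("rt if you agree", "comment yes for", "like if you")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji match

def gate(text: str) -> list[str]:
    """Return concrete blocker reasons; an empty list means the tweet passes."""
    t = text.lower()
    blockers = []
    if EMOJI_RE.search(text):
        blockers.append("emoji")
    if "\u2014" in text or "\u2013" in text:
        blockers.append("em-dash")
    if any(p in t for p in AI_TELLS):
        blockers.append("AI-tell phrase")
    if any(p in t for p in BAIT):
        blockers.append("engagement bait")
    if text.count("#") > 1:                      # crude hashtag count
        blockers.append("multi-hashtag")
    if len(text) > 280:
        blockers.append("over 280 chars")
    if re.fullmatch(r"\s*https?://\S+\s*", text):
        blockers.append("link-only")
    return blockers  # long-form composite vetoes omitted for brevity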

Layer 4 — View-growth curve (Phase 1)

Predicted impressions over time, fitted to a Gompertz function:

V(t) = K · exp(−b · exp(−c·t))

where:

  • K = ceiling impressions (asymptote, predicted from engagement regressor or phoenix-derived estimate)
  • b = horizontal displacement (controls when the inflection happens)
  • c = growth rate (steepness of the rise)

The output is the curve evaluated at +5 min · +30 min · +1 h · +6 h · +24 h, with 80% confidence intervals from Monte Carlo sampling on (b, c). Three headline numbers fall out:

  • t₅₀: Minutes to 50% of K
  • t₉₀: Minutes to 90% of K
  • velocity: views(30 min) / K, the algorithm-friendliness number

Velocity is the headline metric. Twitter's For-You algorithm gates out-of-network distribution on early engagement velocity (likes per impression in roughly the first 10-30 minutes). A velocity ≥ 0.6 means your tweet is predicted to hit most of its ceiling fast — algorithm-friendly. ≤ 0.3 means it's predicted to crawl — likely throttled before it leaves your follower graph.

Today the model runs cold-start — using a generic Gompertz prior that predicts t₅₀ ≈ 29min and velocity ≈ 0.52 for a typical tweet. As your account accumulates ~30+ tweets each with multiple metric snapshots, the model retrains on your distribution and the predictions become account-specific.
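Because the Gompertz curve inverts in closed form, t₅₀, t₉₀, and velocity fall out directly. In the sketch below, K is arbitrary and the values of b and c are assumptions chosen so the cold-start numbers above reproduce:

import numpy as np

K, b, c = 10_000, 5.0, 0.068        # assumed prior; c is per-minute

def views(t_min):
    """Gompertz curve V(t) = K * exp(-b * exp(-c * t))."""
    return K * np.exp(-b * np.exp(-c * t_min))

def t_frac(frac):
    """Invert the curve: minutes to reach a given fraction of K."""
    return np.log(b / -np.log(frac)) / c

t50 = t_frac(0.5)                   # ≈ 29 min, the cold-start t₅₀
t90 = t_frac(0.9)                   # ≈ 57 min
velocity = views(30) / K            # ≈ 0.52, the cold-start velocity

# 80% band at the report timestamps via Monte Carlo jitter on (b, c).
rng = np.random.default_rng(0)
bs = b * rng.lognormal(0.0, 0.2, 2000)[:, None]
cs = c * rng.lognormal(0.0, 0.2, 2000)[:, None]
ts = np.array([5, 30, 60, 360, 1440])           # minutes after posting
lo, hi = np.percentile(K * np.exp(-bs * np.exp(-cs * ts)), [10, 90], axis=0)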

Calibration scorecard — when to stop trusting the score

Every model drifts. We compute a per-action calibration scorecard that tells you when to recalibrate:

  • Pearson r between predicted and observed metrics per action
  • Spearman rank correlation and top-K overlap on winners — does the simulator's top-10 match your actual top-10?
  • MAE on log-engagement in the engagement regressor
  • Leave-one-out (LOO) holdout test on the view-curve model

When Pearson r drops below 0.25 or top-K overlap drops below 50%, the scorecard turns red. That's the signal to either retrain (if you have new data) or stop trusting the score until you do. We do not silently overwrite weights — there's an explicit safety check that refuses calibration runs with insufficient data.
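The scorecard checks are standard statistics; a sketch with the thresholds from the text (function and key names are illustrative):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def scorecard(pred: np.ndarray, obs: np.ndarray, k: int = 10) -> dict:
    """Per-action calibration report over paired predicted/observed metrics."""
    r, _ = pearsonr(pred, obs)
    rho, _ = spearmanr(pred, obs)
    top_pred = set(np.argsort(pred)[-k:])        # indices of predicted top-K
    top_obs = set(np.argsort(obs)[-k:])          # indices of observed top-K
    overlap = len(top_pred & top_obs) / k
    mae_log = float(np.mean(np.abs(np.log1p(pred) - np.log1p(obs))))
    return {
        "pearson_r": r,
        "spearman_rho": rho,
        "topk_overlap": overlap,
        "mae_log_engagement": mae_log,
        "red": r < 0.25 or overlap < 0.50,       # stop trusting the score
    }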

What this is NOT

  • Not a like-count predictor. "Will get 14 likes" is the wrong question. We predict action probabilities and trajectories.
  • Not a writing assistant. The generator exists, but it's for ranking variants — not for replacing your voice. A brand brief steers the output.
  • Not a reverse-engineering of the Twitter algorithm. We model what's public. Topic-cluster amplification, account history weighting, and the For-You ranker's downstream decisions are out of scope (see roadmap Phase 2 & 3).
  • Not a vibe-check from an LLM. Phoenix and composite are deterministic given input. The LLM audience panel is optional — it adds per-archetype reaction signals but is never the primary score.