How TweetSim scores a tweet
A 4-layer scorer with explicit assumptions, an open-source engine, and a calibration loop that tells you when to stop trusting the score.
The honest framing
Predicting tweet performance is a measurement problem with bad data: Twitter's actual ranking algorithm is not fully documented, the public "public_metrics" field excludes impressions on most tiers, and engagement is power-law distributed so averages mislead. We don't pretend to fully model the algorithm — we model the parts that have leaked publicly (the phoenix scoring weights from Twitter's 2023 open-source release), publish the formulas, and let the calibration scorecard catch us when we drift.
Every tweet you score returns four sub-reports. Each is independently interpretable. Each is a number you can disagree with.
Layer 1 — Phoenix-style action scorer
For each of 18 X user actions, the scorer predicts the probability that a viewer will take that action when they see your tweet.
Probabilities come from text features (word count, hook punch, reply-trigger detection, sales markers, AI-tell phrases, link presence). Each probability is multiplied by the action's phoenix weight — the same coefficients leaked from Twitter in 2023, sometimes recalibrated against your own historical posts via the calibration loop.
phoenix_score = Σ (P[action] × weight[action]), normalized to 0-100
A high phoenix score (70+) means the actions Twitter boosts (replies, profile clicks, dwell) outweigh the actions it punishes (not_interested, mute). It's a proxy for "would the For-You ranker push this tweet?" — not for raw like count.
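A minimal sketch of the blend, assuming a simple min-max normalization to 0-100. The action names and weights below are illustrative placeholders, not the actual leaked coefficients, and the real scorer covers 18 actions rather than the five shown:

```python
# Illustrative phoenix-style blend. Weights are placeholders shaped like the
# leaked 2023 coefficients: boosted actions positive, punished actions negative.
WEIGHTS = {
    "reply": 13.5,
    "profile_click": 12.0,
    "good_click_v2": 11.0,     # dwell proxy
    "like": 0.5,
    "not_interested": -74.0,   # punished action
}

def phoenix_score(probs: dict[str, float]) -> float:
    """Weighted sum of action probabilities, squashed to 0-100."""
    raw = sum(p * WEIGHTS[action] for action, p in probs.items())
    # Normalize against the best and worst raw scores reachable with
    # probabilities in [0, 1] -- one simple choice of normalization.
    best = sum(w for w in WEIGHTS.values() if w > 0)
    worst = sum(w for w in WEIGHTS.values() if w < 0)
    return 100.0 * (raw - worst) / (best - worst)
```

Raising the probability of a boosted action (e.g. reply) moves the score up; raising a punished action's probability drags it down, matching the proxy interpretation above.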
Layer 2 — Composite quality scorer
Phoenix predicts algorithm behavior. The composite scorer predicts human behavior across 10 dimensions, each on a 1-10 scale:
| Dimension | What it measures |
| --- | --- |
| niche_consistency | How tightly the post sits in its lane vs. drifting off-topic |
| reply_trigger_score | How likely the post is to invite a reply (questions, contrarian claims) |
| dwell_time_potential | Stop-scroll opening plus line-by-line pull |
| profile_click_potential | Does the post make readers wonder who wrote it? |
| bookmark_potential | Saveable framework, numbered list, or reusable insight |
| clarity | Plain language, one idea, no jargon stacking |
| sales_relevance | Does the post serve the offer without pitch-spamming? |
| annoy_risk | Engagement bait, hype words, AI tells (inverted: high = bad) |
| hook_strength | Punch of the first 8 words |
| contrarian_edge | Does it challenge a common belief vs. restate consensus? |
The composite score is a weighted blend of these dimensions — virality factors × conversion factors − annoy_risk. Tunable per-account in v1; defaults are calibrated against high-engagement public X posts.
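One way to sketch the blend. The grouping of dimensions into virality and conversion factors, and the 0.3 annoy penalty, are assumptions for illustration; the real per-account weights are tunable and not reproduced here:

```python
# Assumed grouping of the ten dimensions; each is scored 1-10.
VIRALITY = ["reply_trigger_score", "dwell_time_potential", "hook_strength",
            "contrarian_edge", "profile_click_potential"]
CONVERSION = ["niche_consistency", "sales_relevance", "bookmark_potential",
              "clarity"]

def composite(dims: dict[str, float]) -> float:
    """virality x conversion - annoy_risk, scaled to 0-100."""
    vir = sum(dims[d] for d in VIRALITY) / (10 * len(VIRALITY))      # 0..1
    conv = sum(dims[d] for d in CONVERSION) / (10 * len(CONVERSION))  # 0..1
    annoy = dims["annoy_risk"] / 10                                   # 0..1, high = bad
    # 0.3 is an illustrative penalty coefficient, not the shipped default.
    return max(0.0, 100 * (vir * conv - 0.3 * annoy))
```

Because the blend is multiplicative across the two factor groups, a post that is viral but off-offer (or on-offer but dull) scores lower than one that is moderately strong on both.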
Layer 3 — Hard-block gate
Some patterns are dead-on-arrival regardless of the score above. The gate hard-rejects them with concrete blocker reasons:
| Blocker | Why it fires |
| --- | --- |
| emoji | Emoji presence flags the post as AI-generated to most algorithm-aware readers |
| em-dash | U+2014 / U+2013, a strong AI-generation signal in 2024+ |
| AI-tell phrases (25) | "in today's fast-paced", "harness the power", "delve into", "as an AI", etc. |
| engagement bait | "RT if you agree", "comment YES for", "like if you": these throttle reach |
| multi-hashtag | More than one hashtag triggers the algorithm's spam classifier |
| over 280 chars | Twitter's hard limit; the tweet would be truncated or rejected |
| link-only | A tweet that is essentially a URL with no body; low intrinsic engagement |
For long-form posts (60+ words with body), the gate also fails on missing reply-trigger, weak hook, or 1/10 clarity — these are the composite scorer's veto rights. For single-line thread items, the composite scorer runs informationally only — its dimensions show in the UI for context but don't gate.
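The blockers above can be sketched as a single pass that collects reasons. The phrase lists here are truncated samples of the 25-phrase AI-tell set, and the emoji character ranges are an approximation; the real lists live in the engine's config:

```python
import re

# Truncated samples, not the full lists the gate ships with.
AI_TELLS = ("in today's fast-paced", "harness the power", "delve into", "as an ai")
BAIT = ("rt if you agree", "comment yes for", "like if you")

def gate(tweet: str) -> list[str]:
    """Return the list of blocker reasons; an empty list means the tweet passes."""
    reasons = []
    low = tweet.lower()
    if re.search(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", tweet):
        reasons.append("emoji")
    if "\u2014" in tweet or "\u2013" in tweet:   # em-dash / en-dash
        reasons.append("em-dash")
    if any(p in low for p in AI_TELLS):
        reasons.append("ai-tell phrase")
    if any(p in low for p in BAIT):
        reasons.append("engagement bait")
    if low.count("#") > 1:
        reasons.append("multi-hashtag")
    if len(tweet) > 280:
        reasons.append("over 280 chars")
    body = re.sub(r"https?://\S+", "", tweet).strip()
    if "http" in low and not body:
        reasons.append("link-only")
    return reasons
```

A tweet can accumulate several reasons at once; the gate reports all of them rather than stopping at the first, which is what makes the blocker list actionable.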
Layer 4 — View-growth curve (Phase 1)
Predicted impressions over time, fitted to a Gompertz function:
V(t) = K · exp(−b · exp(−c·t))
where:
- K = ceiling impressions (asymptote, predicted from engagement regressor or phoenix-derived estimate)
- b = horizontal displacement (controls when the inflection happens)
- c = growth rate (steepness of the rise)
The output is a curve sampled at t+5 min, 30 min, 1 h, 6 h, and 24 h, with 80% confidence intervals from Monte Carlo sampling over (b, c). Three headline numbers fall out:
Velocity is the headline metric. Twitter's For-You algorithm gates out-of-network distribution on early engagement velocity (likes per impression in roughly the first 10-30 minutes). A velocity ≥ 0.6 means your tweet is predicted to hit most of its ceiling fast — algorithm-friendly. ≤ 0.3 means it's predicted to crawl — likely throttled before it leaves your follower graph.
Today the model runs cold-start — using a generic Gompertz prior that predicts t₅₀ ≈ 29min and velocity ≈ 0.52 for a typical tweet. As your account accumulates ~30+ tweets each with multiple metric snapshots, the model retrains on your distribution and the predictions become account-specific.
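The cold-start prior can be sketched with the Gompertz form above. The definition of velocity as the fraction of the ceiling reached at 30 minutes, and the specific (b, c) values, are assumptions chosen here to reproduce the stated prior (t₅₀ ≈ 29 min, velocity ≈ 0.52):

```python
import math

def gompertz(t_min: float, K: float, b: float, c: float) -> float:
    """Predicted impressions t_min minutes after posting: V(t) = K exp(-b exp(-c t))."""
    return K * math.exp(-b * math.exp(-c * t_min))

def t50(b: float, c: float) -> float:
    """Minutes until half the ceiling: solve exp(-b exp(-c t)) = 1/2 for t."""
    return math.log(b / math.log(2)) / c

# Illustrative cold-start parameters (not the shipped prior's exact values).
K, b, c = 5000, 3.0, 0.0505

# Assumed velocity metric: fraction of ceiling reached in the first 30 minutes.
velocity = gompertz(30, K, b, c) / K
```

With these parameters, t50(b, c) lands near 29 minutes and velocity near 0.52, matching the generic prior described above; an account-specific retrain replaces (K, b, c) with values fitted to your own snapshots.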
Calibration scorecard — when to stop trusting the score
Every model drifts. We compute a per-action calibration scorecard that tells you when to recalibrate:
- Pearson r between predicted and observed metrics per action
- Spearman rank overlap on top-K winners — does the simulator's top-10 match your actual top-10?
- MAE on log-engagement in the engagement regressor
- Holdout LOO test on the view-curve model
When Pearson r drops below 0.25 or top-K overlap drops below 50%, the scorecard turns red. That's the signal to either retrain (if you have new data) or stop trusting the score until you do. We do not silently overwrite weights — there's an explicit safety check that refuses calibration runs with insufficient data.
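The two red-flag checks can be sketched in a few lines; the thresholds (r < 0.25, top-K overlap < 50%) are the ones stated above, and the helper names are illustrative:

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between predicted and observed metrics."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes non-constant inputs

def topk_overlap(pred: list[float], obs: list[float], k: int = 10) -> float:
    """Fraction of the predicted top-k items that also appear in the observed top-k."""
    top_pred = set(sorted(range(len(pred)), key=lambda i: -pred[i])[:k])
    top_obs = set(sorted(range(len(obs)), key=lambda i: -obs[i])[:k])
    return len(top_pred & top_obs) / k

def scorecard_red(pred: list[float], obs: list[float], k: int = 10) -> bool:
    """True when either stated threshold is breached: stop trusting the score."""
    return pearson_r(pred, obs) < 0.25 or topk_overlap(pred, obs, k) < 0.5
```

Running this per action (rather than once globally) is what lets the scorecard localize drift: one action's probabilities can go stale while the rest stay calibrated.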
What this is NOT
- Not a like-count predictor. "Will get 14 likes" is the wrong question. We predict action probabilities and trajectories.
- Not a writing assistant. The generator exists, but it's for ranking variants — not for replacing your voice. Brand brief steers the output.
- Not a reverse-engineering of the Twitter algorithm. We model what's public. Topic-cluster amplification, account history weighting, and the For-You ranker's downstream decisions are out of scope (see roadmap Phase 2 & 3).
- Not a vibe-check from an LLM. Phoenix and composite are deterministic given input. The LLM audience panel is optional — it adds per-archetype reaction signals but is never the primary score.