Generative AI for Good: Detecting Misinformation Online
Combining Generative AI, Traditional Machine Learning, and Human-in-the-Loop to detect misinformation online.
Abstract
This project presents a hybrid approach to detecting misinformation in online news articles by analyzing six key factuality factors: Clickbait, Headline-Body Relation, Political Affiliation, Sensationalism, Sentiment Analysis, and Toxicity. We combine traditional machine learning models, generative AI (LLMs), and multi-agent architectures to create a comprehensive news article analysis system. Our approach demonstrates that combining specialized predictive models with LLM-based reasoning can provide nuanced, interpretable assessments of article quality and potential misinformation indicators. The system includes a Streamlit demo application, a Google ADK multi-agent pipeline, and a comprehensive evaluation harness for benchmarking against ground truth data.
Introduction
In an era of information overload, distinguishing credible journalism from misinformation, sensationalism, and clickbait has become increasingly challenging for everyday readers. News consumers face articles that may use emotionally charged language, misleading headlines, or biased framing—all of which can distort public understanding of important issues.
Our project addresses this challenge by building an automated system that evaluates news articles across multiple dimensions of credibility and quality. Rather than making a binary "fake news" determination, we assess articles on six distinct factuality factors:
- Clickbait: Does the headline use sensational or deceptive language to bait clicks?
- Headline-Body Relation: How well does the headline represent the actual content?
- Political Affiliation: Does the article show partisan lean?
- Sensationalism: Does the article use emotional language to evoke strong reactions?
- Sentiment Analysis: What is the overall emotional tone (positive, negative, neutral)?
- Toxicity: Does the language contain hostile, offensive, or dehumanizing content?
This multi-dimensional approach provides readers with actionable insights about article quality without oversimplifying the complex nature of news credibility.
Motivation
The Problem: Information Ecosystem Pollution
Misinformation and low-quality journalism have real-world consequences. They erode trust in institutions, polarize communities, and can influence critical decisions from voting to public health. However, existing fact-checking approaches face significant limitations:
- Binary classifications (true/false) fail to capture the nuanced ways articles can mislead
- Manual fact-checking doesn't scale to the volume of content published daily
- Pure AI approaches often lack transparency and explainability
- Single-factor analysis misses the multifaceted nature of article quality
Our Solution: Multi-Factor Analysis
We hypothesize that analyzing multiple factuality factors provides a richer, more actionable assessment than any single metric. For example, an article might be low on clickbait but high on sensationalism and partisan bias. This combination suggests a different credibility profile than an article that's clickbait-heavy but politically neutral.
Target Users & Impact
Our system serves multiple stakeholders:
- News Consumers: Empowered to make informed decisions about article credibility
- Educators: Can use the tool to teach media literacy
- Content Moderators: Can prioritize manual review of potentially problematic content
- Researchers: Can study patterns in news coverage at scale
- Journalists: Can use self-assessment tools to improve their writing
Why Hybrid AI?
We combine traditional ML models with generative AI because each has complementary strengths:
- Traditional ML: Fast, consistent, trained on domain-specific patterns
- LLMs: Contextually aware, can handle nuance, provide explanations
- Multi-Agent Systems: Enable specialized reasoning for each factor with coordinated output
Our Approach
System Architecture
Our solution consists of three integrated components:
1. Specialized Predictive Models
Each factuality factor is implemented as a standalone model class inheriting from FactualityFactor:
- Clickbait Model: XGBoost classifier trained on headline embeddings from the Kaggle Clickbait Dataset
- Headline-Body Relation Model: Cosine similarity between embedded headline and article body text
- Political Affiliation Model: Logistic regression classifier trained on Gemini embeddings to detect Democratic, Republican, Neutral, or Other stance
- Sensationalism Model: Multinomial classifier over emotional feature embeddings using GoEmotions taxonomy
- Sentiment Analysis: VADER-based sentiment model optimized for news text
- Toxicity Model: RoBERTa-based multi-class classifier (Friendly → Neutral → Rude → Toxic → Super_Toxic)
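Each of the six models inherits from a shared `FactualityFactor` base class. The repository's actual interface is not shown in this writeup, so the sketch below is an assumption: the method name `predict` and the output dict keys (`score`, `confidence`) are illustrative, and the toy clickbait heuristic stands in for the trained classifier.

```python
from abc import ABC, abstractmethod
from typing import Any


class FactualityFactor(ABC):
    """Shared interface for the six factor models (sketch; real class may differ)."""

    name: str = "factuality_factor"

    @abstractmethod
    def predict(self, headline: str, body: str) -> dict[str, Any]:
        """Return a dict with at least a 'score' or 'label' plus a 'confidence'."""


class ClickbaitModel(FactualityFactor):
    name = "clickbait"

    def predict(self, headline: str, body: str) -> dict[str, Any]:
        # Toy keyword heuristic standing in for the trained classifier.
        hooks = ("you won't believe", "shocking", "this one trick")
        score = 90 if any(h in headline.lower() for h in hooks) else 15
        return {"score": score, "confidence": 0.5}
```

Because every factor exposes the same interface, the orchestrator can iterate over a list of `FactualityFactor` instances without knowing which model backs each one.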
2. LLM-Based Evaluation Pipeline
We leverage Large Language Models (Gemini via AI Studio and OpenRouter) to provide:
- Holistic article analysis considering all six factors simultaneously
- Natural language explanations for scores
- Contextual understanding that traditional models may miss
Our prompts implement a Fractal Chain-of-Thought (FCoT) reasoning protocol with Distortion-to-Information Ratio (DIR) calculation and Conservative Inference Minimization (CIM) to reduce overconfident predictions.
3. Multi-Agent System (Google ADK)
Built using Google Agent Development Kit (ADK), our agent system features:
Figure: Agentic architecture for multi-factor factuality analysis. The root agent routes requests (headline / body / text) to factor-specific sub-agents, which call predictive tools and return structured outputs.
Agentic Workflow Design
Design Motivation: Pure LLM prompting suffers from inconsistency, hallucination, and opacity. Our agentic design grounds factor-level reasoning in predictive models, with the LLM acting as a controller and coordinator rather than the sole reasoning engine.
Root Orchestrator Agent serves as the high-level controller:
- Parses user requests and determines required factuality factors
- Invokes appropriate sub-agents sequentially or in parallel
- Aggregates outputs into a single structured JSON response
- Separates decision logic (LLM) from predictive inference (tools)
Factor-Specific Sub-Agents are independent components that:
- Validate structured input via Pydantic schemas
- Call their associated predictive model as a tool
- Return JSON containing scores, labels, and confidence values
This ensures predictions are deterministic and grounded in trained statistical models rather than generated heuristically.
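The sub-agent pattern described above can be sketched as follows. The real system validates input with Pydantic schemas; this sketch uses a standard-library dataclass to stay dependency-free, and the function names (`clickbait_tool`, `clickbait_sub_agent`) and scoring logic are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArticleInput:
    """Structured input schema (dataclass stand-in for the real Pydantic model)."""
    headline: str
    body: str

    def __post_init__(self) -> None:
        if not self.headline.strip() or not self.body.strip():
            raise ValueError("headline and body must be non-empty")


def clickbait_tool(article: ArticleInput) -> dict:
    """Tool wrapper: deterministic model call returning a JSON-style payload."""
    score = 90 if "shocking" in article.headline.lower() else 10
    return {"factor": "clickbait", "score": score, "confidence": 0.8}


def clickbait_sub_agent(raw: dict) -> dict:
    """Validate structured input, call the predictive tool, return structured output."""
    article = ArticleInput(**raw)   # schema validation step
    return clickbait_tool(article)  # grounded prediction, not free-form LLM text
```

The LLM controller only decides *when* to call `clickbait_sub_agent`; the numeric prediction itself always comes from the tool.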
Advantages of Agentic Design:
- Modularity: Each factor can be independently developed and improved
- Interpretability: Outputs are structured and model-backed
- Scalability: New factors can be added as new sub-agents
- Reduced Hallucination: LLM is constrained to tool-calling rather than speculative reasoning
- Experimental Flexibility: Can compare LLM-only, predictive-only, and hybrid configurations
Inference Flow
- Input: Article headline, body text, and optional URL
- Parallel Processing: Each factor's model generates initial predictions
- LLM Enhancement: Gemini provides contextual analysis and scoring
- Score Fusion: Combine model and LLM predictions weighted by confidence
- Output: Factor scores with explanations and overall credibility profile
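The score-fusion step above weights model and LLM predictions by confidence. The exact fusion rule is not specified in this report, so the weighted average below is an assumed scheme:

```python
def fuse_scores(model_score: float, model_conf: float,
                llm_score: float, llm_conf: float) -> float:
    """Confidence-weighted average of model and LLM predictions (assumed scheme)."""
    total = model_conf + llm_conf
    if total == 0:
        # Neither source is confident; fall back to a plain average.
        return (model_score + llm_score) / 2
    return (model_score * model_conf + llm_score * llm_conf) / total
```

A confident model prediction (e.g., confidence 0.9) dominates an uncertain LLM score, and vice versa, which is one simple way to realize the "graceful degradation" property discussed later.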
Key Technical Decisions
- Hybrid Architecture: Balances speed (models) with understanding (LLMs)
- Factor Independence: Each factor is evaluated separately to avoid cascade effects
- Explainability: All scores include reasoning to build user trust
- Modular Design: Easy to update individual models or swap LLM providers
Data Collection
Ground Truth Dataset
We curated a diverse ground truth dataset containing 35 news articles from various sources spanning the political spectrum and topic areas. Articles were manually annotated across all six factuality factors by our team through careful deliberation and consensus.
Dataset Characteristics
Source Diversity
Articles were collected from:
- Mainstream outlets: CNN, Fox News, The Washington Post, AP News, BBC
- Fact-checking sites: PolitiFact, FactCheck.org
- Partisan media: Newsmax, The Daily Beast, Mother Jones
- Satire: The Onion (to test edge cases)
- Alternative media: Infowars, New York Post opinion pieces
Topic Coverage
- Immigration policy and ICE enforcement
- Political figures and governance
- International relations
- Local news (seal pups, theater reviews)
- Sports and entertainment
Annotation Schema
Each article was scored on six dimensions:
- Clickbait: 0-100 scale (0 = factual, 100 = pure clickbait)
- Political Affiliation: Democratic, Republican, Neutral, Other
- Sensationalism: 0-100 scale (0 = objective, 100 = extremely sensational)
- Sentiment Analysis: Positive, Negative, Neutral
- Headline-Body Relation: 0-100 scale (0 = unrelated, 100 = perfect match)
- Toxicity: Friendly, Neutral, Rude, Toxic, Super_Toxic
Data Quality Measures
- Multiple annotators: Team consensus on ambiguous cases
- Clear rubrics: Detailed scoring guidelines for consistency
- Edge case inclusion: Satire and highly biased sources to test boundaries
- Balanced representation: Mix of high/low scores across all factors
Training Data for Models
Clickbait Model
Trained on the Kaggle Clickbait Dataset containing 32,000+ labeled headlines from various news sources and social media.
Political Affiliation Model
Trained using Gemini embeddings on a curated dataset of articles from known partisan sources, labeled by publication bias.
Sensationalism Model
Trained using GoEmotions emotion taxonomy with Gemini embeddings to detect emotionally charged language patterns indicative of sensationalism.
Other Models
- Sentiment: Leverages pre-trained VADER and TextBlob models
- Toxicity: Uses patterns from Perspective API training
- Headline-Body Relation: Purely LLM-based semantic similarity
Model Training
Training Infrastructure
Models were trained using a combination of classical ML frameworks (scikit-learn) and modern LLM APIs (Gemini, OpenRouter). Each factor's model was developed independently as a modular component inheriting from the FactualityFactor base class.
Factor-Specific Training Details
Clickbait Model
- Architecture: Binary classifier (clickbait vs. non-clickbait)
- Features: TF-IDF vectors of headline text
- Model: Logistic Regression / Random Forest
- Training Data: 32K+ labeled headlines from Kaggle
- Performance Focus: High precision to avoid false positives
Political Affiliation Model
- Architecture: Multi-class classifier (Democratic/Republican/Neutral/Other)
- Features: Gemini API embeddings (768-dimensional)
- Model: Gradient Boosting (saved as political_affiliation_gemini.joblib)
- Training: Articles from sources with known editorial slants
- Challenge: Distinguishing neutral reporting from centrist perspective
Sensationalism Model
- Architecture: Binary classifier (sensational vs. non-sensational)
- Features: Emotion distributions from GoEmotions + Gemini embeddings
- Model: Ensemble saved as sensationalism_gemini_goemotions.joblib
- Emotion Focus: Anger, fear, surprise, disgust as sensationalism indicators
- Threshold Tuning: 60+ sensationalism score triggers "Sensational" label
Sentiment Analysis Model
- Architecture: Composite of multiple pre-trained models
- Components: VADER (social media text), TextBlob (general sentiment)
- Output: Averaged polarity scores mapped to Positive/Negative/Neutral
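The polarity-averaging step above can be sketched as a pure function. VADER's compound score and TextBlob's polarity both live in [-1, 1]; the ±0.05 cutoff below is an assumption, mirroring VADER's conventional compound-score thresholds.

```python
def polarity_to_label(vader_compound: float, textblob_polarity: float,
                      threshold: float = 0.05) -> str:
    """Average the two polarity scores and map the result to a sentiment label."""
    avg = (vader_compound + textblob_polarity) / 2
    if avg >= threshold:
        return "Positive"
    if avg <= -threshold:
        return "Negative"
    return "Neutral"
```

In the real pipeline the two inputs would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` and `TextBlob(text).sentiment.polarity` respectively.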
- No Training Required: Uses established lexicons
Toxicity Model
- Architecture: 5-level classifier (Friendly → Super_Toxic)
- Approach: Pattern matching + API-based scoring
- Levels: Gradual escalation from neutral to explicitly harmful content
- Training: Based on Perspective API's toxicity taxonomy
Headline-Body Relation Model
- Architecture: LLM-based semantic similarity
- Method: Cosine similarity of embeddings + LLM contextual scoring
- API: Gemini/OpenRouter for embedding generation
- Output: 0-100 similarity score
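The cosine-similarity component of the relation score can be written with the standard library alone. The linear mapping from [-1, 1] onto [0, 100] is an assumption; the real system additionally blends in LLM contextual scoring.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relation_score(headline_emb: list[float], body_emb: list[float]) -> float:
    """Map cosine similarity from [-1, 1] onto the 0-100 relation scale (assumed mapping)."""
    sim = cosine_similarity(headline_emb, body_emb)
    return round((sim + 1) / 2 * 100, 1)
```

Identical embeddings map to 100 and orthogonal embeddings to 50 under this scheme.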
Training Methodology
- Data Preparation: Text cleaning, tokenization, feature extraction
- Model Selection: Experimented with multiple architectures per factor
- Hyperparameter Tuning: Grid search for optimal configurations
- Cross-Validation: 5-fold CV to prevent overfitting
- Model Serialization: Saved as .joblib or .json files for deployment
Training Artifacts
All trained models are stored in the models/ directory:
- clickbait/clickbait_model.json
- political_affiliation/political_affiliation_gemini.joblib
- sensationalism/sensationalism_gemini_goemotions.joblib
- Party labels: political_affiliation/party_labels.json
Model Versioning & Updates
Models are versioned and can be retrained as new data becomes available. The modular architecture allows updating individual factors without affecting the entire system.
Prompt Engineering
LLM Prompt Design Philosophy
Our prompt engineering strategy focuses on structured reasoning to produce consistent, explainable outputs. We implemented a custom Fractal Chain-of-Thought (FCoT) protocol with dual optimization objectives.
Fractal Chain-of-Thought (FCoT) Protocol
The FCoT protocol guides LLMs through recursive self-correction across six reasoning layers. We developed two versions: FCoT v1 (basic fractal reasoning) and FCoT v2 (dual-objective optimization).
FCoT v2: Dual-Objective Fractal Optimization
FCoT v2 implements two explicit objective functions to improve calibration and reduce overconfident predictions:
Objective 1: Distortion-to-Information Ratio (DIR)
We defined:
DIR = Distortion Units (DU) / Informational Units (IU)
Where:
- Informational Units (IU): Factual, attributed, verifiable statements containing statistics, named entities, or direct attribution
- Distortion Units (DU): Hyperbole, urgency framing, emotional intensifiers, extreme certainty markers
Interpretation: Higher DIR indicates stronger likelihood of clickbait or sensationalism. DIR > 0.75 = high distortion, DIR 0.3–0.75 = moderate, DIR < 0.3 = low distortion.
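Taking DIR as the literal distortion-to-information ratio (DU / IU, an assumption consistent with the thresholds above), the computation and banding can be sketched as:

```python
def dir_score(informational_units: int, distortion_units: int) -> float:
    """Distortion-to-Information Ratio: DU / IU over tagged sentence units."""
    if informational_units == 0:
        # Assumption: with no informational units, the DU count itself
        # serves as the distortion signal (avoids division by zero).
        return float(distortion_units)
    return distortion_units / informational_units


def dir_band(dir_value: float) -> str:
    """Bucket a DIR value using the thresholds stated in the prompt protocol."""
    if dir_value > 0.75:
        return "high distortion"
    if dir_value >= 0.3:
        return "moderate"
    return "low distortion"
```

In the prompt protocol these counts come from the Layer 1 sentence-tagging pass, so the ratio is grounded in explicit linguistic features rather than overall impressions.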
Objective 2: Conservative Inference Minimization (CIM)
CIM explicitly constrains the model to:
- Require consistent partisan framing for political affiliation classification (not just mention of political figures)
- Base toxicity assessments on explicit wording only
- Use ONLY information present in the article text—no external knowledge assumptions
- Minimize overconfident labeling without textual justification
Recursive Reasoning Layers
FCoT v2 requires six internal processing layers:
Layer 1: Local Signal Tagging (Micro-Level)
- Scan each sentence and label as IU / DU / Neutral
- Compute preliminary DIR score
- Only count explicit linguistic features, not subjective interpretation
Layer 2: Local Error Minimization
- Correct for quoted speech being misclassified as author stance
- Distinguish emotional reporting of tragic events from sensationalism
- Reduce false positive distortion tagging
Layer 3: Aperture Expansion (Document-Level)
- Compare headline vs. body distortion density
- Check if emotional intensity is proportionate to content
- Identify balancing facts introduced later in the article
Layer 4: Fractal Consistency Check
- Enforce symmetry: High DIR → higher Clickbait and Sensationalism
- Low DIR + high IU density → higher Headline-Body Relation score
- DIR should NOT automatically determine Political Affiliation or Toxicity
Layer 5: Inter-Agent Reflective Check
- Simulate two internal evaluators: Skeptical Auditor vs. Conservative Baseline Analyst
- Resolve disagreements by choosing the more conservative classification
- Minimize overconfident labeling and ambiguity inflation
Layer 6: Temporal Re-Grounding
- Re-scan the full article before finalizing
- Check for late-stage clarifications or corrections
- Ensure output reflects full-text evaluation, not first-impression anchoring
Conservative Inference Minimization (CIM)
CIM is our second optimization objective, designed to minimize overconfident labeling:
- Classifications must be justified by explicit textual evidence
- No assumptions from external context or world knowledge
- Political affiliation requires consistent partisan framing, not isolated statements
- When uncertain, select the more neutral classification
Factor-Specific Prompts
Clickbait Scoring Rubric
- 0-20: Factual, straightforward, informative
- 21-50: Slightly curiosity-driven but mostly accurate
- 51-80: Strong clickbait (all-caps, "you won't believe")
- 81-100: Pure clickbait, deceptive or extremely sensationalized
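The rubric above maps directly onto score bands; a small helper makes the boundaries explicit (the band names are shortened from the rubric text):

```python
def clickbait_band(score: int) -> str:
    """Map a 0-100 clickbait score onto the rubric's four bands."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score <= 20:
        return "factual"
    if score <= 50:
        return "slightly curiosity-driven"
    if score <= 80:
        return "strong clickbait"
    return "pure clickbait"
```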
Political Affiliation Guidelines
- Must show consistent partisan framing across multiple passages
- Neutral = objective reporting or balanced coverage
- Other = framing that doesn't align with US two-party system
Sensationalism Detection
- Focus on emotional language intended to evoke strong reactions
- Distinguish between reporting on emotional topics vs. using emotional language
- High sensationalism = disproportionate emotion relative to factual content
Prompt Templates
All prompts follow a consistent structure:
- Role Definition: "You are a Fractal Reasoning Agent..."
- Objective Statement: Dual objectives (DIR + CIM)
- Reasoning Protocol: Six-layer FCoT process
- Output Constraints: JSON format, no intermediate reasoning
- Factor-Specific Rubric: Scoring guidelines with examples
Prompt Iteration & Testing
Prompts were refined through iterative testing:
- Version 1: Simple rubric-based scoring (inconsistent results)
- Version 2: Added few-shot examples (improved but still variable)
- Version 3 (Current): FCoT + CIM protocol (significantly more consistent)
Multi-Agent Orchestration Prompt
The orchestrator agent coordinates sub-agents and synthesizes results:
- Receives article input (headline + body)
- Delegates to six specialized agents in parallel
- Combines scores using the combine_scores tool
- Generates natural language summary of findings
- Highlights potential credibility concerns
Metrics
Evaluation Framework
We evaluate our system using a comprehensive testing harness that compares LLM predictions against ground truth annotations across all six factuality factors. The evaluation framework is implemented in evals/ and supports parallel execution for faster benchmarking.
Metrics by Factor Type
Numeric Factors (Clickbait, Sensationalism, Headline-Body Relation)
For factors scored on a 0-100 scale, we compute:
- Mean Absolute Error (MAE): Average absolute difference between prediction and ground truth
- Root Mean Squared Error (RMSE): Square root of mean squared differences (penalizes larger errors)
- Tolerance-based Accuracy: Percentage of predictions within an acceptable threshold (default tolerance = 0.1 on a normalized 0-1 scale, i.e., 10 points on the 0-100 scale)
Formula: MAE = (1/n) Σ |predicted - actual|
Interpretation: Lower MAE/RMSE = better performance. Tolerance-based accuracy measures practical agreement within realistic margins.
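The three numeric metrics above can be implemented with the standard library (the 10-point default tolerance is on the 0-100 scale):

```python
import math


def numeric_metrics(predicted: list[float], actual: list[float],
                    tolerance: float = 10.0) -> dict:
    """MAE, RMSE, and tolerance-based accuracy for 0-100 scale factors."""
    n = len(predicted)
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    within = sum(abs(e) <= tolerance for e in errors) / n
    return {"mae": mae, "rmse": rmse, "tolerance_accuracy": within}
```

Because RMSE squares each error before averaging, a single 40-point miss raises RMSE far more than MAE, which is why the report tracks both.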
Categorical Factors (Political Affiliation, Sentiment, Toxicity)
For categorical predictions, we compute:
- Accuracy: Percentage of exact matches with ground truth
- Weighted F1 Score: Accounts for class imbalance in the ground truth dataset
- Confusion Matrix: Visualizes which categories are confused with each other
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: Weighted F1 handles class imbalance better than macro-average. Confusion matrices reveal systematic biases and misclassification patterns.
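To make the weighted-F1 definition concrete, here is a dependency-free reimplementation (the evaluation harness itself presumably calls a library routine such as scikit-learn's `f1_score(average="weighted")`):

```python
from collections import Counter


def weighted_f1(predicted: list[str], actual: list[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's ground-truth support."""
    classes = set(actual) | set(predicted)
    support = Counter(actual)
    total, score = len(actual), 0.0
    for c in classes:
        tp = sum(p == c and a == c for p, a in zip(predicted, actual))
        fp = sum(p == c and a != c for p, a in zip(predicted, actual))
        fn = sum(p != c and a == c for p, a in zip(predicted, actual))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1
    return score
```

Weighting by support means a rare class (e.g., Super_Toxic) cannot drag the aggregate down as sharply as it would under a macro average.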
Overall System Metrics
- Cross-Factor Accuracy: Average accuracy across all six factors
- Execution Time: Average time per article (important for scalability)
- Error Rate: Percentage of articles that fail to process
- Consistency Score: Agreement between model and LLM predictions
Evaluation Methodology
- Ground Truth Loading: Read annotated CSV from data/ground_truth.csv
- Parallel Processing: Configure number of worker threads (default: 5)
- Article Processing: For each article, generate predictions for all six factors
- Comparison: Match LLM outputs to ground truth using factor-appropriate metrics
- Logging: Save individual results to timestamped CSV in evals/logs/
- Aggregation: Compute summary statistics and append to master log
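The parallel step in the methodology above can be sketched with the standard library's `ThreadPoolExecutor`; threads suit this workload because the per-article cost is dominated by LLM API latency, not CPU. The `evaluate_article` body is a hypothetical stand-in for the real per-article pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_article(article: dict) -> dict:
    """Stand-in for the real pipeline that produces all six factor predictions."""
    return {"id": article["id"], "clickbait": 50}


def run_evaluation(articles: list[dict], workers: int = 5) -> list[dict]:
    """Process articles concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_article, articles))
```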
Evaluation Logging
Every evaluation run produces three artifacts:
- Detailed Log:
evaluation_logs_YYYYMMDD_HHMMSS.csvwith per-article results - Metadata:
metadata/metadata_YYYYMMDD_HHMMSS.jsonwith prompts and parameters - Summary:
metadata/summary_YYYYMMDD_HHMMSS.jsonwith aggregated metrics
Validation & Quality Checks
- CSV Validation: evals/validate_csv.py ensures ground truth format compliance
- Normalization: Handles both 0-1 and 0-100 scales automatically
- Missing Value Handling: Graceful handling of incomplete annotations
- Category Mapping: Normalizes label variations (e.g., "Democrat" → "Democratic")
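The category-mapping step above can be sketched as a lookup table; the alias entries here are illustrative (only "Democrat" → "Democratic" is confirmed by the text), and unknown labels pass through unchanged.

```python
# Illustrative alias table; the real mapping may cover more variants.
LABEL_ALIASES = {
    "democrat": "Democratic", "democratic": "Democratic",
    "republican": "Republican",
    "neutral": "Neutral",
    "super toxic": "Super_Toxic", "super_toxic": "Super_Toxic",
}


def normalize_label(raw: str) -> str:
    """Collapse annotation variants onto canonical category names."""
    key = raw.strip().lower().replace("-", " ")
    return LABEL_ALIASES.get(key, raw.strip())
```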
Baseline Comparisons
We compare our hybrid approach against three baselines:
- Model-Only: Predictions from trained models without LLM enhancement
- LLM-Only: Pure LLM predictions without specialized models
- Random Baseline: Random predictions within valid ranges
Hypothesis: The hybrid approach outperforms either component alone by combining speed (models) with contextual understanding (LLMs).
Limitations & Considerations
- Small Ground Truth: 35 articles limits statistical confidence
- Annotation Subjectivity: Some factors (e.g., sensationalism) have inherent interpretation variance
- API Variability: LLM outputs may vary slightly between runs (temperature > 0)
- Cost-Accuracy Tradeoff: More sophisticated LLMs cost more but may not always improve accuracy
Results & Analysis
Prompting Techniques Evaluation
We systematically compared six prompting methods to measure the impact of prompt architecture on factuality assessment. Rather than treating prompting as a stylistic choice, we treated it as a controlled experimental variable.
Prompting Methods Compared
- Base Prompt: Minimal instructions with factor definitions only
- In-Context Learning (ICL): Two fully structured examples (low-clickbait neutral + high-clickbait sensational)
- Chain-of-Thought + ICL: Structured reasoning steps with examples
- Chain-of-Thought (CoT) only: Structured reasoning without examples
- Fractal Chain-of-Thought v1 (FCoT1): Recursive reasoning with basic constraints
- Fractal Chain-of-Thought v2 (FCoT2): Dual-objective optimization (DIR + CIM)
Overall Accuracy Results
| Prompting Method | Overall Accuracy |
|---|---|
| Base Prompt | 38.57% |
| In-Context Learning (ICL) | 55.21% |
| CoT + ICL | 49.43% |
| CoT (no ICL) | 54.44% |
| Fractal CoT v1 (FCoT1) | 64.17% |
| Fractal CoT v2 (FCoT2) | 60.00% |
Overall accuracy reflects percentage agreement across all six factuality factors.
Key Findings
- FCoT1 achieved the highest overall performance at 64.17%—a 25.6 percentage point improvement over the base prompt
- ICL substantially improved performance from 38.57% to 55.21%, confirming that calibration examples stabilize model outputs
- CoT without examples outperformed CoT+ICL (54.44% vs 49.43%), suggesting reasoning steps can sometimes conflict with example patterns
- FCoT2 showed strong factor-specific performance: 70% clickbait, 70% sensationalism, 80% political affiliation
- Base prompt catastrophically failed on Headline–Body Relation (0%) and Sensationalism (0%)
Factor-Level Performance Observations
⚠️ Toxicity Instability in FCoT2: While FCoT2 improved distortion detection, toxicity accuracy dropped to only 20%. The recursive reasoning layers appear to have introduced an unintended conservatism bias that suppresses explicit insult detection when distortion analysis dominates attention.
⚠️ Sensationalism Collapse Under CoT+ICL: CoT+ICL resulted in only 6.90% accuracy on sensationalism. The model over-weighted example patterns and failed to generalize distortion detection to new linguistic forms.
Confirmation: These findings confirm that prompt structure is architectural, not stylistic. Different prompting methods induce fundamentally different reasoning behaviors and failure patterns.
System Capabilities Demonstrated
Multi-Factor Analysis
Our system successfully analyzes news articles across six distinct dimensions simultaneously. The demo application (demo.py) provides an interactive interface where users can input any article and receive comprehensive factuality assessments.
Model Performance Characteristics
- Clickbait Detection: Model identifies clear patterns like "You won't believe...", question-based hooks, and sensational adjectives
- Headline-Body Relation: LLM-based semantic similarity effectively measures alignment between headline promises and actual content
- Political Affiliation: Classifier detects partisan framing from sources like Fox News (Republican) vs. MSNBC (Democratic)
- Sensationalism: Emotion-based approach identifies emotionally charged language patterns
- Sentiment: Composite model handles both neutral news and opinion pieces
- Toxicity: Five-level classification captures gradations from friendly to extremely toxic
Qualitative Observations
Hybrid Architecture Benefits
- Speed: Traditional models provide instant predictions (< 100ms per factor)
- Context: LLMs add nuanced understanding that models might miss
- Explainability: LLM outputs include reasoning, building user trust
- Robustness: When model fails or is uncertain, LLM provides fallback
Edge Cases & Challenges
- Satire Detection: The Onion articles score high on clickbait and sensationalism (as expected), but the system doesn't explicitly label them as satire—this is a known limitation
- Fact-Checking vs. Factuality: Our system assesses writing style and presentation, not factual accuracy of claims. An article can be low-sensationalism but contain false information
- Context-Dependent Toxicity: Political news may contain quoted toxic language without the article itself being toxic. The FCoT protocol helps distinguish these cases
- Neutral vs. Balanced: Distinguishing truly neutral reporting from articles that present both sides can be challenging for political affiliation scoring
Example Analysis
Article: "Video: 'Gay Batman' Has Meltdown At City Council Meeting Over ICE"
Source: Infowars
- Clickbait: 95/100 (highly sensational keyword-stuffed headline)
- Political Affiliation: Republican (critical of immigration protesters)
- Sensationalism: 95/100 (emotionally charged language: "lunatic leftist", "meltdown")
- Sentiment: Negative (mocking tone toward subject)
- Headline-Body Relation: 75/100 (headline accurately describes event but uses inflammatory framing)
- Toxicity: Super_Toxic (dehumanizing language, inflammatory labels)
Analysis: This example demonstrates how articles can score high across multiple factors simultaneously. The system correctly identifies extreme cases of bias and inflammatory language.
Article: "First La Jolla seal pup of 2026 born"
Source: Fox 5 San Diego (Local News)
- Clickbait: 25/100 (straightforward, factual headline)
- Political Affiliation: Neutral (non-political local interest story)
- Sensationalism: 35/100 (celebratory but not sensational)
- Sentiment: Positive (uplifting news)
- Headline-Body Relation: 100/100 (perfect match)
- Toxicity: Friendly (wholesome content)
Analysis: Demonstrates the system can correctly identify high-quality, objective reporting and differentiate it from problematic content.
Model vs. LLM Comparison
In our hybrid architecture, we observed:
- Strong Agreement: For clear-cut cases (e.g., obvious clickbait), models and LLMs align closely
- LLM Advantages: Better at handling sarcasm, context-dependent language, and complex multi-paragraph structure
- Model Advantages: More consistent across similar inputs, less susceptible to prompt variations
- Disagreements: Usually occur in borderline cases where human annotators also show lower agreement
User Interface & Accessibility
The Streamlit demo application provides:
- Real-time article analysis with progress indicators
- Side-by-side comparison of Model vs. LLM predictions
- Visual presentation of scores with color-coded indicators
- Model status dashboard showing which components are loaded
- Error handling for API failures and timeouts
Scalability & Performance
- Throughput: System can process ~5-10 articles per minute with parallel workers
- Bottleneck: LLM API calls are the limiting factor (rate limits, latency)
- Optimization: Caching model predictions for repeated articles reduces redundant computation
- Cost: Approximately $0.01-0.05 per article for LLM inference depending on article length
Limitations Acknowledged
- Ground Truth Size: 35-article test set limits confidence in quantitative metrics
- Source Diversity: Dataset is heavily weighted toward US political news; may not generalize to other domains
- Temporal Coverage: Articles are from 2026; models may not capture emerging language patterns
- Language: System only supports English-language articles
- Satire Detection: No explicit satire classifier; satirical articles may score as highly problematic
- API Dependency: Requires external API keys (OpenRouter, Gemini) for full functionality
Discussion
Key Contributions
1. Multi-Dimensional Credibility Framework
Unlike binary "fake news" classifiers, our six-factor approach provides a nuanced credibility profile. This aligns with how humans actually assess article quality—we don't just ask "is this fake?" but rather "is it clickbait?", "is it biased?", "is it toxic?"
2. Hybrid Architecture Design
Combining traditional ML models with LLMs demonstrates that specialized models and general-purpose AI can complement each other. This hybrid approach:
- Provides fast, consistent baseline predictions (models)
- Adds contextual understanding and flexibility (LLMs)
- Enables graceful degradation (if one component fails, the other provides coverage)
3. Fractal Chain-of-Thought Prompting
Our FCoT protocol with DIR and CIM optimization demonstrates a systematic approach to prompt engineering that:
- Reduces LLM overconfidence through recursive self-correction
- Provides a principled framework for assessing information density vs. distortion
- Can be adapted to other NLP tasks requiring careful reasoning
4. Multi-Agent Orchestration
The Google ADK implementation shows how specialized agents can collaborate on complex analysis tasks. Each agent focuses on one factuality factor, then a coordinator synthesizes results into a coherent report.
Implications for Misinformation Detection
Beyond Binary Classification
Traditional misinformation detection often frames the problem as binary classification (real/fake). Our work suggests value in dimensional assessment—measuring articles along multiple axes rather than forcing them into categories. This provides actionable insights:
- A highly clickbait headline with objective body content suggests headline issues, not article content problems
- High sensationalism + high toxicity + partisan bias is a strong warning sign
- Low scores across all factors indicate high-quality journalism
Human-in-the-Loop Augmentation
Our system is designed to augment human judgment, not replace it. By providing factor scores with explanations, we empower users to:
- Understand why an article might be problematic
- Make informed decisions about whether to trust the content
- Develop better media literacy skills over time
Documented Failure Modes
Following best practices in ML research, we prioritize documenting failure modes over cherry-picking successes. Understanding when and why the system fails is critical for future improvements.
1. Sensationalism Collapse Under CoT+ICL
Observation: CoT+ICL achieved only 6.90% accuracy on sensationalism—worse than random guessing.
Diagnosis: The model over-weighted example patterns and failed to generalize distortion detection to new linguistic forms. Adding reasoning steps did not guarantee improved classification.
Implication: Structured reasoning must be combined with robust generalization mechanisms, not just imitation of examples.
2. Political Affiliation Overgeneralization
Observation: Multiple prompting methods incorrectly classified articles mentioning Republican figures as having Republican affiliation.
Diagnosis: The model conflated topic presence (named entities) with ideological stance (framing consistency), violating our Conservative Inference Minimization constraint.
Implication: Affiliation detection requires analyzing framing and rhetoric, not just counting entity mentions.
3. Toxicity Suppression in FCoT2
Observation: FCoT2 improved distortion detection (70% clickbait, 70% sensationalism) but toxicity accuracy dropped to 20%.
Diagnosis: The recursive reasoning layers prioritized distortion analysis and introduced excessive conservatism through CIM constraints, causing the model to underdetect explicit insults.
Implication: Multi-objective optimization can create unintended factor trade-offs. Future versions need dynamic attention allocation across factors.
4. Instruction Drift Across Layers
Observation: Earlier versions of fractal prompts exhibited subtle drift between factor definitions across reasoning layers.
Diagnosis: Without explicit cross-factor consistency checks, distortion signals sometimes leaked into political labeling.
Implication: Fractal reasoning requires explicit inter-layer validation to maintain definitional consistency.
5. Calibration Collapse with Base Prompt
Observation: Base prompt achieved 0% on Headline–Body Relation and 0% on Sensationalism.
Diagnosis: Without examples or structured reasoning, the model misinterpreted these factors entirely, treating them as binary rather than scalar.
Implication: Minimal prompting is insufficient for complex multi-factor evaluation tasks.
Lessons Learned
Prompting as an Engineering Variable
One of our central goals was to move beyond "LLM as a black box" and treat prompting as a controlled experimental variable. Our systematic comparison of six methods demonstrates that prompt structure significantly and measurably impacts performance.
Key Insight: Prompting is not stylistic—it is architectural. Different structures induce different reasoning behaviors.
From Prompting to Agent Engineering
The results demonstrate that prompting improvements alone are insufficient. While fractal reasoning improved distortion detection, it introduced trade-offs across other factors (especially toxicity).
Conclusion: This reinforces the need for agent-level architecture improvements rather than continued ad hoc prompt tuning. The next phase prioritizes structured tool-calling, rationalization, and dynamic factor weighting.
Ethical Considerations
Avoiding Censorship
Our tool is designed for information, not censorship. We provide assessments, not removal recommendations. Users decide how to act on the information.
Bias in Training Data
Our political affiliation model is trained on articles from outlets with known leans. This means:
- The model learns current political alignments, which may shift over time
- Emerging political movements may be misclassified as "Other"
- We should regularly retrain as the political landscape evolves
Transparency & Explainability
By providing reasoning alongside scores, we aim for transparency. However:
- LLM reasoning may be post-hoc rationalization rather than true explanation
- Users may over-trust AI assessments if presented authoritatively
- We need to clearly communicate system limitations
Comparison to Related Work
Fact-Checking Systems
Unlike ClaimBuster, FactMata, or Full Fact, we don't verify factual accuracy. Instead, we assess writing style and presentation. These are complementary approaches:
- Fact-checkers verify claims against evidence
- We identify rhetorical techniques that may signal low credibility
Bias Detection Systems
Media bias detectors like AllSides or Ad Fontes Media provide manual ratings. Our political affiliation model automates this, with trade-offs:
- Lacks the editorial judgment of human analysts
- Can process at scale (thousands of articles per day)
- Focuses on individual articles rather than outlet-level ratings
Content Moderation Tools
Perspective API and similar tools focus on single dimensions (toxicity, profanity). Our multi-factor approach is broader, with a trade-off:
- May be less accurate than a specialized tool on any single factor
- Provides a holistic assessment valuable in a news context
Impact & Applications
Educational Use Cases
- Media Literacy Courses: Students can analyze articles and compare their assessments to the system's scores
- Journalism Programs: Budding journalists can test their writing for unintended bias or sensationalism
- Critical Thinking Training: Teaches users to ask multidimensional questions about sources
Platform Integration
- Social Media: Could provide context labels on shared articles
- News Aggregators: Could highlight quality signals to readers
- Browser Extensions: Could provide real-time credibility assessments
Research Applications
- Large-Scale Media Studies: Analyze thousands of articles to identify trends
- Polarization Research: Study how partisan framing correlates with engagement
- Information Operations: Detect coordinated inauthentic behavior patterns
Future Work
Architectural Priorities
1. ReAct and Structured Rationalization
We will implement ReAct-style (Reasoning + Acting) structured rationalization across all agents to combat hallucinated alignment and improve instruction fidelity.
Implementation: Each sub-agent will return:
- Rationale Field: Explicit reasoning trace showing how the prediction was derived
- Confidence Score: Calibrated uncertainty estimate (0.0–1.0)
- "Insufficient Information" Option: Allows agents to abstain when evidence is ambiguous
Expected Benefit: Reduces overconfident predictions and provides interpretable decision trails for debugging and auditing.
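The sub-agent return contract described above might look like the following. The field names and validation are assumptions for illustration; the real ADK agents may use a different schema:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the sub-agent return contract. Field names are assumptions;
# the actual agents may use a different schema.

@dataclass
class FactorPrediction:
    factor: str
    label: Optional[str]    # None means the agent abstained
    rationale: str          # explicit reasoning trace
    confidence: float       # calibrated uncertainty estimate in [0.0, 1.0]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")

    @property
    def abstained(self) -> bool:
        # "Insufficient Information" option: no label when evidence is ambiguous.
        return self.label is None

pred = FactorPrediction("toxicity", None,
                        "Quoted slur appears only inside reported speech.", 0.4)
```

Making abstention a first-class value (rather than forcing a label) is what lets downstream aggregation skip ambiguous factors instead of averaging in a guess.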
2. Agent-as-a-Proxy Refactoring
All statistical models will be wrapped as AgentTools with explicit pre-execution and post-execution validation.
Architecture:
- Pre-Execution Reasoning: Agent validates input format and plans tool invocation strategy
- Tool Invocation: Deterministic call to predictive model (XGBoost, logistic regression, etc.)
- Post-Execution Validation: Agent checks output plausibility and confidence bounds
- Modular Debugging: Each tool call is logged with input/output pairs for reproducibility
Expected Benefit: Separates orchestration logic from predictive inference, enabling engineering-grade system design with debuggable components.
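A minimal sketch of that wrapper, assuming a score-producing model function. `AgentTool` here is a stand-in class following the four steps above, not the actual Google ADK interface:

```python
import logging

# Hypothetical tool wrapper following the pre-execution / invocation /
# post-execution / logging steps above. Not the real ADK AgentTool API.
logging.basicConfig(level=logging.INFO)

class AgentTool:
    def __init__(self, name, model_fn):
        self.name, self.model_fn = name, model_fn

    def __call__(self, text: str) -> float:
        # Pre-execution reasoning: validate input before touching the model.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"{self.name}: expected non-empty article text")
        # Tool invocation: deterministic call to the predictive model.
        score = self.model_fn(text)
        # Post-execution validation: check output plausibility bounds.
        if not 0.0 <= score <= 1.0:
            raise RuntimeError(f"{self.name}: score {score} outside [0, 1]")
        # Modular debugging: log the input/output pair for reproducibility.
        logging.info("%s(%r) -> %.3f", self.name, text[:40], score)
        return score

# A lambda stands in for the trained XGBoost/logistic-regression model.
clickbait_tool = AgentTool("clickbait", lambda t: 0.85)
```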
3. Dynamic Weighted Factuality Scoring
Our prompt experiments revealed uneven factor reliability. Rather than uniform aggregation, we will implement dynamic weighting based on empirical reliability.
Strategy:
- Empirical Reliability Measurement: Track per-factor accuracy on validation set
- Adaptive Weighting: Weight final scores proportionally to factor reliability
- Factor Exclusion Threshold: Remove factors with accuracy below 50% from final aggregation
- User-Configurable Weights: Allow users to prioritize factors most relevant to their use case
Expected Benefit: Prioritizes system stability and accuracy over superficial factor inclusion. Prevents low-performing factors from contaminating overall credibility assessment.
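The strategy above reduces to a small aggregation function. In this sketch, per-factor validation accuracies are assumed to be in [0, 1]; factors below the 50% exclusion threshold are dropped and the remaining weights renormalized. The numbers are illustrative (the sensationalism accuracy echoes the CoT+ICL failure case):

```python
# Dynamic weighted factuality scoring: weight each factor by its empirical
# reliability, excluding factors below the accuracy cutoff.

def weighted_factuality(scores: dict, accuracies: dict, cutoff: float = 0.5) -> float:
    kept = {f: accuracies[f] for f in scores if accuracies[f] >= cutoff}
    if not kept:
        raise ValueError("no factor meets the reliability cutoff")
    total = sum(kept.values())
    # Renormalize surviving weights so they sum to 1.
    return sum(scores[f] * w / total for f, w in kept.items())

scores = {"clickbait": 0.8, "toxicity": 0.2, "sensationalism": 0.6}
accuracies = {"clickbait": 0.9, "toxicity": 0.8, "sensationalism": 0.069}
overall = weighted_factuality(scores, accuracies)  # sensationalism excluded
```

User-configurable weights fit the same function: multiply `accuracies` by user priorities before the cutoff check.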
Model Improvements
Expanded Training Data
- Scale Ground Truth: Expand from 35 to 500+ articles with diverse sources and topics
- Temporal Coverage: Include articles from multiple years to capture language evolution
- Domain Diversity: Add sports, entertainment, science, local news beyond political coverage
- Multi-Annotator Consensus: Use multiple annotators per article to measure agreement and surface ambiguity
Additional Factuality Factors
- Verifiable Claims: Identify factual claims that can be checked against databases
- Source Transparency: Assess whether articles attribute claims to named sources
- Evidence Quality: Evaluate whether claims are supported by data, studies, or expert quotes
- Logical Coherence: Detect logical fallacies or inconsistent reasoning
- Satire Detection: Explicitly classify satirical content to avoid false positives
- Engagement Bait: Identify content designed to maximize shares/comments rather than inform
Model Architecture Enhancements
- Fine-Tuned LLMs: Fine-tune smaller open-source models on our ground truth for faster, cheaper inference
- Ensemble Methods: Combine multiple LLMs (GPT, Claude, Gemini) and aggregate predictions
- Confidence Calibration: Train models to output well-calibrated uncertainty estimates
- Transfer Learning: Test whether models trained on US news generalize to other countries/languages
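Of these, ensemble aggregation is the simplest to sketch. The labels below are hard-coded stand-ins for GPT/Claude/Gemini calls; a production version would also weight votes by each model's calibrated confidence:

```python
from collections import Counter

# Toy ensemble aggregation: each backend emits a label per factor and the
# ensemble takes the majority vote. Votes stand in for real LLM calls.

def majority_vote(predictions: list) -> str:
    label, _count = Counter(predictions).most_common(1)[0]
    return label

votes = ["clickbait", "not_clickbait", "clickbait"]
print(majority_vote(votes))  # → clickbait
```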
System Features
Real-Time Analysis
- Browser Extension: Analyze articles on any website as users browse
- API Endpoint: RESTful API for third-party integrations
- Batch Processing: Efficiently process thousands of articles for research studies
Historical Tracking
- Outlet Profiles: Build aggregate credibility profiles per news source over time
- Trend Analysis: Track how sensationalism or bias changes during election cycles
- A/B Testing: Compare article versions (e.g., print vs. web headlines)
User Customization
- Adjustable Thresholds: Let users set their own sensitivity for each factor
- Factor Weighting: Users prioritize which factors matter most to them
- Personalized Feeds: Recommend articles matching user-defined quality criteria
Evaluation & Validation
Inter-Rater Reliability Studies
- Measure agreement between multiple human annotators on same articles
- Compare human-AI agreement to human-human agreement
- Identify factors with highest/lowest annotation consistency
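Agreement on a single factor can be measured with Cohen's kappa, which corrects raw agreement for chance. A pure-Python sketch for two raters (equivalent to scikit-learn's `cohen_kappa_score`), with toy toxicity annotations:

```python
# Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e), where p_o is
# observed agreement and p_e is agreement expected by chance.

def cohen_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

rater1 = ["toxic", "ok", "ok", "toxic", "ok"]
rater2 = ["toxic", "ok", "toxic", "toxic", "ok"]
kappa = cohen_kappa(rater1, rater2)
```

Computing the same statistic between a human rater and the model's labels gives the human-AI agreement figure to compare against human-human agreement.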
User Studies
- Utility Assessment: Do users find the six-factor breakdown helpful?
- Decision Impact: Does the tool change how users evaluate articles?
- Trust Calibration: Does the tool improve or harm users' trust calibration?
Adversarial Testing
- Test against adversarially crafted articles designed to fool the system
- Red team exercise: Can human writers evade detection?
- Robustness to paraphrasing, synonym substitution, etc.
Multilingual Expansion
- Spanish: High priority given US demographics and Latin American news
- Mandarin: Important for analyzing Chinese state media and diaspora news
- Arabic: Critical for Middle East coverage and misinformation tracking
- Translation Robustness: Test whether models work on translated articles
Integration with Fact-Checking
- Claim Extraction: Identify specific factual claims within articles
- ClaimBuster Integration: Route extracted claims to fact-checking APIs
- Evidence Retrieval: Automatically search for supporting/refuting evidence
- Holistic Scoring: Combine factuality factors with claim verification results
Explainability Improvements
- Highlighted Text: Show which sentences/phrases influenced each factor score
- Contrastive Explanations: "This would score lower if the headline said X instead"
- Factor Interdependencies: Visualize how factors correlate (e.g., high clickbait often comes with high sensationalism)
Deployment & Sustainability
Cost Optimization
- Model Distillation: Train smaller student models from LLM teacher outputs
- Caching Strategy: Store predictions for frequently analyzed articles
- Selective LLM Usage: Only invoke LLM when model is uncertain
Privacy & Security
- Local Processing: Option to run models entirely on-device without API calls
- Data Anonymization: Strip personal information before analysis
- Audit Logging: Track who analyzes what for accountability
Open Source Contributions
- Release trained models under permissive license
- Open-source the evaluation harness for community benchmarking
- Contribute FCoT prompting methodology to research community
- Establish benchmark dataset for factuality detection research
References
Datasets
- Clickbait Dataset: Aman Anand Rai. (2023). Clickbait Dataset. Kaggle. https://www.kaggle.com/datasets/amananandrai/clickbait-dataset
- GoEmotions: Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A Dataset of Fine-Grained Emotions. ACL 2020.
Tools & APIs
- Google ADK: Google Agent Development Kit. Documentation
- OpenRouter: Unified API for multiple LLM providers. https://openrouter.ai/
- Gemini API: Google AI Studio. https://ai.google.dev/
- Streamlit: Web framework for ML demos. https://streamlit.io/
- VADER: Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. ICWSM.
Related Work
- Fake News Detection: Zhou, X., & Zafarani, R. (2020). A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities. ACM Computing Surveys.
- Clickbait Detection: Potthast, M., Köpsel, S., Stein, B., & Hagen, M. (2016). Clickbait Detection. ECIR 2016.
- Political Bias Detection: Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., & Nakov, P. (2018). Predicting Factuality of Reporting and Bias of News Media Sources. EMNLP 2018.
- Toxicity Detection: Perspective API by Jigsaw/Google. https://perspectiveapi.com/
- Chain-of-Thought Prompting: Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
- Multi-Agent Systems: Park, J.S., O'Brien, J.C., Cai, C.J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
Media Literacy Resources
- AllSides: Media bias ratings. https://www.allsides.com/
- Ad Fontes Media: Media Bias Chart. https://adfontesmedia.com/
- News Literacy Project: Educational resources. https://newslit.org/
Repository & Demo
- GitHub Repository: https://github.com/gavmere/capstone_factuality_factors
- Demo Application: Run `streamlit run demo.py` to test the system interactively
- Evaluation Harness: Run `streamlit run evals/app.py` for benchmarking
Team Contributions
- Daniel Birman - Integrated agents, implemented tool calling, handled API optimization
- Gavin Meregillano - Built evaluation harness, parallelization framework, metrics logging
- Selina Wu - Constructed and labeled dataset, validated outputs, prompting method evaluation
Acknowledgments
This project was completed as part of DSC 180B (Data Science Capstone) at UC San Diego. We thank our mentors and peers for feedback throughout development.