Generative AI for Good: Detecting Misinformation Online
Combining Generative AI, Traditional Machine Learning, and Human-in-the-Loop to detect misinformation online.
Abstract
This project presents a hybrid approach to detecting misinformation in online news articles by analyzing six key factuality factors: Clickbait, Headline-Body Relation, Political Affiliation, Sensationalism, Sentiment Analysis, and Toxicity. We combine traditional machine learning models, generative AI (LLMs), and multi-agent architectures to create a comprehensive news article analysis system. Our approach demonstrates that combining specialized predictive models with LLM-based reasoning can provide nuanced, interpretable assessments of article quality and potential misinformation indicators. The system includes a Streamlit demo application, a Google ADK multi-agent pipeline, and a comprehensive evaluation harness for benchmarking against ground truth data.
Introduction
In an era of information overload, distinguishing credible journalism from misinformation, sensationalism, and clickbait has become increasingly challenging for everyday readers. News consumers face articles that may use emotionally charged language, misleading headlines, or biased framing—all of which can distort public understanding of important issues.
Our project addresses this challenge by building an automated system that evaluates news articles across multiple dimensions of credibility and quality. Rather than making a binary "fake news" determination, we assess articles on six distinct factuality factors:
- Clickbait: Does the headline use sensational or deceptive language to bait clicks?
- Headline-Body Relation: How well does the headline represent the actual content?
- Political Affiliation: Does the article show partisan lean?
- Sensationalism: Does the article use emotional language to evoke strong reactions?
- Sentiment Analysis: What is the overall emotional tone (positive, negative, neutral)?
- Toxicity: Does the language contain hostile, offensive, or dehumanizing content?
This multi-dimensional approach provides readers with actionable insights about article quality without oversimplifying the complex nature of news credibility.
Motivation
The Problem: Information Ecosystem Pollution
Misinformation and low-quality journalism have real-world consequences. They erode trust in institutions, polarize communities, and can influence critical decisions from voting to public health. However, existing fact-checking approaches face significant limitations:
- Binary classifications (true/false) fail to capture the nuanced ways articles can mislead
- Manual fact-checking doesn't scale to the volume of content published daily
- Pure AI approaches often lack transparency and explainability
- Single-factor analysis misses the multifaceted nature of article quality
Our Solution: Multi-Factor Analysis
We hypothesize that analyzing multiple factuality factors provides a richer, more actionable assessment than any single metric. For example, an article might be low on clickbait but high on sensationalism and partisan bias. This combination suggests a different credibility profile than an article that's clickbait-heavy but politically neutral.
Target Users & Impact
Our system serves multiple stakeholders:
- News Consumers: Empowered to make informed decisions about article credibility
- Educators: Can use the tool to teach media literacy
- Content Moderators: Can prioritize manual review of potentially problematic content
- Researchers: Can study patterns in news coverage at scale
- Journalists: Can use self-assessment tools to improve their writing
Why Hybrid AI?
We combine traditional ML models with generative AI because each has complementary strengths:
- Traditional ML: Fast, consistent, trained on domain-specific patterns
- LLMs: Contextually aware, can handle nuance, provide explanations
- Multi-Agent Systems: Enable specialized reasoning for each factor with coordinated output
Our Approach
System Architecture
Our solution consists of three integrated components:
1. Specialized Predictive Models
Each factuality factor is implemented as a standalone model class inheriting from FactualityFactor:
- Clickbait Model: XGBoost classifier trained on headline embeddings from the Kaggle Clickbait Dataset
- Headline-Body Relation Model: Cosine similarity between embedded headline and article body text
- Political Affiliation Model: Logistic regression classifier trained on Gemini embeddings to detect Democratic, Republican, Neutral, or Other stance
- Sensationalism Model: Multinomial classifier over emotional feature embeddings using GoEmotions taxonomy
- Sentiment Analysis: VADER-based sentiment model optimized for news text
- Toxicity Model: RoBERTa-based multi-class classifier (Friendly → Neutral → Rude → Toxic → Super_Toxic)
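Each of the six models inherits from a shared `FactualityFactor` base class. The repository's actual interface is not shown in this writeup, so the sketch below is an assumption: the method name `predict` and the output dict keys (`score`, `confidence`) are illustrative, and the toy clickbait heuristic stands in for the trained classifier.

```python
from abc import ABC, abstractmethod
from typing import Any


class FactualityFactor(ABC):
    """Shared interface for the six factor models (sketch; real class may differ)."""

    name: str = "factuality_factor"

    @abstractmethod
    def predict(self, headline: str, body: str) -> dict[str, Any]:
        """Return a dict with at least a 'score' or 'label' plus a 'confidence'."""


class ClickbaitModel(FactualityFactor):
    name = "clickbait"

    def predict(self, headline: str, body: str) -> dict[str, Any]:
        # Toy keyword heuristic standing in for the trained classifier.
        hooks = ("you won't believe", "shocking", "this one trick")
        score = 90 if any(h in headline.lower() for h in hooks) else 15
        return {"score": score, "confidence": 0.5}
```

Because every factor exposes the same interface, the orchestrator can iterate over a list of `FactualityFactor` instances without knowing which model backs each one.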
2. LLM-Based Evaluation Pipeline
We leverage Large Language Models (Gemini via AI Studio and OpenRouter) to provide:
- Holistic article analysis considering all six factors simultaneously
- Natural language explanations for scores
- Contextual understanding that traditional models may miss
Our prompts implement a Fractal Chain-of-Thought (FCoT) reasoning protocol with Distortion-to-Information Ratio (DIR) calculation and Conservative Inference Minimization (CIM) to reduce overconfident predictions.
3. Multi-Agent System (Google ADK)
Built using Google Agent Development Kit (ADK), our agent system features:
Figure: Agentic architecture for multi-factor factuality analysis. The root agent routes requests (headline / body / text) to factor-specific sub-agents, which call predictive tools and return structured outputs.
Agentic Workflow Design
Design Motivation: Pure LLM prompting suffers from inconsistency, hallucination, and opacity. Our agentic design grounds factor-level reasoning in predictive models, with the LLM acting as a controller and coordinator rather than the sole reasoning engine.
Root Orchestrator Agent serves as the high-level controller:
- Parses user requests and determines required factuality factors
- Invokes appropriate sub-agents sequentially or in parallel
- Aggregates outputs into a single structured JSON response
- Separates decision logic (LLM) from predictive inference (tools)
Factor-Specific Sub-Agents are independent components that:
- Validate structured input via Pydantic schemas
- Call their associated predictive model as a tool
- Return JSON containing scores, labels, and confidence values
This ensures predictions are deterministic and grounded in trained statistical models rather than generated heuristically.
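The sub-agent pattern described above can be sketched as follows. The real system validates input with Pydantic schemas; this sketch uses a standard-library dataclass to stay dependency-free, and the function names (`clickbait_tool`, `clickbait_sub_agent`) and scoring logic are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArticleInput:
    """Structured input schema (dataclass stand-in for the real Pydantic model)."""
    headline: str
    body: str

    def __post_init__(self) -> None:
        if not self.headline.strip() or not self.body.strip():
            raise ValueError("headline and body must be non-empty")


def clickbait_tool(article: ArticleInput) -> dict:
    """Tool wrapper: deterministic model call returning a JSON-style payload."""
    score = 90 if "shocking" in article.headline.lower() else 10
    return {"factor": "clickbait", "score": score, "confidence": 0.8}


def clickbait_sub_agent(raw: dict) -> dict:
    """Validate structured input, call the predictive tool, return structured output."""
    article = ArticleInput(**raw)   # schema validation step
    return clickbait_tool(article)  # grounded prediction, not free-form LLM text
```

The LLM controller only decides *when* to call `clickbait_sub_agent`; the numeric prediction itself always comes from the tool.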
Advantages of Agentic Design:
- Modularity: Each factor can be independently developed and improved
- Interpretability: Outputs are structured and model-backed
- Scalability: New factors can be added as new sub-agents
- Reduced Hallucination: LLM is constrained to tool-calling rather than speculative reasoning
- Experimental Flexibility: Can compare LLM-only, predictive-only, and hybrid configurations
Inference Flow
- Input: Article headline, body text, and optional URL
- Parallel Processing: Each factor's model generates initial predictions
- LLM Enhancement: Gemini provides contextual analysis and scoring
- Score Fusion: Combine model and LLM predictions weighted by confidence
- Output: Factor scores with explanations and overall credibility profile
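The score-fusion step above weights model and LLM predictions by confidence. The exact fusion rule is not specified in this report, so the weighted average below is an assumed scheme:

```python
def fuse_scores(model_score: float, model_conf: float,
                llm_score: float, llm_conf: float) -> float:
    """Confidence-weighted average of model and LLM predictions (assumed scheme)."""
    total = model_conf + llm_conf
    if total == 0:
        # Neither source is confident; fall back to a plain average.
        return (model_score + llm_score) / 2
    return (model_score * model_conf + llm_score * llm_conf) / total
```

A confident model prediction (e.g., confidence 0.9) dominates an uncertain LLM score, and vice versa, which is one simple way to realize the "graceful degradation" property discussed later.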
Key Technical Decisions
- Hybrid Architecture: Balances speed (models) with understanding (LLMs)
- Factor Independence: Each factor is evaluated separately to avoid cascade effects
- Explainability: All scores include reasoning to build user trust
- Modular Design: Easy to update individual models or swap LLM providers
Data Collection
Ground Truth Dataset
We curated a diverse ground truth dataset containing 35 news articles from various sources spanning the political spectrum and topic areas. Articles were manually annotated across all six factuality factors by our team through careful deliberation and consensus.
Dataset Characteristics
Source Diversity
Articles were collected from:
- Mainstream outlets: CNN, Fox News, The Washington Post, AP News, BBC
- Fact-checking sites: PolitiFact, FactCheck.org
- Partisan media: Newsmax, The Daily Beast, Mother Jones
- Satire: The Onion (to test edge cases)
- Alternative media: Infowars, New York Post opinion pieces
Topic Coverage
- Immigration policy and ICE enforcement
- Political figures and governance
- International relations
- Local news (seal pups, theater reviews)
- Sports and entertainment
Annotation Schema
Each article was scored on six dimensions:
- Clickbait: 0-100 scale (0 = factual, 100 = pure clickbait)
- Political Affiliation: Democratic, Republican, Neutral, Other
- Sensationalism: 0-100 scale (0 = objective, 100 = extremely sensational)
- Sentiment Analysis: Positive, Negative, Neutral
- Headline-Body Relation: 0-100 scale (0 = unrelated, 100 = perfect match)
- Toxicity: Friendly, Neutral, Rude, Toxic, Super_Toxic
Data Quality Measures
- Multiple annotators: Team consensus on ambiguous cases
- Clear rubrics: Detailed scoring guidelines for consistency
- Edge case inclusion: Satire and highly biased sources to test boundaries
- Balanced representation: Mix of high/low scores across all factors
Training Data for Models
Clickbait Model
Trained on the Kaggle Clickbait Dataset containing 32,000+ labeled headlines from various news sources and social media.
Political Affiliation Model
Trained using Gemini embeddings on a curated dataset of articles from known partisan sources, labeled by publication bias.
Sensationalism Model
Trained using GoEmotions emotion taxonomy with Gemini embeddings to detect emotionally charged language patterns indicative of sensationalism.
Other Models
- Sentiment: Leverages pre-trained VADER and TextBlob models
- Toxicity: Uses patterns from Perspective API training
- Headline-Body Relation: Purely LLM-based semantic similarity
Model Training
Training Infrastructure
Models were trained using a combination of classical ML frameworks (scikit-learn) and modern LLM APIs (Gemini, OpenRouter). Each factor's model was developed independently as a modular component inheriting from the FactualityFactor base class.
Factor-Specific Training Details
Clickbait Model
- Architecture: Binary classifier (clickbait vs. non-clickbait)
- Features: TF-IDF vectors of headline text
- Model: Logistic Regression / Random Forest
- Training Data: 32K+ labeled headlines from Kaggle
- Performance Focus: High precision to avoid false positives
Political Affiliation Model
- Architecture: Multi-class classifier (Democratic/Republican/Neutral/Other)
- Features: Gemini API embeddings (768-dimensional)
- Model: Gradient Boosting (saved as political_affiliation_gemini.joblib)
- Training: Articles from sources with known editorial slants
- Challenge: Distinguishing neutral reporting from centrist perspective
Sensationalism Model
- Architecture: Binary classifier (sensational vs. non-sensational)
- Features: Emotion distributions from GoEmotions + Gemini embeddings
- Model: Ensemble saved as sensationalism_gemini_goemotions.joblib
- Emotion Focus: Anger, fear, surprise, disgust as sensationalism indicators
- Threshold Tuning: 60+ sensationalism score triggers "Sensational" label
Sentiment Analysis Model
- Architecture: Composite of multiple pre-trained models
- Components: VADER (social media text), TextBlob (general sentiment)
- Output: Averaged polarity scores mapped to Positive/Negative/Neutral
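The polarity-averaging step above can be sketched as a pure function. VADER's compound score and TextBlob's polarity both live in [-1, 1]; the ±0.05 cutoff below is an assumption, mirroring VADER's conventional compound-score thresholds.

```python
def polarity_to_label(vader_compound: float, textblob_polarity: float,
                      threshold: float = 0.05) -> str:
    """Average the two polarity scores and map the result to a sentiment label."""
    avg = (vader_compound + textblob_polarity) / 2
    if avg >= threshold:
        return "Positive"
    if avg <= -threshold:
        return "Negative"
    return "Neutral"
```

In the real pipeline the two inputs would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` and `TextBlob(text).sentiment.polarity` respectively.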
- No Training Required: Uses established lexicons
Toxicity Model
- Architecture: 5-level classifier (Friendly → Super_Toxic)
- Approach: Pattern matching + API-based scoring
- Levels: Gradual escalation from neutral to explicitly harmful content
- Training: Based on Perspective API's toxicity taxonomy
Headline-Body Relation Model
- Architecture: LLM-based semantic similarity
- Method: Cosine similarity of embeddings + LLM contextual scoring
- API: Gemini/OpenRouter for embedding generation
- Output: 0-100 similarity score
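The cosine-similarity component of the relation score can be written with the standard library alone. The linear mapping from [-1, 1] onto [0, 100] is an assumption; the real system additionally blends in LLM contextual scoring.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relation_score(headline_emb: list[float], body_emb: list[float]) -> float:
    """Map cosine similarity from [-1, 1] onto the 0-100 relation scale (assumed mapping)."""
    sim = cosine_similarity(headline_emb, body_emb)
    return round((sim + 1) / 2 * 100, 1)
```

Identical embeddings map to 100 and orthogonal embeddings to 50 under this scheme.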
Training Methodology
- Data Preparation: Text cleaning, tokenization, feature extraction
- Model Selection: Experimented with multiple architectures per factor
- Hyperparameter Tuning: Grid search for optimal configurations
- Cross-Validation: 5-fold CV to prevent overfitting
- Model Serialization: Saved as .joblib or .json files for deployment
Training Artifacts
All trained models are stored in the models/ directory:
- clickbait/clickbait_model.json
- political_affiliation/political_affiliation_gemini.joblib
- sensationalism/sensationalism_gemini_goemotions.joblib
- Party labels: political_affiliation/party_labels.json
Model Versioning & Updates
Models are versioned and can be retrained as new data becomes available. The modular architecture allows updating individual factors without affecting the entire system.
Prompt Engineering
LLM Prompt Design Philosophy
Our prompt engineering strategy focuses on structured reasoning to produce consistent, explainable outputs. We implemented a custom Fractal Chain-of-Thought (FCoT) protocol with dual optimization objectives.
Fractal Chain-of-Thought (FCoT) Protocol
The FCoT protocol guides LLMs through recursive self-correction across six reasoning layers. We developed two versions: FCoT v1 (basic fractal reasoning) and FCoT v2 (dual-objective optimization).
FCoT v2: Dual-Objective Fractal Optimization
FCoT v2 implements two explicit objective functions to improve calibration and reduce overconfident predictions:
Objective 1: Distortion-to-Information Ratio (DIR)
We defined:
DIR = Distortion Units (DU) / Informational Units (IU)
Where:
- Informational Units (IU): Factual, attributed, verifiable statements containing statistics, named entities, or direct attribution
- Distortion Units (DU): Hyperbole, urgency framing, emotional intensifiers, extreme certainty markers
Interpretation: Higher DIR indicates stronger likelihood of clickbait or sensationalism. DIR > 0.75 = high distortion, DIR 0.3–0.75 = moderate, DIR < 0.3 = low distortion.
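Taking DIR as the literal distortion-to-information ratio (DU / IU, an assumption consistent with the thresholds above), the computation and banding can be sketched as:

```python
def dir_score(informational_units: int, distortion_units: int) -> float:
    """Distortion-to-Information Ratio: DU / IU over tagged sentence units."""
    if informational_units == 0:
        # Assumption: with no informational units, the DU count itself
        # serves as the distortion signal (avoids division by zero).
        return float(distortion_units)
    return distortion_units / informational_units


def dir_band(dir_value: float) -> str:
    """Bucket a DIR value using the thresholds stated in the prompt protocol."""
    if dir_value > 0.75:
        return "high distortion"
    if dir_value >= 0.3:
        return "moderate"
    return "low distortion"
```

In the prompt protocol these counts come from the Layer 1 sentence-tagging pass, so the ratio is grounded in explicit linguistic features rather than overall impressions.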
Objective 2: Conservative Inference Minimization (CIM)
CIM explicitly constrains the model to:
- Require consistent partisan framing for political affiliation classification (not just mention of political figures)
- Base toxicity assessments on explicit wording only
- Use ONLY information present in the article text—no external knowledge assumptions
- Minimize overconfident labeling without textual justification
Recursive Reasoning Layers
FCoT v2 requires six internal processing layers:
Layer 1: Local Signal Tagging (Micro-Level)
- Scan each sentence and label as IU / DU / Neutral
- Compute preliminary DIR score
- Only count explicit linguistic features, not subjective interpretation
Layer 2: Local Error Minimization
- Correct for quoted speech being misclassified as author stance
- Distinguish emotional reporting of tragic events from sensationalism
- Reduce false positive distortion tagging
Layer 3: Aperture Expansion (Document-Level)
- Compare headline vs. body distortion density
- Check if emotional intensity is proportionate to content
- Identify balancing facts introduced later in the article
Layer 4: Fractal Consistency Check
- Enforce symmetry: High DIR → higher Clickbait and Sensationalism
- Low DIR + high IU density → higher Headline-Body Relation score
- DIR should NOT automatically determine Political Affiliation or Toxicity
Layer 5: Inter-Agent Reflective Check
- Simulate two internal evaluators: Skeptical Auditor vs. Conservative Baseline Analyst
- Resolve disagreements by choosing the more conservative classification
- Minimize overconfident labeling and ambiguity inflation
Layer 6: Temporal Re-Grounding
- Re-scan the full article before finalizing
- Check for late-stage clarifications or corrections
- Ensure output reflects full-text evaluation, not first-impression anchoring
Conservative Inference Minimization (CIM)
CIM is our second optimization objective, designed to minimize overconfident labeling:
- Classifications must be justified by explicit textual evidence
- No assumptions from external context or world knowledge
- Political affiliation requires consistent partisan framing, not isolated statements
- When uncertain, select the more neutral classification
Factor-Specific Prompts
Clickbait Scoring Rubric
- 0-20: Factual, straightforward, informative
- 21-50: Slightly curiosity-driven but mostly accurate
- 51-80: Strong clickbait (all-caps, "you won't believe")
- 81-100: Pure clickbait, deceptive or extremely sensationalized
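The rubric above maps directly onto score bands; a small helper makes the boundaries explicit (the band names are shortened from the rubric text):

```python
def clickbait_band(score: int) -> str:
    """Map a 0-100 clickbait score onto the rubric's four bands."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score <= 20:
        return "factual"
    if score <= 50:
        return "slightly curiosity-driven"
    if score <= 80:
        return "strong clickbait"
    return "pure clickbait"
```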
Political Affiliation Guidelines
- Must show consistent partisan framing across multiple passages
- Neutral = objective reporting or balanced coverage
- Other = framing that doesn't align with US two-party system
Sensationalism Detection
- Focus on emotional language intended to evoke strong reactions
- Distinguish between reporting on emotional topics vs. using emotional language
- High sensationalism = disproportionate emotion relative to factual content
Prompt Templates
All prompts follow a consistent structure:
- Role Definition: "You are a Fractal Reasoning Agent..."
- Objective Statement: Dual objectives (DIR + CIM)
- Reasoning Protocol: Six-layer FCoT process
- Output Constraints: JSON format, no intermediate reasoning
- Factor-Specific Rubric: Scoring guidelines with examples
Prompt Iteration & Testing
Prompts were refined through iterative testing:
- Version 1: Simple rubric-based scoring (inconsistent results)
- Version 2: Added few-shot examples (improved but still variable)
- Version 3 (Current): FCoT + CIM protocol (significantly more consistent)
Multi-Agent Orchestration Prompt
The orchestrator agent coordinates sub-agents and synthesizes results:
- Receives article input (headline + body)
- Delegates to six specialized agents in parallel
- Combines scores using the combine_scores tool
- Generates natural language summary of findings
- Highlights potential credibility concerns
Metrics
Evaluation Framework
We evaluate our system using a comprehensive testing harness that compares LLM predictions against ground truth annotations across all six factuality factors. The evaluation framework is implemented in evals/ and supports parallel execution for faster benchmarking.
Metrics by Factor Type
Numeric Factors (Clickbait, Sensationalism, Headline-Body Relation)
For factors scored on a 0-100 scale, we compute:
- Mean Absolute Error (MAE): Average absolute difference between prediction and ground truth
- Root Mean Squared Error (RMSE): Square root of mean squared differences (penalizes larger errors)
- Tolerance-based Accuracy: Percentage of predictions within an acceptable threshold (default tolerance = 0.1 on a normalized 0-1 scale, i.e., 10 points on the 0-100 scale)
Formula: MAE = (1/n) Σ |predicted - actual|
Interpretation: Lower MAE/RMSE = better performance. Tolerance-based accuracy measures practical agreement within realistic margins.
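The three numeric metrics above can be implemented with the standard library (the 10-point default tolerance is on the 0-100 scale):

```python
import math


def numeric_metrics(predicted: list[float], actual: list[float],
                    tolerance: float = 10.0) -> dict:
    """MAE, RMSE, and tolerance-based accuracy for 0-100 scale factors."""
    n = len(predicted)
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    within = sum(abs(e) <= tolerance for e in errors) / n
    return {"mae": mae, "rmse": rmse, "tolerance_accuracy": within}
```

Because RMSE squares each error before averaging, a single 40-point miss raises RMSE far more than MAE, which is why the report tracks both.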
Categorical Factors (Political Affiliation, Sentiment, Toxicity)
For categorical predictions, we compute:
- Accuracy: Percentage of exact matches with ground truth
- Weighted F1 Score: Accounts for class imbalance in the ground truth dataset
- Confusion Matrix: Visualizes which categories are confused with each other
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: Weighted F1 handles class imbalance better than macro-average. Confusion matrices reveal systematic biases and misclassification patterns.
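To make the weighted-F1 definition concrete, here is a dependency-free reimplementation (the evaluation harness itself presumably calls a library routine such as scikit-learn's `f1_score(average="weighted")`):

```python
from collections import Counter


def weighted_f1(predicted: list[str], actual: list[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's ground-truth support."""
    classes = set(actual) | set(predicted)
    support = Counter(actual)
    total, score = len(actual), 0.0
    for c in classes:
        tp = sum(p == c and a == c for p, a in zip(predicted, actual))
        fp = sum(p == c and a != c for p, a in zip(predicted, actual))
        fn = sum(p != c and a == c for p, a in zip(predicted, actual))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1
    return score
```

Weighting by support means a rare class (e.g., Super_Toxic) cannot drag the aggregate down as sharply as it would under a macro average.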
Overall System Metrics
- Cross-Factor Accuracy: Average accuracy across all six factors
- Execution Time: Average time per article (important for scalability)
- Error Rate: Percentage of articles that fail to process
- Consistency Score: Agreement between model and LLM predictions
Evaluation Methodology
- Ground Truth Loading: Read annotated CSV from data/ground_truth.csv
- Parallel Processing: Configure number of worker threads (default: 5)
- Article Processing: For each article, generate predictions for all six factors
- Comparison: Match LLM outputs to ground truth using factor-appropriate metrics
- Logging: Save individual results to timestamped CSV in evals/logs/
- Aggregation: Compute summary statistics and append to master log
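The parallel step in the methodology above can be sketched with the standard library's `ThreadPoolExecutor`; threads suit this workload because the per-article cost is dominated by LLM API latency, not CPU. The `evaluate_article` body is a hypothetical stand-in for the real per-article pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_article(article: dict) -> dict:
    """Stand-in for the real pipeline that produces all six factor predictions."""
    return {"id": article["id"], "clickbait": 50}


def run_evaluation(articles: list[dict], workers: int = 5) -> list[dict]:
    """Process articles concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_article, articles))
```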
Evaluation Logging
Every evaluation run produces three artifacts:
- Detailed Log:
evaluation_logs_YYYYMMDD_HHMMSS.csvwith per-article results - Metadata:
metadata/metadata_YYYYMMDD_HHMMSS.jsonwith prompts and parameters - Summary:
metadata/summary_YYYYMMDD_HHMMSS.jsonwith aggregated metrics
Validation & Quality Checks
- CSV Validation: evals/validate_csv.py ensures ground truth format compliance
- Normalization: Handles both 0-1 and 0-100 scales automatically
- Missing Value Handling: Graceful handling of incomplete annotations
- Category Mapping: Normalizes label variations (e.g., "Democrat" → "Democratic")
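The category-mapping step above can be sketched as a lookup table; the alias entries here are illustrative (only "Democrat" → "Democratic" is confirmed by the text), and unknown labels pass through unchanged.

```python
# Illustrative alias table; the real mapping may cover more variants.
LABEL_ALIASES = {
    "democrat": "Democratic", "democratic": "Democratic",
    "republican": "Republican",
    "neutral": "Neutral",
    "super toxic": "Super_Toxic", "super_toxic": "Super_Toxic",
}


def normalize_label(raw: str) -> str:
    """Collapse annotation variants onto canonical category names."""
    key = raw.strip().lower().replace("-", " ")
    return LABEL_ALIASES.get(key, raw.strip())
```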
Baseline Comparisons
We compare our hybrid approach against three baselines:
- Model-Only: Predictions from trained models without LLM enhancement
- LLM-Only: Pure LLM predictions without specialized models
- Random Baseline: Random predictions within valid ranges
Hypothesis: The hybrid approach outperforms either component alone by combining speed (models) with contextual understanding (LLMs).
Limitations & Considerations
- Small Ground Truth: 35 articles limits statistical confidence
- Annotation Subjectivity: Some factors (e.g., sensationalism) have inherent interpretation variance
- API Variability: LLM outputs may vary slightly between runs (temperature > 0)
- Cost-Accuracy Tradeoff: More sophisticated LLMs cost more but may not always improve accuracy
Results & Analysis
Prompting Techniques Evaluation
We systematically compared six prompting methods to measure the impact of prompt architecture on factuality assessment. Rather than treating prompting as a stylistic choice, we treated it as a controlled experimental variable.
Prompting Methods Compared
- Base Prompt: Minimal instructions with factor definitions only
- In-Context Learning (ICL): Two fully structured examples (low-clickbait neutral + high-clickbait sensational)
- Chain-of-Thought + ICL: Structured reasoning steps with examples
- Chain-of-Thought (CoT) only: Structured reasoning without examples
- Fractal Chain-of-Thought v1 (FCoT1): Recursive reasoning with basic constraints
- Fractal Chain-of-Thought v2 (FCoT2): Dual-objective optimization (DIR + CIM)
Overall Accuracy Results
| Prompting Method | Overall Accuracy |
|---|---|
| Base Prompt | 38.57% |
| In-Context Learning (ICL) | 55.21% |
| CoT + ICL | 49.43% |
| CoT (no ICL) | 54.44% |
| Fractal CoT v1 (FCoT1) | 64.17% |
| Fractal CoT v2 (FCoT2) | 60.00% |
Overall accuracy reflects percentage agreement across all six factuality factors.
Key Findings
- FCoT1 achieved the highest overall performance at 64.17%—a 25.6 percentage point improvement over the base prompt
- ICL substantially improved performance from 38.57% to 55.21%, confirming that calibration examples stabilize model outputs
- CoT without examples outperformed CoT+ICL (54.44% vs 49.43%), suggesting reasoning steps can sometimes conflict with example patterns
- FCoT2 showed strong factor-specific performance: 70% clickbait, 70% sensationalism, 80% political affiliation
- Base prompt catastrophically failed on Headline–Body Relation (0%) and Sensationalism (0%)
Factor-Level Performance Observations
⚠️ Toxicity Instability in FCoT2: While FCoT2 improved distortion detection, toxicity accuracy dropped to only 20%. The recursive reasoning layers appear to have introduced an unintended conservatism bias that suppresses explicit insult detection when distortion analysis dominates attention.
⚠️ Sensationalism Collapse Under CoT+ICL: CoT+ICL resulted in only 6.90% accuracy on sensationalism. The model over-weighted example patterns and failed to generalize distortion detection to new linguistic forms.
Confirmation: These findings confirm that prompt structure is architectural, not stylistic. Different prompting methods induce fundamentally different reasoning behaviors and failure patterns.
System Capabilities Demonstrated
Multi-Factor Analysis
Our system successfully analyzes news articles across six distinct dimensions simultaneously. The demo application (demo.py) provides an interactive interface where users can input any article and receive comprehensive factuality assessments.
Model Performance Characteristics
- Clickbait Detection: Model identifies clear patterns like "You won't believe...", question-based hooks, and sensational adjectives
- Headline-Body Relation: LLM-based semantic similarity effectively measures alignment between headline promises and actual content
- Political Affiliation: Classifier detects partisan framing from sources like Fox News (Republican) vs. MSNBC (Democratic)
- Sensationalism: Emotion-based approach identifies emotionally charged language patterns
- Sentiment: Composite model handles both neutral news and opinion pieces
- Toxicity: Five-level classification captures gradations from friendly to extremely toxic
Qualitative Observations
Hybrid Architecture Benefits
- Speed: Traditional models provide instant predictions (< 100ms per factor)
- Context: LLMs add nuanced understanding that models might miss
- Explainability: LLM outputs include reasoning, building user trust
- Robustness: When model fails or is uncertain, LLM provides fallback
Edge Cases & Challenges
- Satire Detection: The Onion articles score high on clickbait and sensationalism (as expected), but the system doesn't explicitly label them as satire—this is a known limitation
- Fact-Checking vs. Factuality: Our system assesses writing style and presentation, not factual accuracy of claims. An article can be low-sensationalism but contain false information
- Context-Dependent Toxicity: Political news may contain quoted toxic language without the article itself being toxic. The FCoT protocol helps distinguish these cases
- Neutral vs. Balanced: Distinguishing truly neutral reporting from articles that present both sides can be challenging for political affiliation scoring
Example Analysis
Article: "Video: 'Gay Batman' Has Meltdown At City Council Meeting Over ICE"
Source: Infowars
- Clickbait: 95/100 (highly sensational keyword-stuffed headline)
- Political Affiliation: Republican (critical of immigration protesters)
- Sensationalism: 95/100 (emotionally charged language: "lunatic leftist", "meltdown")
- Sentiment: Negative (mocking tone toward subject)
- Headline-Body Relation: 75/100 (headline accurately describes event but uses inflammatory framing)
- Toxicity: Super_Toxic (dehumanizing language, inflammatory labels)
Analysis: This example demonstrates how articles can score high across multiple factors simultaneously. The system correctly identifies extreme cases of bias and inflammatory language.
Article: "First La Jolla seal pup of 2026 born"
Source: Fox 5 San Diego (Local News)
- Clickbait: 25/100 (straightforward, factual headline)
- Political Affiliation: Neutral (non-political local interest story)
- Sensationalism: 35/100 (celebratory but not sensational)
- Sentiment: Positive (uplifting news)
- Headline-Body Relation: 100/100 (perfect match)
- Toxicity: Friendly (wholesome content)
Analysis: Demonstrates the system can correctly identify high-quality, objective reporting and differentiate it from problematic content.
Model vs. LLM Comparison
In our hybrid architecture, we observed:
- Strong Agreement: For clear-cut cases (e.g., obvious clickbait), models and LLMs align closely
- LLM Advantages: Better at handling sarcasm, context-dependent language, and complex multi-paragraph structure
- Model Advantages: More consistent across similar inputs, less susceptible to prompt variations
- Disagreements: Usually occur in borderline cases where human annotators also show lower agreement
User Interface & Accessibility
The Streamlit demo application provides:
- Real-time article analysis with progress indicators
- Side-by-side comparison of Model vs. LLM predictions
- Visual presentation of scores with color-coded indicators
- Model status dashboard showing which components are loaded
- Error handling for API failures and timeouts
Scalability & Performance
- Throughput: System can process ~5-10 articles per minute with parallel workers
- Bottleneck: LLM API calls are the limiting factor (rate limits, latency)
- Optimization: Caching model predictions for repeated articles reduces redundant computation
- Cost: Approximately $0.01-0.05 per article for LLM inference depending on article length
Limitations Acknowledged
- Ground Truth Size: 35-article test set limits confidence in quantitative metrics
- Source Diversity: Dataset is heavily weighted toward US political news; may not generalize to other domains
- Temporal Coverage: Articles are from 2026; models may not capture emerging language patterns
- Language: System only supports English-language articles
- Satire Detection: No explicit satire classifier; satirical articles may score as highly problematic
- API Dependency: Requires external API keys (OpenRouter, Gemini) for full functionality
Discussion
Key Contributions
1. Multi-Dimensional Credibility Framework
Unlike binary "fake news" classifiers, our six-factor approach provides a nuanced credibility profile. This aligns with how humans actually assess article quality—we don't just ask "is this fake?" but rather "is it clickbait?", "is it biased?", "is it toxic?"
2. Hybrid Architecture Design
Combining traditional ML models with LLMs demonstrates that specialized models and general-purpose AI can complement each other. This hybrid approach:
- Provides fast, consistent baseline predictions (models)
- Adds contextual understanding and flexibility (LLMs)
- Enables graceful degradation (if one component fails, the other provides coverage)
3. Fractal Chain-of-Thought Prompting
Our FCoT protocol with DIR and CIM optimization demonstrates a systematic approach to prompt engineering that:
- Reduces LLM overconfidence through recursive self-correction
- Provides a principled framework for assessing information density vs. distortion
- Can be adapted to other NLP tasks requiring careful reasoning
4. Multi-Agent Orchestration
The Google ADK implementation shows how specialized agents can collaborate on complex analysis tasks. Each agent focuses on one factuality factor, then a coordinator synthesizes results into a coherent report.
Implications for Misinformation Detection
Beyond Binary Classification
Traditional misinformation detection often frames the problem as binary classification (real/fake). Our work suggests value in dimensional assessment—measuring articles along multiple axes rather than forcing them into categories. This provides actionable insights:
- A highly clickbait headline with objective body content suggests headline issues, not article content problems
- High sensationalism + high toxicity + partisan bias is a strong warning sign
- Low scores across all factors indicate high-quality journalism
Human-in-the-Loop Augmentation
Our system is designed to augment human judgment, not replace it. By providing factor scores with explanations, we empower users to:
- Understand why an article might be problematic
- Make informed decisions about whether to trust the content
- Develop better media literacy skills over time
Documented Failure Modes
Following best practices in ML research, we prioritize documenting failure modes over cherry-picking successes. Understanding when and why the system fails is critical for future improvements.
1. Sensationalism Collapse Under CoT+ICL
Observation: CoT+ICL achieved only 6.90% accuracy on sensationalism—worse than random guessing.
Diagnosis: The model over-weighted example patterns and failed to generalize distortion detection to new linguistic forms. Adding reasoning steps did not guarantee improved classification.
Implication: Structured reasoning must be combined with robust generalization mechanisms, not just imitation of examples.
2. Political Affiliation Overgeneralization
Observation: Multiple prompting methods incorrectly classified articles mentioning Republican figures as having Republican affiliation.
Diagnosis: The model conflated topic presence (named entities) with ideological stance (framing consistency), violating our Conservative Inference Minimization constraint.
Implication: Affiliation detection requires analyzing framing and rhetoric, not just counting entity mentions.
3. Toxicity Suppression in FCoT2
Observation: FCoT2 improved distortion detection (70% clickbait, 70% sensationalism) but toxicity accuracy dropped to 20%.
Diagnosis: The recursive reasoning layers prioritized distortion analysis and introduced excessive conservatism through CIM constraints, causing the model to underdetect explicit insults.
Implication: Multi-objective optimization can create unintended factor trade-offs. Future versions need dynamic attention allocation across factors.
4. Instruction Drift Across Layers
Observation: Earlier versions of fractal prompts exhibited subtle drift between factor definitions across reasoning layers.
Diagnosis: Without explicit cross-factor consistency checks, distortion signals sometimes leaked into political labeling.
Implication: Fractal reasoning requires explicit inter-layer validation to maintain definitional consistency.
5. Calibration Collapse with Base Prompt
Observation: Base prompt achieved 0% on Headline–Body Relation and 0% on Sensationalism.
Diagnosis: Without examples or structured reasoning, the model misinterpreted these factors entirely, treating them as binary rather than scalar.
Implication: Minimal prompting is insufficient for complex multi-factor evaluation tasks.
Lessons Learned
Prompting as an Engineering Variable
One of our central goals was to move beyond "LLM as a black box" and treat prompting as a controlled experimental variable. Our systematic comparison of six methods demonstrates that prompt structure significantly and measurably impacts performance.
Key Insight: Prompting is not stylistic—it is architectural. Different structures induce different reasoning behaviors.
From Prompting to Agent Engineering
The results demonstrate that prompting improvements alone are insufficient. While fractal reasoning improved distortion detection, it introduced trade-offs across other factors (especially toxicity).
Conclusion: This reinforces the need for agent-level architecture improvements rather than continued ad hoc prompt tuning. The next phase prioritizes structured tool-calling, rationalization, and dynamic factor weighting.
Ethical Considerations
Avoiding Censorship
Our tool is designed for information, not censorship. We provide assessments, not removal recommendations. Users decide how to act on the information.
Bias in Training Data
Our political affiliation model is trained on articles from outlets with known leans. This means:
- The model learns current political alignments, which may shift over time
- Emerging political movements may be misclassified as "Other"
- We should regularly retrain as the political landscape evolves
Transparency & Explainability
By providing reasoning alongside scores, we aim for transparency. However:
- LLM reasoning may be post-hoc rationalization rather than true explanation
- Users may over-trust AI assessments if presented authoritatively
- We need to clearly communicate system limitations
Comparison to Related Work
Fact-Checking Systems
Unlike ClaimBuster, FactMata, or Full Fact, we don't verify factual accuracy. Instead, we assess writing style and presentation. These are complementary approaches:
- Fact-checkers verify claims against evidence
- We identify rhetorical techniques that may signal low credibility
Bias Detection Systems
Media bias detectors like AllSides or Ad Fontes Media provide manual ratings. Our political affiliation model automates this, with trade-offs:
- Lacks the editorial judgment of human analysts
- Can process at scale (thousands of articles per day)
- Focuses on individual articles rather than outlet-level ratings
Content Moderation Tools
Perspective API and similar tools focus on single dimensions (toxicity, profanity). Our multi-factor approach is broader, with a trade-off:
- May be less accurate than a specialized tool on any single factor
- Provides a holistic assessment valuable in a news context
Impact & Applications
Educational Use Cases
- Media Literacy Courses: Students can analyze articles and compare their assessments to the system's scores
- Journalism Programs: Budding journalists can test their writing for unintended bias or sensationalism
- Critical Thinking Training: Teaches users to ask multidimensional questions about sources
Platform Integration
- Social Media: Could provide context labels on shared articles
- News Aggregators: Could highlight quality signals to readers
- Browser Extensions: Could provide real-time credibility assessments
Research Applications
- Large-Scale Media Studies: Analyze thousands of articles to identify trends
- Polarization Research: Study how partisan framing correlates with engagement
- Information Operations: Detect coordinated inauthentic behavior patterns
Future Work
Architectural Priorities
1. ReAct and Structured Rationalization
We will implement ReAct-style (Reasoning + Acting) structured rationalization across all agents to combat hallucinated alignment and improve instruction fidelity.
Implementation: Each sub-agent will return:
- Rationale Field: Explicit reasoning trace showing how the prediction was derived
- Confidence Score: Calibrated uncertainty estimate (0.0–1.0)
- "Insufficient Information" Option: Allows agents to abstain when evidence is ambiguous
Expected Benefit: Reduces overconfident predictions and provides interpretable decision trails for debugging and auditing.
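The sub-agent return contract described above might look like the following. The field names and validation are assumptions for illustration; the real ADK agents may use a different schema:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the sub-agent return contract. Field names are assumptions;
# the actual agents may use a different schema.

@dataclass
class FactorPrediction:
    factor: str
    label: Optional[str]    # None means the agent abstained
    rationale: str          # explicit reasoning trace
    confidence: float       # calibrated uncertainty estimate in [0.0, 1.0]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")

    @property
    def abstained(self) -> bool:
        # "Insufficient Information" option: no label when evidence is ambiguous.
        return self.label is None

pred = FactorPrediction("toxicity", None,
                        "Quoted slur appears only inside reported speech.", 0.4)
```

Making abstention a first-class value (rather than forcing a label) is what lets downstream aggregation skip ambiguous factors instead of averaging in a guess.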
2. Agent-as-a-Proxy Refactoring
All statistical models will be wrapped as AgentTools with explicit pre-execution and post-execution validation.
Architecture:
- Pre-Execution Reasoning: Agent validates input format and plans tool invocation strategy
- Tool Invocation: Deterministic call to predictive model (XGBoost, logistic regression, etc.)
- Post-Execution Validation: Agent checks output plausibility and confidence bounds
- Modular Debugging: Each tool call is logged with input/output pairs for reproducibility
Expected Benefit: Separates orchestration logic from predictive inference, enabling engineering-grade system design with debuggable components.
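A minimal sketch of that wrapper, assuming a score-producing model function. `AgentTool` here is a stand-in class following the four steps above, not the actual Google ADK interface:

```python
import logging

# Hypothetical tool wrapper following the pre-execution / invocation /
# post-execution / logging steps above. Not the real ADK AgentTool API.
logging.basicConfig(level=logging.INFO)

class AgentTool:
    def __init__(self, name, model_fn):
        self.name, self.model_fn = name, model_fn

    def __call__(self, text: str) -> float:
        # Pre-execution reasoning: validate input before touching the model.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"{self.name}: expected non-empty article text")
        # Tool invocation: deterministic call to the predictive model.
        score = self.model_fn(text)
        # Post-execution validation: check output plausibility bounds.
        if not 0.0 <= score <= 1.0:
            raise RuntimeError(f"{self.name}: score {score} outside [0, 1]")
        # Modular debugging: log the input/output pair for reproducibility.
        logging.info("%s(%r) -> %.3f", self.name, text[:40], score)
        return score

# A lambda stands in for the trained XGBoost/logistic-regression model.
clickbait_tool = AgentTool("clickbait", lambda t: 0.85)
```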
3. Dynamic Weighted Factuality Scoring
Our prompt experiments revealed uneven factor reliability. Rather than uniform aggregation, we will implement dynamic weighting based on empirical reliability.
Strategy:
- Empirical Reliability Measurement: Track per-factor accuracy on validation set
- Adaptive Weighting: Weight final scores proportionally to factor reliability
- Factor Exclusion Threshold: Remove factors with accuracy below 50% from final aggregation
- User-Configurable Weights: Allow users to prioritize factors most relevant to their use case
Expected Benefit: Prioritizes system stability and accuracy over superficial factor inclusion. Prevents low-performing factors from contaminating overall credibility assessment.
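The strategy above reduces to a small aggregation function. In this sketch, per-factor validation accuracies are assumed to be in [0, 1]; factors below the 50% exclusion threshold are dropped and the remaining weights renormalized. The numbers are illustrative (the sensationalism accuracy echoes the CoT+ICL failure case):

```python
# Dynamic weighted factuality scoring: weight each factor by its empirical
# reliability, excluding factors below the accuracy cutoff.

def weighted_factuality(scores: dict, accuracies: dict, cutoff: float = 0.5) -> float:
    kept = {f: accuracies[f] for f in scores if accuracies[f] >= cutoff}
    if not kept:
        raise ValueError("no factor meets the reliability cutoff")
    total = sum(kept.values())
    # Renormalize surviving weights so they sum to 1.
    return sum(scores[f] * w / total for f, w in kept.items())

scores = {"clickbait": 0.8, "toxicity": 0.2, "sensationalism": 0.6}
accuracies = {"clickbait": 0.9, "toxicity": 0.8, "sensationalism": 0.069}
overall = weighted_factuality(scores, accuracies)  # sensationalism excluded
```

User-configurable weights fit the same function: multiply `accuracies` by user priorities before the cutoff check.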
Model Improvements
Expanded Training Data
- Scale Ground Truth: Expand from 35 to 500+ articles with diverse sources and topics
- Temporal Coverage: Include articles from multiple years to capture language evolution
- Domain Diversity: Add sports, entertainment, science, local news beyond political coverage
- Multi-Annotator Consensus: Use multiple annotators per article to measure agreement and surface ambiguity
Additional Factuality Factors
- Verifiable Claims: Identify factual claims that can be checked against databases
- Source Transparency: Assess whether articles attribute claims to named sources
- Evidence Quality: Evaluate whether claims are supported by data, studies, or expert quotes
- Logical Coherence: Detect logical fallacies or inconsistent reasoning
- Satire Detection: Explicitly classify satirical content to avoid false positives
- Engagement Bait: Identify content designed to maximize shares/comments rather than inform
Model Architecture Enhancements
- Fine-Tuned LLMs: Fine-tune smaller open-source models on our ground truth for faster, cheaper inference
- Ensemble Methods: Combine multiple LLMs (GPT, Claude, Gemini) and aggregate predictions
- Confidence Calibration: Train models to output well-calibrated uncertainty estimates
- Transfer Learning: Test whether models trained on US news generalize to other countries/languages
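Of these, ensemble aggregation is the simplest to sketch. The labels below are hard-coded stand-ins for GPT/Claude/Gemini calls; a production version would also weight votes by each model's calibrated confidence:

```python
from collections import Counter

# Toy ensemble aggregation: each backend emits a label per factor and the
# ensemble takes the majority vote. Votes stand in for real LLM calls.

def majority_vote(predictions: list) -> str:
    label, _count = Counter(predictions).most_common(1)[0]
    return label

votes = ["clickbait", "not_clickbait", "clickbait"]
print(majority_vote(votes))  # → clickbait
```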
System Features
Real-Time Analysis
- Browser Extension: Analyze articles on any website as users browse
- API Endpoint: RESTful API for third-party integrations
- Batch Processing: Efficiently process thousands of articles for research studies
Historical Tracking
- Outlet Profiles: Build aggregate credibility profiles per news source over time
- Trend Analysis: Track how sensationalism or bias changes during election cycles
- A/B Testing: Compare article versions (e.g., print vs. web headlines)
User Customization
- Adjustable Thresholds: Let users set their own sensitivity for each factor
- Factor Weighting: Users prioritize which factors matter most to them
- Personalized Feeds: Recommend articles matching user-defined quality criteria
Evaluation & Validation
Inter-Rater Reliability Studies
- Measure agreement between multiple human annotators on same articles
- Compare human-AI agreement to human-human agreement
- Identify factors with highest/lowest annotation consistency
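Agreement on a single factor can be measured with Cohen's kappa, which corrects raw agreement for chance. A pure-Python sketch for two raters (equivalent to scikit-learn's `cohen_kappa_score`), with toy toxicity annotations:

```python
# Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e), where p_o is
# observed agreement and p_e is agreement expected by chance.

def cohen_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

rater1 = ["toxic", "ok", "ok", "toxic", "ok"]
rater2 = ["toxic", "ok", "toxic", "toxic", "ok"]
kappa = cohen_kappa(rater1, rater2)
```

Computing the same statistic between a human rater and the model's labels gives the human-AI agreement figure to compare against human-human agreement.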
User Studies
- Utility Assessment: Do users find the six-factor breakdown helpful?
- Decision Impact: Does the tool change how users evaluate articles?
- Trust Calibration: Does the tool improve or harm users' trust calibration?
Adversarial Testing
- Test against adversarially crafted articles designed to fool the system
- Red team exercise: Can human writers evade detection?
- Robustness to paraphrasing, synonym substitution, etc.
Multilingual Expansion
- Spanish: High priority given US demographics and Latin American news
- Mandarin: Important for analyzing Chinese state media and diaspora news
- Arabic: Critical for Middle East coverage and misinformation tracking
- Translation Robustness: Test whether models work on translated articles
Integration with Fact-Checking
- Claim Extraction: Identify specific factual claims within articles
- ClaimBuster Integration: Route extracted claims to fact-checking APIs
- Evidence Retrieval: Automatically search for supporting/refuting evidence
- Holistic Scoring: Combine factuality factors with claim verification results
Explainability Improvements
- Highlighted Text: Show which sentences/phrases influenced each factor score
- Contrastive Explanations: "This would score lower if the headline said X instead"
- Factor Interdependencies: Visualize how factors correlate (e.g., high clickbait often comes with high sensationalism)
Deployment & Sustainability
Cost Optimization
- Model Distillation: Train smaller student models from LLM teacher outputs
- Caching Strategy: Store predictions for frequently analyzed articles
- Selective LLM Usage: Only invoke LLM when model is uncertain
Privacy & Security
- Local Processing: Option to run models entirely on-device without API calls
- Data Anonymization: Strip personal information before analysis
- Audit Logging: Track who analyzes what for accountability
Open Source Contributions
- Release trained models under permissive license
- Open-source the evaluation harness for community benchmarking
- Contribute FCoT prompting methodology to research community
- Establish benchmark dataset for factuality detection research
References
Datasets
- Clickbait Dataset: Aman Anand Rai. (2023). Clickbait Dataset. Kaggle. https://www.kaggle.com/datasets/amananandrai/clickbait-dataset
- GoEmotions: Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A Dataset of Fine-Grained Emotions. ACL 2020.
Tools & APIs
- Google ADK: Google Agent Development Kit. Documentation
- OpenRouter: Unified API for multiple LLM providers. https://openrouter.ai/
- Gemini API: Google AI Studio. https://ai.google.dev/
- Streamlit: Web framework for ML demos. https://streamlit.io/
- VADER: Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. ICWSM.
Related Work
- Fake News Detection: Zhou, X., & Zafarani, R. (2020). A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities. ACM Computing Surveys.
- Clickbait Detection: Potthast, M., Köpsel, S., Stein, B., & Hagen, M. (2016). Clickbait Detection. ECIR 2016.
- Political Bias Detection: Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., & Nakov, P. (2018). Predicting Factuality of Reporting and Bias of News Media Sources. EMNLP 2018.
- Toxicity Detection: Perspective API by Jigsaw/Google. https://perspectiveapi.com/
- Chain-of-Thought Prompting: Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
- Multi-Agent Systems: Park, J.S., O'Brien, J.C., Cai, C.J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
Media Literacy Resources
- AllSides: Media bias ratings. https://www.allsides.com/
- Ad Fontes Media: Media Bias Chart. https://adfontesmedia.com/
- News Literacy Project: Educational resources. https://newslit.org/
Repository & Demo
- GitHub Repository: https://github.com/gavmere/capstone_factuality_factors
- Demo Application: Run `streamlit run demo.py` to test the system interactively
- Evaluation Harness: Run `streamlit run evals/app.py` for benchmarking
Team Contributions
- Daniel Birman - Integrated agents, implemented tool calling, handled API optimization
- Gavin Meregillano - Built evaluation harness, parallelization framework, metrics logging
- Selina Wu - Constructed and labeled dataset, validated outputs, prompting method evaluation
Acknowledgments
This project was completed as part of DSC 180B (Data Science Capstone) at UC San Diego. We thank our mentors and peers for feedback throughout development.