Appendix C: Web Audit Suite User Guide
MX-Protocols
January 2026
Complete guide to auditing your website for AI agent compatibility using Web Audit Suite.
Installation
git clone
cd mx-handbook/mx-audit
npm install
Basic Usage
Single Page Audit
# Audit your homepage
npm start -- -s https://example.com -c 10
Full Site Audit
# Audit from sitemap (unlimited pages)
npm start -- -s https://example.com/sitemap.xml -c -1
# Audit specific number of pages
npm start -- -s https://example.com/sitemap.xml -c 50
Complete Audit with All Reports
npm start -- -s https://example.com \
--enable-history \
--generate-dashboard \
--generate-executive-summary
Performance-Optimized Audits
The Web Audit Suite includes production-tested performance optimizations for large sites:
# Custom browser pool and concurrency for large sites
npm start -- -s https://example.com --browser-pool-size 5 --url-concurrency 5
# Expected performance: 100 URLs in ~10 minutes
Performance Features:
- Browser pooling: 97% reduction in browser launch overhead
- Concurrent processing: Multiple URLs analyzed simultaneously
- Adaptive rate limiting: Server-friendly dynamic concurrency
- Cache validation: Automatic staleness checking
Pattern Extraction
Learn from your high-scoring pages to replicate success:
npm start -- -s https://example.com --extract-patterns
What it does:
- Analyzes pages with ≥70 served/rendered score
- Extracts 6 pattern categories with real examples:
- Structured Data (JSON-LD)
- Semantic HTML Structure
- Standard Form Field Naming
- Persistent Error Messages
- Explicit State Attributes
- llms.txt Implementation
- Provides priority (Critical/High) and effort (Low/Moderate) ratings
- Generates pattern_library.md with up to 5 examples per pattern
Use case: Identify what works on your best pages and apply it site-wide.
Regression Detection
Track changes over time with CI/CD-ready regression detection:
npm start -- -s https://example.com --enable-history
What it does:
- Compares current run with baseline (establishes if missing)
- Detects regressions in 5 categories:
- Performance (Critical: >30% increase, Warning: >15%)
- Accessibility (Critical: any error increase)
- SEO (Critical: >10% decrease, Warning: >5%)
- LLM suitability (Served score critical, Rendered score warning)
- URL count (Warning: significant change)
- Generates regression_report.md with severity classifications
- Returns non-zero exit code for critical regressions (CI/CD integration)
Use case: Catch breaking changes before deployment.
Ethical Scraping
The tool respects robots.txt by default:
# Normal audit (respects robots.txt)
npm start -- -s https://example.com
# Force scraping (bypass robots.txt - use with caution)
npm start -- -s https://example.com --force-scrape
robots.txt Compliance:
- Fetches robots.txt before any crawling begins
- Interactive prompts if URLs are blocked
- Runtime force-scrape toggle available
- 100-point quality scoring for robots.txt files
What’s checked:
- AI-specific user agents (GPTBot, ClaudeBot) - 30 pts
- Sitemap references - 20 pts
- Sensitive path protection (admin, cart, account) - 25 pts
- llms.txt references - 15 pts
- Helpful comments - 10 pts
- Completeness - 10 pts
Quality levels:
- Excellent (80+): Professional-grade AI agent guidance
- Good (60-79): Solid foundation, minor improvements needed
- Fair (40-59): Basic compliance, significant gaps
- Poor (<40): Critical issues, immediate action needed
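Putting the scored criteria together, a robots.txt in the Excellent band might look like the following sketch. The specific agent names, paths, and sitemap URL are illustrative placeholders, not requirements of the tool:

```text
# AI-specific user agents (scored: 3+ agents)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Sensitive path protection (scored: 3+ protected paths)
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /account/

# Sitemap declaration (scored)
Sitemap: https://example.com/sitemap.xml

# AI usage policies are documented in /llms.txt (scored reference)
```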
Understanding Your Reports
Core Reports (15 files)
1. LLM General Suitability (llm_general_suitability.csv)
Purpose: Overall AI agent compatibility score
Key Columns:
- url: Page URL
- served_score: Score for served HTML (0-100) - works for ALL agents
- rendered_score: Score for rendered HTML (0-100) - works for browser agents
- overall_score: Weighted average emphasizing served HTML
- issues_found: Number of compatibility issues detected
- schemaDisambiguation: Whether each JSON-LD block has exactly one @type value (Yes/No)
- totalSchemas: Total number of Schema.org JSON-LD blocks found
- schemasWithMultipleTypes: Number of schemas with multiple @type values (ambiguous)
- hasInlineStyles: Whether page contains inline CSS (Yes/No)
- inlineStyleElements: Count of elements with style= attributes
- inlineStyleScripts: Count of inline <style> tags
- externalStylesheets: Count of external stylesheet references
- inlineCSSRatio: Percentage of elements with inline styles
Interpreting Scores:
| Score Range | Category | Meaning | Action Required |
|---|---|---|---|
| 0-39 | Poor | Major issues preventing AI agent compatibility | Immediate action needed |
| 40-59 | Fair | Several essential issues to fix | Systematic improvements required |
| 60-79 | Good | Minor improvements needed | Refinement and optimization |
| 80-100 | Excellent | Works well for all AI agents | Maintain and monitor |
Priority Fixes Based on Score:
- Served score <40: Focus immediately on:
- Add structured data (JSON-LD)
- Make pricing complete and visible
- Ensure state is in HTML attributes
- Fix error message persistence
- Rendered score <60: Focus on:
- Add explicit state attributes
- Implement inline validation
- Add loading state indicators
- Make dynamic content semantic
Schema Type Disambiguation (Chapter 10):
AI agents trained on entertainment scripts (films, TV shows, scripted
dialogue) may confuse professional content with fictional dialogue
without explicit Schema.org type markup. The audit checks that each
JSON-LD block has exactly ONE @type value:
- Proper disambiguation (+5 points): Each schema has single, specific type
- Multiple types penalty (-3 points per violation):
Schema blocks with ["Article", "NewsArticle"] or similar arrays
Common violations:
{
"@type": ["Article", "NewsArticle"], // WRONG - creates ambiguity
"headline": "Legal Analysis of New Legislation"
}
Correct implementation:
{
"@type": "AnalysisNewsArticle", // RIGHT - single specific type
"headline": "Legal Analysis of New Legislation"
}
Use specific types: Legislation, LegalDocument, MedicalScholarlyArticle, AnalysisNewsArticle, TechArticle rather than generic Article or multiple types.
Inline CSS Detection (Chapter 10):
CLI agents (Claude Code, Cline) and server-based agents cannot execute JavaScript or process inline styles. Inline CSS adds noise to the DOM without providing semantic value:
- External-only bonus (+8 points): No inline styles, external stylesheets present
- Inline CSS penalty (-10 × ratio): Based on percentage of elements with inline styles
What counts as inline CSS:
- style= attributes on HTML elements
- <style> tags in document head or body
- Inline scripts that manipulate styles
Recommendation: Move all styling to external CSS files. Use semantic HTML + external stylesheets for maximum agent compatibility.
Dynamic Content Patterns (Chapter 2):
The audit detects timing-dependent content patterns that confuse AI agents by changing or revealing content over time:
- carouselsTotal: Total number of carousels detected
- carouselsInformational: Carousels displaying critical content (product showcases, testimonials, portfolios)
- carouselsDecorative: Carousels for visual enhancement (hero banners, mastheads)
- carouselsWithAttributes: Carousels with proper data-slide-index and aria-label attributes
- autoplayVideos: Count of autoplay video/audio elements
- autoplayWithControls: Autoplay media with pause controls (WCAG 2.2.2 compliant)
- animatedGifs: Count of animated GIF images
- gifsWithAltText: Animated GIFs with alt text descriptions
- hasAnimationLibraries: Presence of animation libraries (Typed.js, TypeIt, GSAP, AOS, Animate.css)
- visualDynamismDetected: Visual content changes detected via screenshot comparison (typewriters, tickers, rotating text)
- visualDynamismUniqueStates: Number of distinct visual states observed across 3 screenshots
- jsDependentPricing: Price information only visible after JavaScript execution (invisible to CLI agents)
Scoring penalties:
- Informational carousels without attributes: -8 per carousel (high severity, hides critical content like product showcases)
- Decorative carousels without attributes: -3 per carousel (medium severity, accessibility issue)
- Autoplay media without controls: -8 per video (WCAG 2.2.2 violation, agent timing instability)
- Animated GIFs without alt text: -3 per GIF (accessibility and agent comprehension issue)
- Animation libraries detected: -2 informational warning (risks content invisibility)
- Visual dynamism detected: -5 points (screenshot comparison revealed changing content)
- JavaScript-dependent pricing: -15 points (critical severity, blocks CLI agent purchase recommendations)
Visual Dynamism Detection:
The audit takes 3 screenshots at random 2-5 second intervals and compares their visual hashes. If the screenshots differ, visual content is changing:
- Typewriter animations: Text that types character-by-character (“AEM UPGRADE SPECIALISTS” → “AEM EXPERTS” → “SECURITY”)
- Rotating headlines: Headlines that cycle through different messages
- Ticker tapes: Scrolling text that moves continuously
- Fade-in sequences: Content that appears or disappears over time
This complements library detection by catching animations implemented with custom JavaScript or CSS that don’t use known libraries.
Common issues:
- Product carousels hide content: Manual advance shows only first slide, auto-advance changes content mid-parse
- Typewriter text invisible: Animated text reveals gradually, served HTML may be empty (detected via screenshot comparison)
- Background video without role: Agents cannot determine if video is decorative or informational
- Autoplay without controls: Violates WCAG 2.2.2, causes agent page instability
- Visual dynamism detected: Content changes over time (rotating text, tickers) causing agents to miss information depending on snapshot timing
Fixes:
- Add data-slide-index attributes to carousel slides with aria-label="Slide N of M"
- Provide static "View all" alternatives for carousel content using <details> elements
- Ensure animated text is fully visible in served HTML before JavaScript enhancement
- Mark background media with data-video-role="decorative" or data-video-role="informational"
- Add pause controls for all autoplay media (required for animations >5 seconds per WCAG 2.2.2)
- Add alt text to all animated GIFs with aria-describedby for longer descriptions
See Chapters 2 and 11 for complete patterns and implementation guidance.
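A minimal carousel marked up along those lines might look like the sketch below. The class names and slide content are illustrative; the data-* and aria-label attributes follow the conventions the audit checks for:

```html
<!-- Each slide declares its position so agents can enumerate content -->
<div class="carousel" data-total-slides="3">
  <div class="slide" data-slide-index="1" aria-label="Slide 1 of 3">Widget A</div>
  <div class="slide" data-slide-index="2" aria-label="Slide 2 of 3">Widget B</div>
  <div class="slide" data-slide-index="3" aria-label="Slide 3 of 3">Widget C</div>
</div>

<!-- Static alternative: all slides readable without interaction or timing -->
<details>
  <summary>View all products</summary>
  <ul>
    <li>Widget A</li>
    <li>Widget B</li>
    <li>Widget C</li>
  </ul>
</details>
```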
2. robots.txt Quality Report (robots_txt_quality.csv)
Purpose: Evaluates robots.txt file for AI agent readiness
Key Columns:
- score: Overall quality (0-100)
- has_ai_user_agents: Declares AI bot user agents (boolean)
- ai_user_agent_count: Number of AI agents declared
- has_sitemap: Includes sitemap declaration (boolean)
- has_sensitive_path_protection: Protects admin/account paths (boolean)
- protected_path_count: Number of protected paths
- has_llms_txt_reference: References llms.txt file (boolean)
- has_helpful_comments: Includes explanatory comments (boolean)
Scoring Breakdown:
| Component | Max Points | Criteria |
|---|---|---|
| AI User Agents | 30 | 3+ agents = 30pts, 1-2 agents = 15pts, 0 agents = 0pts |
| Sitemap Declaration | 20 | Present = 20pts, Missing = 0pts |
| Path Protection | 25 | 3+ paths = 25pts, 1-2 paths = 15pts, 0 paths = 0pts |
| llms.txt Reference | 15 | Present = 15pts, Missing = 0pts |
| Helpful Comments | 10 | 3+ comments = 10pts, 1-2 = 5pts, 0 = 0pts |
Priority Fixes:
- Score <50: Add sitemap declaration and 2-3 AI user agents immediately
- Score 50-70: Add protected paths and llms.txt reference
- Score 70-85: Add more AI user agents and helpful comments
- Score 85-100: Maintain and monitor
3. llms.txt Quality Report (llms_txt_quality.csv)
Purpose: Evaluates llms.txt file quality
Key Columns:
- score: Overall quality (0-105, includes bonuses)
- has_title: Site title present (boolean)
- has_description: Description present (boolean)
- has_contact: Contact information present (boolean)
- section_count: Number of major sections
- has_access_guidelines: Access policies declared (boolean)
- has_rate_limits: Rate limits specified (boolean)
- has_api_info: API information provided (boolean)
Scoring Breakdown:
| Component | Points | Criteria |
|---|---|---|
| Core Elements | 40 | Title (10), Description (10), Contact (10), Last Updated (10) |
| Sections | 30 | 5+ sections (30), 3-4 sections (20), 1-2 sections (10) |
| Content Length | 15 | Substantial content (15), Moderate (10), Minimal (5) |
| External Links | 10 | 3+ links (10), 1-2 links (5), None (0) |
| Specificity | 5 | Detailed policies (5), Basic (3), Generic (0) |
| Bonus Points | 5 | Rate limits, API docs, attribution requirements |
Priority Fixes:
- No file: Create basic llms.txt with title, description, contact
- Score <40: Add access guidelines and rate limits
- Score 40-70: Add API information and external links
- Score 70-90: Increase detail and specificity
- Score 90-105: Comprehensive, maintain and update
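A basic llms.txt that covers the scored components above might look like this sketch. All names, URLs, and policies are placeholders:

```text
# Example Site

> Illustrative llms.txt covering title, description, contact,
> last-updated date, sections, and access guidelines.

Contact: ai-policy@example.com
Last updated: 2026-01-15

## Access Guidelines
- Crawling permitted for indexing and answering user queries
- Attribution required: link back to the source page

## Rate Limits
- Maximum 1 request per second per agent

## Key Resources
- [Documentation](https://example.com/docs)
- [API reference](https://example.com/api)
- [Pricing](https://example.com/pricing)
```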
4. SEO Reports (seo_report.csv, seo_scores.csv)
Purpose: SEO factors that also benefit AI agents
Key Factors:
- Meta descriptions present and adequate length
- Title tags optimised
- Heading hierarchy (H1, H2, H3)
- Image alt text
- Internal linking structure
- Canonical URLs
Agent Relevance:
SEO best practices match agent needs:
- Clear titles help agents understand page purpose
- Structured headings provide content hierarchy
- Alt text makes images interpretable
- Internal links show relationships
5. Accessibility Report (accessibility_report.csv, wcag_report.md)
Purpose: WCAG compliance (benefits agents and humans)
Key Factors:
- ARIA labels and roles
- Form field associations
- Semantic HTML structure
- Keyboard navigation
- Focus management
Agent Benefits:
- ARIA labels provide context
- Semantic structure aids parsing
- Form associations clarify relationships
- Role attributes indicate purpose
6. Image Optimization Report (image_optimization.csv)
Purpose: Image metrics, alt text quality, and compression analysis
Key Fields:
- Page URL
- Image URL
- File Size (KB)
- Dimensions
- Format
- Alt Text
- Alt Text Quality Score
- Is Responsive
- Lazy Loaded
- Compression Level
- Optimization Score
- Recommendations
Agent Benefits:
- Alt text makes images interpretable for agents
- Responsive images indicate mobile-friendly content
- Optimization recommendations improve performance for all users
7. Link Analysis Report (link_analysis.csv)
Purpose: Internal/external link structure and navigation analysis
Key Fields:
- Source URL
- Target URL
- Link Text
- Link Type (internal/external)
- Follow Type (follow/nofollow)
- HTTP Status
- Redirect Chain
- Content Type
- In Navigation
- Link Depth
- Link Quality Score
Agent Benefits:
- Clear link structure aids navigation
- Descriptive link text improves context
- Navigation links help agents understand site structure
8. Content Quality Report (content_quality.csv)
Purpose: Content analysis including freshness, uniqueness, and media richness
Key Fields:
- URL
- Word Count
- Content Freshness Score
- Content Uniqueness Score
- Grammar Score
- Media Richness Score
- Top Keywords
- Overall Content Score
Agent Benefits:
- Fresh content indicates current information
- Unique content reduces confusion with duplicate pages
- Rich media provides additional context when properly marked up
9. Security Report (security_report.csv)
Purpose: Security headers analysis and HTTPS configuration
Key Fields:
- URL
- HTTPS Status
- HSTS Header
- CSP Header
- X-Frame-Options
- X-Content-Type-Options
- Security Score
- Recommendations
Agent Benefits:
- Secure sites build trust with agents
- Security headers indicate professional implementation
- HTTPS required for many agent interactions
Enhanced Reports (Optional)
Executive Summary (executive_summary.md, executive_summary.json)
Generated with:
--generate-executive-summary
Contains:
- High-level overview of site health
- Critical issues requiring immediate attention
- Recommended priorities
- Score trends (if history enabled)
Use for: Stakeholder communication, quick status checks
Dashboard (dashboard.html)
Generated with:
--generate-dashboard
Contains:
- Visual score representations
- Historical trends (if history enabled)
- Comparative analysis
- Actionable recommendations
Use for: Regular monitoring, team reviews, progress tracking
16. robots.txt Quality Report (robots_quality_report.md)
Purpose: Evaluate your robots.txt file for AI agent compatibility
Key Sections:
- Overall Score (0-100): Quality level (Excellent/Good/Fair/Poor)
- Quality Criteria Breakdown: 6 scored categories
- Issues Found: Specific problems detected
- Recommendations: Actionable improvements
- Example robots.txt: Suggested implementation
Interpreting Score:
| Score | Quality Level | Meaning |
|---|---|---|
| 80+ | Excellent | Professional-grade AI agent guidance |
| 60-79 | Good | Solid foundation, minor improvements |
| 40-59 | Fair | Basic compliance, significant gaps |
| <40 | Poor | Critical issues, immediate action needed |
Priority Fixes:
- Missing AI-specific user agents (30 pts): Add GPTBot, ClaudeBot, etc.
- No sitemap references (20 pts): Add Sitemap: directives
- Unprotected sensitive paths (25 pts): Block /admin, /cart, /account
- No llms.txt references (15 pts): Add comments referencing llms.txt
- No helpful comments (10 pts): Explain rules for maintainability
Chapter References:
- Chapter 5: The Content Creator’s Dilemma (robots.txt best practices)
- Chapter 10: Generative Engine Optimization (AI-specific directives)
- Appendix G: Resource Directory (robots.txt examples)
17. Pattern Library Report (pattern_library.md)
Purpose: Learn from your high-scoring pages
Generated when: the --extract-patterns flag is used
Key Sections:
- Methodology: How patterns were extracted
- 6 Pattern Categories: Structured Data, Semantic HTML, Form Patterns, Error Handling, State Management, llms.txt
- Real Examples: Up to 5 examples per pattern from your site
- Implementation Guide: How to apply patterns site-wide
- Validation Tools: Links to validators
Pattern Priority Ratings:
- Critical + Low effort: Implement immediately
- High + Moderate effort: Plan for next sprint
- Critical + Moderate effort: Prioritize over High/Low
Use Cases:
- Replicate success: See what works on your best pages
- Training material: Show developers real examples
- Quality baseline: Establish consistency standards
- Onboarding: Help new team members understand patterns
Chapter References:
- Chapter 10: Generative Engine Optimization (pattern implementation)
- Chapter 11: Designing for Both (universal patterns)
- Appendix E: AI Patterns Quick Reference (pattern catalog)
18. Regression Report (regression_report.md)
Purpose: Detect breaking changes before deployment
Generated when: the --enable-history flag is used
Key Sections:
- Executive Summary: Critical/warning/info counts
- Critical Regressions: Issues requiring immediate attention
- Warning Regressions: Issues to monitor
- Informational Changes: Non-critical updates
- Recommendations: Specific actions to take
Regression Severity:
Critical (Exit code 1 - CI/CD fails):
- Performance: >30% increase in load time/LCP/FCP/CLS
- Accessibility: Any error count increase
- SEO: >10% score decrease
- LLM Suitability (Served): >10% score decrease
Warning (Exit code 0 - CI/CD passes with warnings):
- Performance: >15% increase
- SEO: >5% score decrease
- LLM Suitability (Rendered): >10% score decrease
- URL count: Significant change
Informational:
- Minor improvements or degradations
- Non-critical metric changes
CI/CD Integration:
# In your CI/CD pipeline
npm start -- -s https://staging.example.com --enable-history
# Returns exit code 1 if critical regressions found
# Pipeline fails, preventing deployment
Chapter References:
- Chapter 12: Technical Advice (testing and validation)
- Appendix B: Proven Lessons (regression prevention)
Prioritizing Improvements
Step 1: Run Initial Audit
npm start -- -s https://example.com/sitemap.xml -c -1 \
--enable-history \
--generate-dashboard \
--generate-executive-summary
Step 2: Review Executive Summary
Focus on:
- Overall LLM suitability score
- Critical issues flagged
- Quick wins identified
Step 3: Categorize Issues by Priority
Critical (Fix Immediately):
- Served HTML score <40
- robots.txt score <30
- No structured data
- Errors vanish/non-persistent
- Incomplete pricing
High Priority (Fix This Quarter):
- Served HTML score 40-60
- robots.txt score 30-60
- Limited structured data
- Inconsistent state attributes
- Complex form validation issues
Medium Priority (Fix This Year):
- Rendered HTML score <60
- llms.txt missing or basic
- SEO issues affecting discoverability
- Accessibility improvements needed
Low Priority (Ongoing):
- Score optimisation above 80
- Advanced features
- Competitive differentiation
Step 4: Create Action Plan
Based on your scores:
## Action Plan: [Your Site]
**Audit Date:** [Date]
**Overall Score:** [Score]/100
**Priority:** [Critical/High/Medium/Low]
### Immediate Actions (This Week)
1. [Issue from served HTML score]
2. [Issue from robots.txt]
3. [Issue from error persistence]
### Short Term (This Month)
1. [Structured data additions]
2. [State attribute improvements]
3. [Form validation fixes]
### Medium Term (This Quarter)
1. [robots.txt enhancement]
2. [llms.txt creation/improvement]
3. [Comprehensive structured data]
### Long Term (This Year)
1. [Advanced features]
2. [API development]
3. [Monitoring and analytics]
Tracking Progress
Run Monthly Audits
# Monthly audit with history
npm start -- -s https://example.com/sitemap.xml -c -1 \
--enable-history \
--generate-dashboard
View Dashboard
open results/dashboard.html
Dashboard shows:
- Score trends over time
- Issue resolution tracking
- Improvement velocity
- Regression detection
Key Metrics to Monitor
- Overall Score Trend
- Target: Steady upward trend
- Warning: Declining or flat trend
- Issue Count
- Target: Decreasing over time
- Warning: Increasing or stagnant
- Served HTML Score
- Target: >80 within 6 months
- Critical: <40 requires immediate action
- robots.txt Quality
- Target: >70 within 3 months
- Warning: <50 indicates gaps
Common Scenarios
Scenario 1: E-commerce Site (Low Score)
Initial Audit Results:
- Overall: 42/100
- Served: 38/100
- Rendered: 52/100
- robots.txt: 25/100
Action Plan:
Week 1:
- Add JSON-LD structured data to product pages
- Make pricing complete and visible
- Create comprehensive robots.txt
Month 1:
- Fix error message persistence
- Add state attributes to cart
- Implement inline form validation
Quarter 1:
- Create llms.txt file
- Add structured data to all pages
- Implement API endpoints
Expected Results After Quarter 1:
- Overall: 75-80/100
- Served: 70-75/100
- Rendered: 75-80/100
- robots.txt: 80-85/100
Scenario 2: Content Publisher (Medium Score)
Initial Audit Results:
- Overall: 65/100
- Served: 72/100
- Rendered: 61/100
- robots.txt: 55/100
- llms.txt: Missing
Action Plan:
Week 1:
- Create llms.txt with attribution requirements
- Enhance robots.txt with AI user agents
Month 1:
- Add article structured data
- Improve meta descriptions
- Fix heading hierarchy
Quarter 1:
- Optimise content extraction policies
- Implement rate limiting headers
- Add API documentation
Expected Results After Quarter 1:
- Overall: 80-85/100
- Served: 85-88/100
- Rendered: 75-80/100
- robots.txt: 80-85/100
- llms.txt: 75-80/105
Scenario 3: SaaS Application (High Score, Maintenance)
Initial Audit Results:
- Overall: 82/100
- Served: 85/100
- Rendered: 80/100
- robots.txt: 85/100
- llms.txt: 78/105
Action Plan:
Monthly:
- Run audits to detect regressions
- Monitor for new issues
- Update AI user agent list
Quarterly:
- Optimise low-scoring pages
- Review and update llms.txt
- Benchmark against competitors
Annually:
- Comprehensive review
- Implement advanced features
- Update documentation
Maintenance Goals:
- Keep scores above 80
- Detect regressions immediately
- Stay current with standards
Advanced Usage
Agency & Partner Features
The Web Audit Suite includes features specifically for agencies and partners managing multiple client sites.
Bulk Auditing:
# Audit multiple domains from CSV file
npm start -- --bulk prospects.csv \
--agency-name "TechAudit Agency" \
--agency-logo "https://techaudit.com/logo.png" \
--output ./client-audits
Features:
- --bulk <file>: Run audit on multiple domains from a CSV file
  - Input: CSV with a domain column (e.g., domain\nexample.com\nclient2.com)
  - Output: bulk_audit_summary.csv master report
- --agency-name <string>: Agency name for white-labeling reports
  - Replaces "Web Audit Suite" in Dashboard footer/title
- --agency-logo <path>: Path or URL to agency logo
  - Adds logo to Dashboard header
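Spelled out, the CSV input for --bulk is a single domain column with one domain per row. Domain values here are placeholders:

```text
domain
example.com
client2.com
client3.com
```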
Use Cases:
- White-label reports for client delivery
- Prospect analysis for sales pipeline
- Portfolio-wide monitoring for existing clients
- Competitive analysis for market research
Cache Management
The tool maintains a cache to improve performance on repeated audits.
Cache Location: .cache directory
(automatically created)
Cache Format: JSON files
Cache Naming: MD5 hash of URLs
Cache Control Options:
# Use only cached data (skip fetching)
npm start -- --cache-only -o reports-from-cache
# Disable caching entirely
npm start -- --no-cache -s https://example.com
# Clear cache before starting
npm start -- --force-delete-cache -s https://example.com
Cache Staleness Checking:
The tool automatically validates cache freshness using HTTP HEAD requests:
- Checks Last-Modified header on cached pages
- Compares with cache creation time
- Invalidates cache if source is newer
- Falls back to cache if HEAD request fails
Best Practices:
- Clear cache periodically for fresh data
- Use --cache-only for report regeneration without re-crawling
- Use --force-delete-cache when site structure changes significantly
Network Error Handling
The tool includes robust network error handling with automatic retry mechanisms.
Network Error Types Detected:
- DNS failures
- Connection timeouts
- Host unreachable errors
- Browser network errors
- SSL/TLS handshake failures
- Rate limiting errors
- Cloudflare challenges (automatic bypass attempt)
Retry Mechanism:
When a network error occurs:
- The tool pauses and displays error details
- You’re prompted to retry after fixing the issue
- Automatic retry up to 3 times
- You can cancel the operation if needed
Example Network Error Flow:
[ERROR] Network error: Could not connect to example.com
Reason: ETIMEDOUT
Would you like to retry? (yes/no): yes
Retrying connection... (attempt 1/3)
Handling Network Issues:
- Check internet connection before starting
- Use retry mechanism when network errors occur
- Monitor network stability during long runs
- Consider rate limiting for large sites
Language Variant Filtering
By default, the tool skips non-English language variants to avoid duplicate content analysis.
Default Behavior:
- Processed by default: /en, /us
- Skipped by default: /fr, /es, /de, etc.
Override:
# Include all language variants
npm start -- -s https://example.com --include-all-languages
Filtering Applies To:
- URL extraction from sitemaps
- Report generation
- Content analysis
Use Cases:
- Skip filtering (default): Faster audits, focus on primary content
- Include all languages: Multilingual site audits, comprehensive analysis
Performance Optimization Guide
Understanding Performance Features
The Web Audit Suite includes production-tested optimizations that reduce analysis time by 3-5x:
Before optimization: 100 URLs in ~45 minutes
After optimization: 100 URLs in ~10 minutes
Browser Pooling
What it does: Maintains 3 reusable Puppeteer browser instances
Benefits:
- 97% reduction in browser launch overhead
- Eliminates 2-5 second delay per URL
- Automatic restart after 50 pages to prevent memory leaks
Configuration:
# Default (3 browsers)
npm start -- -s https://example.com
# Larger pool for faster analysis
npm start -- -s https://example.com --browser-pool-size 5
# Disable pooling
npm start -- -s https://example.com --browser-pool-size 0
When to adjust:
- Increase (5-7): Large sites (1000+ URLs), powerful hardware
- Decrease (1-2): Limited memory, unstable sites, debugging
- Disable (0): Troubleshooting browser issues
Concurrent URL Processing
What it does: Processes multiple URLs simultaneously
Benefits:
- 3-5x speedup for URL processing phase
- Efficient use of browser pool
- Integrates with adaptive rate limiting
Configuration:
# Default (3 concurrent)
npm start -- -s https://example.com
# Higher concurrency for large sites
npm start -- -s https://example.com --url-concurrency 5
# Sequential processing
npm start -- -s https://example.com --url-concurrency 1
When to adjust:
- Increase (5-10): Fast servers, large sites, powerful hardware
- Decrease (1-2): Slow servers, rate limiting issues, debugging
Adaptive Rate Limiting
What it does: Monitors server responses and adjusts concurrency
Benefits:
- Server-friendly (avoids overwhelming servers)
- Automatic backoff on 429/503 responses
- Gradual recovery when server stabilizes
How it works:
- Starts with configured concurrency (default: 3)
- Monitors for 429 (Too Many Requests) or 503 (Service Unavailable)
- Reduces concurrency on errors (exponential backoff)
- Gradually increases when server recovers
- No configuration needed - works automatically
Cache Staleness Checking
What it does: Validates cache freshness with HTTP HEAD requests
Benefits:
- Ensures data accuracy without re-analysis
- Automatic invalidation when source changes
- Minimal overhead (HEAD requests only)
How it works:
- Checks Last-Modified header on cached pages
- Compares with cache creation time
- Invalidates cache if source is newer
- Falls back to cache if HEAD request fails
- No configuration needed - works automatically
Recommended Configurations
Small sites (<100 URLs):
npm start -- -s https://example.com
# Defaults work well
Medium sites (100-500 URLs):
npm start -- -s https://example.com --browser-pool-size 5 --url-concurrency 5
Large sites (500-5000 URLs):
npm start -- -s https://example.com --browser-pool-size 7 --url-concurrency 7
Very large sites (5000+ URLs):
# Process in batches
npm start -- -s https://example.com -c 1000 --browser-pool-size 7 --url-concurrency 7
Slow or rate-limited servers:
npm start -- -s https://example.com --browser-pool-size 2 --url-concurrency 2
Custom Configuration
Create .web-audit-config.json:
{
"concurrency": 10,
"timeout": 30000,
"userAgent": "Web-Audit-Suite/2.0",
"viewport": {
"width": 1920,
"height": 1080
},
"thresholds": {
"llm_suitability": {
"low": 40,
"medium": 60,
"high": 80
}
}
}
CI/CD Integration
Add to .github/workflows/audit.yml:
name: Web Audit
on:
schedule:
- cron: '0 0 * * 0' # Weekly on Sunday
workflow_dispatch:
jobs:
audit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node
uses: actions/setup-node@v3
with:
node-version: '20'
- name: Install Web Audit Suite
run: |
git clone
cd mx-handbook/mx-audit
npm install
- name: Run Audit
run: |
cd mx-handbook/mx-audit
npm start -- -s https://example.com/sitemap.xml -c -1 \
--enable-history \
--generate-dashboard
- name: Check Thresholds
run: |
# Fail if score drops below threshold
SCORE=$(jq '.overall_score' results/executive_summary.json)
if (( $(echo "$SCORE < 70" | bc -l) )); then
echo "Score $SCORE below threshold 70"
exit 1
fi
- name: Upload Results
uses: actions/upload-artifact@v3
with:
name: audit-results
path: mx-handbook/mx-audit/results/
Interpreting Specific Issues
“Served HTML score significantly lower than rendered”
Meaning: Your site relies heavily on JavaScript for critical content
Impact: Most agents (CLI, API-based) cannot access your content
Fix:
- Server-side render critical content
- Use progressive enhancement
- Ensure HTML contains data before JavaScript runs
“Error messages non-persistent”
Meaning: Errors vanish or are only shown briefly
Impact: Agents miss errors, retry without fixing issues
Fix:
- Remove toast notifications
- Add persistent error summary at top of forms
- Keep errors visible until user corrects them
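One way to implement such a persistent error summary is sketched below. The field names and error text are illustrative; the key points are that the summary stays in the DOM and links to the offending fields:

```html
<form action="/checkout" method="post">
  <!-- Error summary remains in the markup until the user fixes the errors -->
  <div class="form-errors" role="alert">
    <p>2 errors prevented submission:</p>
    <ul>
      <li><a href="#email">Email address is invalid</a></li>
      <li><a href="#postcode">Postcode is required</a></li>
    </ul>
  </div>

  <label for="email">Email</label>
  <input type="email" id="email" name="email" aria-invalid="true">

  <label for="postcode">Postcode</label>
  <input type="text" id="postcode" name="postcode" aria-invalid="true">
</form>
```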
“Missing structured data”
Meaning: No JSON-LD, microdata, or schema.org markup
Impact: Agents cannot reliably extract product/article information
Fix:
- Add JSON-LD script tags to pages
- Use schema.org vocabulary
- Start with Product, Article, or LocalBusiness types
“Incomplete pricing information”
Meaning: Shows “From £99” but actual price is hidden
Impact: Agents compare wrong prices, users surprised at checkout
Fix:
- Display total price upfront
- Include VAT status
- Show delivery costs
- Use structured data for machine-readable prices
“Multiple @type values in Schema.org blocks”
Meaning: JSON-LD blocks contain arrays like `["Article", "NewsArticle"]`
Impact: AI agents trained on entertainment scripts may confuse professional content with fictional dialogue. Multiple types create ambiguity.
Fix:
- Use exactly ONE `@type` per JSON-LD block
- Choose the most specific type: `MedicalScholarlyArticle` over `Article`
- For legal content: use `Legislation` or `LegalDocument`
- For business analysis: use `AnalysisNewsArticle`
- For medical content: use `MedicalScholarlyArticle`
See Chapter 10 for complete guidance on content type disambiguation.
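A check for this issue reduces to asking whether `@type` is an array. A minimal sketch (the `hasAmbiguousType` helper is illustrative, not the suite's internal check):

```javascript
// Sketch: flag JSON-LD blocks whose @type is an array (ambiguous typing).
function hasAmbiguousType(jsonLdText) {
  const data = JSON.parse(jsonLdText);
  return Array.isArray(data["@type"]);
}

console.log(hasAmbiguousType('{"@type": ["Article", "NewsArticle"]}')); // true
console.log(hasAmbiguousType('{"@type": "AnalysisNewsArticle"}'));      // false
```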
“High inline CSS ratio”
Meaning: Many elements have `style=` attributes or inline `<style>` tags
Impact: CLI agents and server-based agents cannot process inline styles. Inline CSS adds noise to the DOM structure without providing semantic value.
Fix:
- Move all styling to external CSS files
- Remove `style=` attributes from HTML elements
- Remove inline `<style>` tags from the document
- Use semantic HTML structure (proper elements, clear hierarchy)
- Keep styling separate from content for maximum agent compatibility
CLI agents see the DOM but cannot process inline styles, making style-dependent content confusing or invisible.
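The metric itself is simple: the share of elements carrying a `style=` attribute. A rough sketch (a real audit would use a DOM parser; this regex version only illustrates the calculation):

```javascript
// Sketch: rough inline-style ratio computed over raw HTML with regexes.
function inlineStyleRatio(html) {
  const elements = (html.match(/<[a-z][^>]*>/gi) || []).length;
  const styled = (html.match(/<[a-z][^>]*\sstyle=/gi) || []).length;
  return elements === 0 ? 0 : styled / elements;
}

console.log(inlineStyleRatio('<div style="color:red"><p>hi</p></div>')); // 0.5
```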
“Carousels without proper attributes”
Meaning: Product carousels, testimonial sliders, or portfolio galleries lack data-slide-index and aria-label attributes
Impact: Agents see only the first slide. Manual advance requires user interaction. Auto-advance changes content mid-parse causing timing failures.
Fix:
- Add `data-total-slides="5"` to the carousel container
- Add `data-slide-index="1"`, `data-slide-index="2"`, etc. to each slide
- Add `aria-label="Slide 1 of 5"` to each slide
- Provide a static "View all" alternative using a `<details>` element with `data-agent-visible="true"`
- Distinguish informational (product, testimonial) carousels from decorative (hero, banner) carousels
See Chapter 11 “Static Alternatives for Dynamic Content” for complete patterns.
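Generating the indexed markup server-side keeps the attributes consistent. A minimal sketch (the `renderCarousel` helper is a hypothetical example):

```javascript
// Sketch: render carousel slides with explicit index and label attributes
// so agents can see how many slides exist and which one is which.
function renderCarousel(slides) {
  const total = slides.length;
  const items = slides
    .map((text, i) =>
      `<div data-slide-index="${i + 1}" aria-label="Slide ${i + 1} of ${total}">${text}</div>`)
    .join("\n");
  return `<div data-total-slides="${total}">\n${items}\n</div>`;
}

console.log(renderCarousel(["First", "Second"]).includes('data-total-slides="2"')); // true
```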
“Animation libraries detected”
Meaning: Typed.js, TypeIt, GSAP, AOS, or Animate.css libraries present on page
Impact: Animated text may be invisible in served HTML. Content reveals gradually, causing agents to miss information. Timing-dependent content extraction failures.
Fix:
- Ensure all text content exists in served HTML before JavaScript enhancement
- Use animation as progressive enhancement, not as primary content delivery
- Add `data-animation-state="complete"` after the animation finishes
- Provide pause controls for animations >5 seconds (WCAG 2.2.2)
“Autoplay media without controls”
Meaning: Video or audio elements with autoplay attribute but no controls attribute
Impact: Violates WCAG 2.2.2 (Pause, Stop, Hide). Causes agent page instability as content loads unpredictably. Agents cannot pause motion that persists >5 seconds.
Fix:
- Add the `controls` attribute to all autoplay media: `<video autoplay controls>`
- Add the `muted` attribute for autoplay compliance
- Mark background videos with `data-video-role="decorative"`
- Provide transcripts for informational video marked `data-video-role="informational"`
“Animated GIFs without alt text”
Meaning: IMG elements with .gif extension lack alt attributes
Impact: Agents cannot interpret animated visual content. Accessibility failure. Information conveyed only through motion is lost.
Fix:
- Add alt text to all animated GIFs describing the animation content
- Use aria-describedby for longer descriptions
- Consider replacing informational GIFs with static images + text descriptions
- Reserve animated GIFs for purely decorative purposes
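Detecting the issue is a matter of finding `.gif` images without an `alt` attribute. A rough sketch (regex-based, illustrative only; a real audit would walk the DOM):

```javascript
// Sketch: count animated GIFs that lack alt text in raw HTML.
function gifsMissingAlt(html) {
  const imgs = html.match(/<img[^>]*>/gi) || [];
  return imgs.filter((tag) => /\.gif/i.test(tag) && !/\salt=/i.test(tag)).length;
}

console.log(gifsMissingAlt('<img src="demo.gif"><img src="demo.gif" alt="Demo animation">')); // 1
```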
“Visual content changes detected”
Meaning: Screenshot comparison revealed page content changing over time (typewriter animations, rotating text, tickers)
Impact: Agents snapshot page at random moments, missing content that hasn’t appeared yet or has already cycled away. Timing-dependent failures.
Example: Arbory Digital homepage cycles “AEM UPGRADE SPECIALISTS” → “AEM EXPERTS” → “SECURITY” - agents only see one variant.
Fix:
- Ensure ALL text variations exist in served HTML before JavaScript enhancement
- Add a `data-content-variations="AEM UPGRADE SPECIALISTS|AEM EXPERTS|SECURITY"` attribute
- Add `data-content-complete="true"` after the animation cycle completes
- Provide a static `<noscript>` alternative showing all content
- Mark animated containers with `data-animation-type="typewriter"` or `data-animation-type="ticker"`
- Consider showing all variations in a list format for agents: `<ul data-agent-visible="true"><li>AEM UPGRADE SPECIALISTS</li><li>AEM EXPERTS</li><li>SECURITY</li></ul>`
Detection method: The audit takes 3 screenshots at random intervals (2-5 seconds apart) and compares visual hashes. Different hashes indicate visual content changes. This catches custom animations that don’t use known libraries.
JavaScript-Dependent Pricing
Metric: jsDependentPricing
Meaning: Price information only appears after JavaScript execution, making it invisible to CLI agents (ChatGPT Shopping, Perplexity Shopping) and server-based agents that cannot execute JavaScript.
Why this matters: E-commerce agents need pricing information to make purchase recommendations. If prices only appear client-side via JavaScript, CLI agents see product descriptions but no prices, blocking purchase decisions entirely.
Common causes:
- React/Vue dynamic pricing - Price fetched from API and rendered client-side
- JavaScript-based currency conversion - Shows “Loading price…” in served HTML
- Lazy-loaded pricing - Price div exists but content added via JavaScript
- Dynamic discount calculations - Final price computed in browser
- Regional pricing detection - JavaScript determines user location and shows appropriate price
Real-world example: A product page shows `<div class="price"></div>` in served HTML, but the actual price $99.99 only appears after JavaScript fetches it from a pricing API. ChatGPT Shopping cannot recommend this product because it sees no price.
Penalty: -15 points (critical severity - blocks purchase recommendations)
How to fix:
- Server-side rendering: Render the initial price in HTML using server-side templating (PHP, Django, Rails, Next.js SSR)
- Schema.org structured data: Add JSON-LD with a Product schema including the price property
- Meta tags: Include `<meta itemprop="price" content="99.99">` as a fallback
- Data attributes: Add `data-price="99.99"` and `data-currency="USD"` to price elements
- Noscript fallback: Provide a `<noscript><span class="price">$99.99</span></noscript>` alternative
Complete example:
```html
<!-- Served HTML (visible to all agents) -->
<div class="product" itemscope itemtype="https://schema.org/Product">
  <h1 itemprop="name">Premium Laptop</h1>

  <!-- Price visible in served HTML -->
  <div class="price"
       itemprop="offers"
       itemscope
       itemtype="https://schema.org/Offer"
       data-price="999.99"
       data-currency="USD">
    <span itemprop="price" content="999.99">$999.99</span>
    <meta itemprop="priceCurrency" content="USD">
  </div>

  <!-- JSON-LD for structured data -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Premium Laptop",
    "offers": {
      "@type": "Offer",
      "price": "999.99",
      "priceCurrency": "USD"
    }
  }
  </script>

  <!-- JavaScript can enhance with regional pricing, discounts -->
  <script>
    // Progressive enhancement only - base price already visible
    enhancePricing();
  </script>
</div>
```

Detection pattern: The audit compares served HTML (before JavaScript) against rendered HTML (after JavaScript) using these price patterns:
- Currency symbols: `$`, `£`, `€`, `¥`
- Price formats: `$99.99`, `99.99 USD`, `£99`
- Schema.org: `itemprop="price"`
- Data attributes: `data-price=`
- Class names: `class="price"`
- JSON-LD: `"price": "99.99"`
If any pattern matches rendered HTML but NOT served HTML, pricing is JavaScript-dependent.
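The served-vs-rendered comparison can be sketched as follows (the pattern list is abbreviated and the `jsDependentPricing` helper is illustrative, not the suite's actual implementation):

```javascript
// Sketch: a price pattern that matches rendered HTML but not served HTML
// indicates JavaScript-dependent pricing.
const PRICE_PATTERNS = [
  /[$£€¥]\s?\d+(\.\d{2})?/,      // currency symbols with amounts
  /itemprop="price"/,            // schema.org microdata
  /data-price=/,                 // data attributes
  /"price":\s*"\d+(\.\d{2})?"/,  // JSON-LD
];

function jsDependentPricing(served, rendered) {
  return PRICE_PATTERNS.some((p) => p.test(rendered) && !p.test(served));
}

const served = '<div class="price"></div>';
const rendered = '<div class="price">$99.99</div>';
console.log(jsDependentPricing(served, rendered)); // true
```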
Getting Help
- Documentation: <>
- Issues: <>
- Examples: See the `web-audit-suite/examples` directory in the repository
Summary Workflow
- Initial Audit: Run full site audit with all reports
- Review Dashboard: Understand current state
- Prioritize Issues: Critical → High → Medium → Low
- Implement Fixes: Start with highest impact, lowest effort
- Re-audit: Verify improvements
- Monitor: Monthly audits to track progress and catch regressions
- Maintain: Keep scores above thresholds
Target Timeline:
- Week 1: Critical issues fixed
- Month 1: High priority issues addressed
- Quarter 1: Medium priority improvements complete
- Year 1: Comprehensive AI agent readiness achieved
Web Audit Suite provides the measurement framework. The Implementation Cookbook (Appendix A) provides the fixes. Together, they transform your site from theory to measurable AI agent compatibility.