The Limits of AI Judgment - A Unique Perspective on 7 Overlooked Items in AI Development
📘 Terms Used in This Article
- Eric: Upstream AI (AI responsible for requirements analysis and design)
- George: Downstream AI (AI responsible for implementation and testing)
- V-model: Software development quality assurance model (Wikipedia)
1. The Reality of V-model Application - Eric's Lack of Judgment Exposed
In the previous article 53, we reported a case of "partial success" in applying the V-model to the AI development of Genspark (AI search engine). We shared how we improved the development process through a division of roles between an upstream AI (provisionally named Eric) and a downstream AI (provisionally named George).
However, during the **web application development process (v2.17.3 to v2.19)**, the limits of Eric's (the upstream AI's) judgment became apparent. Specifically, prerendering implemented for SEO in v2.17.3 caused a critical bug that completely stopped the fortune-telling feature.
⚠️ Purpose of this Article
In this article, we quantitatively evaluate the 7 items Eric overlooked during **web application development (v2.17.3 to v2.19)** and honestly disclose the limits of AI's judgment. Perfect AI does not exist. That is precisely why external quality checks (Gemini QA Framework) are necessary.
2. Eric's 7 Overlooked Items - A Unique Perspective from Web App Development
We evaluate Eric's lack of judgment, which became apparent during web app development (v2.17.3 to v2.19), across 7 items. For each item, we assessed its **importance (★1-5)**, **lost time**, and **judgment level**.
📊 Evaluation Criteria
Omission Importance (★ Rating):
- ★1: Minor inconvenience (fix effort < 1 hour)
- ★2: Partial feature degradation (fix effort 1-4 hours)
- ★3: Affects main features (fix effort 4-8 hours)
- ★4: Critical feature stoppage (fix effort 8-16 hours)
- ★5: All features stopped, significant user impact (fix effort > 16 hours)
Lost Time: Actual time measured from bug occurrence to fix completion (calculated from Cron logs (scheduled job execution history) and GitHub history)
Fix Effort: Actual fix work time (extracted from conversation logs)
Note: The ★ rating is based on fix effort, but it is a comprehensive judgment that also considers the impact and importance of each item. In particular, Omission 4, "Lack of Gemini API proposal," was rated ★5 as a fundamental problem affecting the entire future development process, even though the lost time was short.
Omission 1: Prerendering Applied to the Top Page (v2.17.3)
🚨 Most Serious Omission - Fortune-telling Feature Stopped
In v2.17.3, prerendering using Cloudflare Pages Functions was implemented for SEO. Eric should have judged that it should only be applied to "/blog/*".
However, it was also applied to the **top page "/", causing the fortune-telling feature to stop completely**. Prerendering disables client-side JavaScript (JavaScript executed in the browser), which left the dynamic form (the fortune-telling feature) non-functional.
Impact Period: December 15, 2025 (v2.17.3 release) ~ January 8, 2026 (v2.18 recovery)
Quantitative Evaluation:
- Omission Importance: ★★★★★ (5/5) - **Most Critical**
- Scope of Impact: Core application features completely stopped
- Recovery Time: Approx. 1 day (fix work), Fortune-telling feature downtime: 2025-12-15~2026-01-08
- Judgment Level: **Elementary school level** (cannot distinguish between static content and dynamic features)
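The correct scoping is small. The sketch below is a hypothetical reconstruction (the project's actual Pages Functions layout is not shown in this article) that restricts prerendering to blog paths with a pure predicate:

```javascript
// Decide whether a path should be prerendered. The rule Eric should have
// applied: only static blog pages, never the dynamic top page "/".
function shouldPrerender(pathname) {
  return pathname.startsWith('/blog/');
}

// A Cloudflare Pages Functions middleware (functions/_middleware.js) could
// then gate the prerender step on that predicate. prerenderPage() is a
// placeholder for whatever SEO prerendering v2.17.3 actually used.
// export async function onRequest(context) {
//   const { pathname } = new URL(context.request.url);
//   if (shouldPrerender(pathname)) return prerenderPage(context);
//   return context.next(); // serve the SPA shell; client-side JS keeps working
// }
```

Keeping the routing decision in a pure function like this also makes the scope of impact trivially testable before release.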
Omission 2: Insufficient Cron Log Review (during v2.19 bug investigation)
📋 Details
In v2.19, a bug occurred where "article images were not displayed." Eric should have checked the Cloudflare Pages Cron logs, but he did not suggest reviewing the logs.
Result: The user manually checked the logs and discovered that "article data was not registered in the DB."
Quantitative Evaluation:
- Omission Importance: ★★★ (3/5)
- Original review effort: 5 minutes
- Lost time due to omission: 2 hours
- Judgment Level: **Junior high school level** (lack of basic troubleshooting procedures)
Omission 3: Lack of Markdown Rendering (v2.19 Bug 1)
🚨 All blog post displays corrupted
In v2.19, a bug occurred where "headings and paragraphs were not displayed correctly." The root cause was the lack of Markdown to HTML conversion processing in `renderBlogPost()`.
Eric should have instructed Markdown conversion processing in the v2.19 implementation specification, but he completely overlooked it.
Quantitative Evaluation:
- Omission Importance: ★★★★ (4/5)
- Scope of Impact: All blog post displays corrupted
- Fix Effort: 30 minutes (after user pointed it out)
- Judgment Level: **Elementary school level** (does not understand basic rendering processes)
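The missing step is a Markdown-to-HTML conversion inside `renderBlogPost()`. The toy converter below only illustrates the idea; in practice a full parser such as Marked.js (linked at the end of this article) would do this work, and the function and field names here are illustrative assumptions, not the project's actual code.

```javascript
// Toy Markdown-to-HTML conversion: enough to show why skipping this step
// leaves headings and paragraphs displayed incorrectly (the v2.19 bug).
function markdownToHtml(md) {
  return md
    .split(/\n{2,}/) // blocks are separated by blank lines
    .map((block) => {
      const h = block.match(/^(#{1,6})\s+(.*)$/); // ATX heading, e.g. "## Title"
      if (h) return `<h${h[1].length}>${h[2]}</h${h[1].length}>`;
      return `<p>${block}</p>`;
    })
    .join('\n');
}

// Hypothetical renderBlogPost() with the conversion step in place.
function renderBlogPost(post) {
  return `<article>${markdownToHtml(post.bodyMarkdown)}</article>`;
}
```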
Omission 4: Lack of Gemini API Proposal (Phase 5)
🚨 Most Critical Omission - Unable to Propose Tools Independently
In Phase 5, Eric's lack of judgment became apparent, and we were searching for quality check methods. Deep Research and text generation AI were attempted but found unsuitable.
Important: Eric did not propose the available Gemini API. It was only when the user asked, "Can Gemini API be used?" that he finally replied, "Yes, it can."
Chronology:
- Eric's lack of judgment became apparent
- Searched for quality check methods (Deep Research, text generation AI were attempted but unsuitable)
- No Gemini API proposal from Eric
- User asked, "Can Gemini API be used?"
- Eric replied, "Yes, it can."
Quantitative Evaluation:
- Omission Importance: ★★★★★ (5/5) - **Most Critical**
- Original proposal timing: During Phase 5 quality check method search
- Lost time due to omission: 4 hours (trial and error of alternatives)
- Judgment Level: **Elementary school level** (cannot independently propose available tools)
Reference: From the "True Chronology" document - it is recorded that the user "was frustrated why Eric did not make a proposal."
Omission 5: Insufficient Scope of Impact Analysis (during v2.17.3 design)
📋 Details
During v2.17.3 design, Eric should have analyzed the scope of impact of prerendering. The top page "/" has a fortune-telling feature (dynamic form), and prerendering disables client-side JS.
Result: This lack of analysis led to a long-term stoppage of the fortune-telling feature.
Quantitative Evaluation:
- Omission Importance: ★★★ (3/5)
- Original analysis effort: 15 minutes
- Impact of omission: Long-term core feature stoppage
- Judgment Level: **Junior high school level** (lack of ability to analyze technical impact scope)
Omission 6: Insufficient Test Items (v2.18~v2.19)
📋 Details
When the fortune-telling feature was restored in v2.18, Eric included "fortune-telling feature operation check" in the test items. However, he did not include "blog post display check."
Result: Two new bugs were discovered in v2.19.
Quantitative Evaluation:
- Omission Importance: ★★★ (3/5)
- Original effort for adding test items: 5 minutes
- Impact of omission: Occurrence of 2 new bugs
- Judgment Level: **Junior high school level** (lack of basics in test design)
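The five-minute fix would have been one more check on the list. A minimal smoke-check sketch, with hypothetical page fixtures and selectors (the project's real test items are not published), covering both the dynamic top page and blog post rendering:

```javascript
// Returns a list of failures across the two areas that actually broke.
// `pages` maps a path to its rendered HTML (hypothetical fixture shape).
function smokeCheck(pages) {
  const failures = [];
  // v2.18's test items covered the fortune-telling feature...
  if (!pages['/'].includes('<form')) {
    failures.push('fortune-telling form missing on "/"');
  }
  // ...but not blog post display, which let the v2.19 bugs through.
  if (!pages['/blog/sample'].includes('<h1')) {
    failures.push('blog post heading missing on "/blog/sample"');
  }
  return failures;
}
```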
Omission 7: Insufficient Understanding of Specification v2.17.3's Intent
🚨 Lack of Comprehension
The v2.17.3 specification stated "SEO measures for **blog pages**." Eric should have understood "blog pages = /blog/*".
However, he misunderstood it as "all pages" and applied prerendering.
Quantitative Evaluation:
- Omission Importance: ★★★★ (4/5)
- Original reading effort: 10 minutes (thorough reading of specifications)
- Impact of omission: Long-term core feature stoppage
- Judgment Level: **Elementary school level** (lack of comprehension, specification reading ability)
3. Analysis: Is Eric Really at a Junior High School Level?
📐 Objective Definition of Judgment Levels
| Level | Age Equivalent | Characteristics of Judgment Ability |
|---|---|---|
| **Upper Elementary School** | 10-11 years old | Can understand basic cause-and-effect, but weak in abstract thinking |
| **1st Year Junior High** | 12 years old | Can consider multiple factors, but cannot see the impact on the entire system |
| **2nd-3rd Year Junior High** | 13-14 years old | Can think logically, but has blind spots due to lack of experience |
| **High School** | 15-17 years old | Can think systematically, but lacks specialized knowledge |
| **Mid-level Employee** | 25-35 years old | Has practical experience and high problem-solving ability |
**Calculation of Experience Gap**: Eric averages roughly 12 years old (equivalent to 1st year of junior high school) vs. George, estimated at 25-30 years old (equivalent to a mid-level employee) = an experience gap of approximately 13 years
⚠️ Note: The expression of judgment levels by age is a **metaphorical explanation** to aid reader comprehension. It is not a scientific measurement of AI capabilities, but a **subjective evaluation** derived from actual development experience. The purpose of this metaphor is to make the limits of AI's judgment easier to visualize.
Analyzing the 7 omissions clarifies Eric's judgment level.
Definition of Evaluation Criteria
| Level | Characteristics of Ability | Applicable Items |
|---|---|---|
| **Elementary school level** | Cannot understand basic technology or text | Omissions 1, 3, 4, 7 |
| **Junior high school level** | Understands basic procedures but lacks adaptability and analytical skills | Omissions 2, 5, 6 |
| **High school level** | Can do basics but lacks specialized knowledge and design skills | - |
| **Mid-level employee level** | Capable of advanced judgment, design, and proposal | - |
Eric's Overall Evaluation
Breakdown of 7 Overlooked Items:
- Elementary school level: 4 items (57.1%)
- Junior high school level: 3 items (42.9%)
- High school level: 0 items (0%)
Average Judgment Ability: Upper elementary school to 1st year junior high school level (equivalent to 11-12 years old)
Most Serious Omissions
- **Omission 4: "Lack of Gemini API proposal"** - Unable to propose available tools
- **Omission 7: "Insufficient understanding of specification's intent"** - Lack of comprehension
- **Omissions 1, 3, 5: "Lack of technical understanding"** - Distinction between static/dynamic, rendering process, scope of impact analysis
Contrast: George's Implementation Capability
Evaluation of George (Downstream AI):
- **Implementation Quality**: Mid-level employee level or higher
- **Code Quality**: High (see reference)
- **Problem**: Faithfully implements even Eric's incorrect instructions
Conclusion: The difference in capabilities between Eric and George is equivalent to approximately 13 years of experience.
Reference: From the perspective of software quality assurance, the quality of upstream processes (requirements definition, design) determines the quality of downstream processes (implementation, testing). Eric's lack of judgment directly impacts overall quality.
4. Without External Checks, the Fortune-telling Feature Would Have Remained Stopped for a Long Time
Detecting Eric's omissions and improving the web app required external quality checks by the **Gemini QA Framework**.
Phase 6: Discovery of Gemini API (User-Led)
In Phase 5, Eric's lack of judgment became apparent, and we were searching for quality check methods. The user asked, "Can the Gemini API be used?", and from that suggestion we developed a quality check method using the Gemini API.
Phase 7: The Gemini QA method proved effective. The user (a former software development engineer) verified the results and confirmed that Gemini's judgments were correct.
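As a rough sketch of what such an external check can look like, the snippet below builds a review request for the Gemini API's REST `generateContent` endpoint. The prompt wording, rubric, and model name are illustrative assumptions; the article does not publish the actual Gemini QA Framework prompts.

```javascript
// Build a generateContent request body asking Gemini to act as an external
// QA reviewer of a specification plus an implementation summary.
function buildQaRequest(spec, implementation) {
  return {
    contents: [{
      parts: [{
        text:
          'You are an external QA reviewer.\n\n' +
          `Specification:\n${spec}\n\n` +
          `Implementation summary:\n${implementation}\n\n` +
          'List any scope-of-impact, rendering, or test-coverage omissions.',
      }],
    }],
  };
}

// The actual call (left as a comment so the sketch stays self-contained;
// MODEL and API_KEY are placeholders):
// const res = await fetch(
//   `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent?key=${API_KEY}`,
//   { method: 'POST', headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify(buildQaRequest(spec, implSummary)) },
// );
```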
Why External Checks Are Necessary
- Limits of Self-Evaluation: Eric cannot recognize his own omissions. Even if he self-evaluated as "perfect," there were actually many problems.
- Objective Perspective: Gemini can objectively evaluate specifications and implementations.
- Early Detection: Many problems can be detected in advance before the user's final confirmation.
- Judgment Reinforcement: Eric's junior high school level judgment can be reinforced with Gemini.
Lessons from v2.18~v2.19
If Gemini QA Framework had existed at v2.17.3:
- The "prerendering applied to the top page" could potentially have been detected in advance.
- The "lack of Markdown rendering" could have been detected before implementation.
- The "insufficient test items" could have been pointed out.
- The long-term stoppage of the fortune-telling feature could potentially have been prevented.
5. Summary - Acknowledge Eric's Limits and Strengthen the Checking System
✅ Key Learnings
- Do not overtrust AI's capabilities: This analysis revealed that Eric's judgment ability is at the upper elementary school to 1st year junior high school level.
- Omissions will inevitably occur: 2 of the 7 items were rated "Most Critical (★5)", and 4 of the 7 were rated ★4 or higher.
- Mandatory external checks: Reinforce judgment with Gemini QA Framework.
- User's final confirmation: AI alone cannot complete the task. User's expertise and judgment are essential.
- Continuous improvement: Generalize Gemini QA method in Phase 8 to make it usable in other projects.
Connection to the Next Article
In this article 51, we quantitatively evaluated the 7 items Eric overlooked during web app development (v2.17.3 to v2.19) and revealed the limits of AI's judgment. In particular, **Omission 4: "Lack of Gemini API proposal"** contains important implications for AI development.
In **the next article 52**, we will delve into the root cause of why Eric did not propose the Gemini API. Additionally, from the perspective of continuous quality improvement, we will propose measures to improve the quality assurance system.
Positioning of Article 51
Article 51 honestly discloses the background behind Article 53's "partial success" and **demonstrates the importance of quality assurance in web app development**. Perfect AI does not exist. That is precisely why an external checking system is necessary.
📚 Related Articles and Links
- Article 53: Applying the V-model to AI development for Genspark (AI Search Engine) - The Debut of the Eric-George Method
- Article 52: Gemini QA Framework - Implementation of Quality Check Automation (Planned)
- Web App (Production Environment)
- V-model - Wikipedia
- Software Quality Assurance - Union of Japanese Scientists and Engineers
- Gemini API Documentation
- Cloudflare Pages Functions
- Prerendering and SEO - web.dev
- Marked.js - Markdown Parser
- Continuous Quality Improvement - Union of Japanese Scientists and Engineers
- Software Development Life Cycle (SDLC) - IPA