The Necessity of AI Quality Assurance - Why AI Needs AI-Powered Checks
- 3 reasons why AI cannot self-check
- Why external AI checks are essential
- The 3-stage quality assurance process
- Practical example from Article 51 (Iterative improvement 6/10 → 8/10)
1. AI Cannot Recognize Its Own Mistakes
In the previous article (Article 51: Limitations of AI Judgment), we analyzed 7 oversights in AI development and clarified the limitations of upstream AI's judgment. A critical issue that emerged was the fact that "current general AI models cannot recognize their own mistakes."
- Initial article quality: 6/10 (needs_improvement)
- Major Issues: 4 items (lack of quantitative basis, title issue, subjectivity, insufficient explanation of technical terms)
- However, the upstream AI that created it did not notice these issues
Why can't AI recognize its own mistakes? Understanding this problem is the first step towards establishing an appropriate quality assurance system.
- Upstream AI: AI responsible for initial stages such as requirements definition and design. Excels at structured planning but is weak in detailed implementation and feasibility verification.
- Downstream AI: AI responsible for concrete code implementation and detailed specification development. Receives designs from upstream AI and creates actual working deliverables.
- Confidence Bias: The tendency for AI to have high confidence in its own judgments and not to question them. This causes difficulty in noticing mistakes.
- Context Dependency: The characteristic of AI being able to make judgments only within the scope of the given information (prompts or conversation history), and being unable to consider external perspectives (reader's perspective, future impact, etc.).
2. Why AI Cannot Self-Check - 3 Reasons
Reason 1: Confidence Bias
AI tends to have high confidence in its own judgments. Once it decides "this implementation is fine," it cannot question that judgment.
Upstream AI used the phrase "quantitative evaluation" even though the evaluation was actually a subjective age conversion, and it never noticed the contradiction. Only feedback from an external AI (Gemini) revealed how much weight the word "quantitative" carries.
Reason 2: Context Dependency
Current general AI models can only make judgments within the given context. It is difficult for them to consider issues outside the context (reader's perspective, SEO, future impact, etc.).
Upstream AI titled the article "Quantitative Evaluation of 7 AI Oversights," but the title contained internal proper nouns that general readers cannot understand. This missing reader's perspective is a typical example of context dependency.
Reason 3: Lack of Self-Evaluation
Current general AI models lack objective evaluation criteria. They can judge that something is good, but they have no criteria for why it is good or how far it should be improved.
3. The Necessity of External AI Checks
For the three reasons above, checks from an independent perspective are extremely important for AI-generated deliverables. To address this, we introduced the Gemini QA Framework.
The Gemini QA Framework is a quality assurance system built independently at the user's initiative. It uses Google's Gemini 2.5 Flash API to evaluate articles from a critical perspective.
Development Background: To compensate for the limitations of upstream AI's judgment (junior high school level) revealed in Article 51, the user (a former software development engineer) proposed using the Gemini API and built the framework independently.
Evaluation Criteria: We implement a 3-stage status management system, evaluating on a 10-point scale: 8 points or more for approved, 6-7 points for conditional_approval, and 5 points or less for rejected.
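The 3-stage status mapping above can be sketched as a small function. The helper itself is hypothetical (not part of the framework's published code); only the thresholds come from the scheme described here:

```python
def qa_status(score: int) -> str:
    """Map a 10-point Gemini evaluation score to a QA status.

    Thresholds follow the 3-stage scheme described above:
    8+ -> approved, 6-7 -> conditional_approval, <=5 -> rejected.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score must be 0-10, got {score}")
    if score >= 8:
        return "approved"
    if score >= 6:
        return "conditional_approval"
    return "rejected"
```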
Demonstration in Article 51
- Initial Evaluation: 6/10 (4 Major Issues)
- 1st Revision: 7/10 (Major Issues reduced to 1)
- 2nd to 7th Revisions: Title change, removal of quantitative expressions, adjustment of ★ rating (5-star importance rating)
- Final Evaluation: 8/10 (approved, 0 Major Issues)
Improvement Result: 6/10 → 8/10 (2-point increase)
This result demonstrates the effectiveness of external AI checks and iterative improvement.
4. AI Development Quality Assurance: 3-Stage Process
In current AI development, quality is ensured through a 3-stage process:
Phase 1: Upstream AI - Initial Implementation
- Role: Requirements definition, design, initial implementation
- Typical Quality: 6/10 level (needs_improvement)
- Characteristic: Cannot recognize problems due to confidence bias
Phase 2: Downstream AI - Improvement Proposal
- Role: Code implementation, addition of specific examples, initial improvements
- Typical Quality: 7/10 level (conditional_approval)
- Characteristic: Receives deliverables from upstream AI and improves implementation aspects
Phase 3: External AI (Gemini) - Critical Evaluation
- Role: Critical evaluation from an independent perspective
- Target Quality: 8/10 or higher (approved)
- Characteristic: Consistency check based on conversation history, detection of logical contradictions, evaluation from a reader's perspective
- Phase 1 → Phase 2: Downstream AI receives the design document and initial implementation created by upstream AI and performs concrete code implementation and improvements.
- Phase 2 → Phase 3: Gemini critically evaluates the deliverables (articles or code) implemented by downstream AI and points out problems.
- Phase 3 → Phase 2 (Iteration): Based on Gemini's feedback, downstream AI makes corrections. This iterates until approval.
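The Phase 2 ⇄ Phase 3 loop can be sketched as follows. The `evaluate` and `revise` callables are hypothetical stand-ins for the actual Gemini and downstream-AI calls; only the control flow reflects the process described above:

```python
from typing import Callable, Tuple

def qa_loop(
    draft: str,
    evaluate: Callable[[str], Tuple[int, str]],  # Phase 3: returns (score, feedback)
    revise: Callable[[str, str], str],           # Phase 2: returns a revised draft
    target: int = 8,
    max_iterations: int = 10,
) -> Tuple[str, int]:
    """Iterate downstream-AI revision and external-AI evaluation until approval."""
    score = 0
    for _ in range(max_iterations):
        score, feedback = evaluate(draft)
        if score >= target:          # 8/10 or higher -> approved
            return draft, score
        draft = revise(draft, feedback)
    return draft, score  # best effort after max_iterations
```

With stub functions that score 6, then 7, then 8, this reproduces the Article 51 trajectory of improving until approval.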
Important: Together, these three stages compensate for the confidence bias and context dependency that no single AI can overcome on its own.
Each AI has its strengths:
- Upstream AI: Excels at structured design
- Downstream AI: Excels at concrete implementation
- Gemini: Excels at objective evaluation
5. Practical Application of Gemini QA Framework
Evaluation Criteria
The Gemini QA Framework evaluates articles from the following perspectives:
- Accuracy: Is there any exaggeration or false statement?
- Clarity: Are the title and headings appropriate?
- Logicality: Is there logical consistency?
- Completeness: Is there insufficient explanation or unexplained technical terms?
- Consistency with Conversation History: Does it follow user instructions?
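One way to encode these five criteria into an evaluation prompt is sketched below. This is an illustration only; the framework's actual prompt wording is not shown in this article:

```python
# The five evaluation perspectives listed above, as criterion -> check question
CRITERIA = {
    "accuracy": "Is there any exaggeration or false statement?",
    "clarity": "Are the title and headings appropriate?",
    "logicality": "Is the article logically consistent?",
    "completeness": "Are any explanations missing or technical terms undefined?",
    "history_consistency": "Does the article follow the user's instructions?",
}

def build_evaluation_prompt(article: str) -> str:
    """Assemble a critical-evaluation prompt covering all five criteria."""
    questions = "\n".join(f"- {name}: {q}" for name, q in CRITERIA.items())
    return (
        "Evaluate the following article critically on a 10-point scale.\n"
        f"Check each criterion:\n{questions}\n\n"
        f"Article:\n{article}"
    )
```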
Specific Feedback Examples from Article 51
- 4th Evaluation: Title change instruction violation (ignored instruction to change an internal proper noun to a general noun) → rejected
- 7th Evaluation: Logical contradiction in ★ rating (importance rating) (inconsistent number of ★ between list and individual evaluations) → rejected
- Correction: Updated the conversation history and explicitly stated the title change
- Correction: Reviewed all ★ ratings (importance ratings) to ensure logical consistency
- Result: Achieved 8/10 (approved)
6. Summary - Quality Assurance in the AI Era
In AI development, quality checks by external AI are extremely important. The reasons are:
- AI cannot self-check: Confidence bias, context dependency, lack of self-evaluation
- An independent perspective is necessary: To discover problems invisible to a single AI
- Iterative improvement is effective: In Article 51, 6/10 → 8/10 was achieved through 8 steps of evaluation and correction.
Future improvements under consideration:
- Automation of the Gemini QA Framework
- Further clarification of evaluation criteria
- Extension to other development phases
Known limitations:
- Gemini's own limitations: Gemini is not perfect either; model-specific biases and misjudgments remain a risk.
- Cost: Using a paid API costs approximately 12 JPY per article (assuming 8 iterations).
- Time: Iterative improvement takes time and is not suitable for urgent corrections.
- Ambiguity of criteria: What exactly constitutes "approval" has not yet been fully clarified.
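The per-article cost figure can be reproduced with simple arithmetic. The per-evaluation cost below is inferred from the stated totals, not from published API pricing:

```python
# Assumption: ~1.5 JPY per Gemini evaluation, inferred from the
# stated total of ~12 JPY per article over 8 iterations.
cost_per_evaluation_jpy = 1.5
iterations = 8
cost_per_article_jpy = cost_per_evaluation_jpy * iterations
print(cost_per_article_jpy)  # → 12.0
```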
Even as AI technology evolves, recognizing AI's limitations and establishing an appropriate checking system is one of the important factors in producing high-quality deliverables.