The Necessity of AI Quality Assurance - Why AI Needs AI-Powered Checks
- 3 reasons why AI cannot self-check
- Why external AI checks are essential
- The 3-stage quality assurance process
- Practical example from Article 51 (Iterative improvement 6/10 → 8/10)
1. AI Cannot Recognize Its Own Mistakes
In the previous article (Article 51: Limitations of AI Judgment), we analyzed 7 oversights in AI development and clarified the limitations of upstream AI's judgment. A critical issue that emerged was the fact that "current general AI models cannot recognize their own mistakes."
- Initial article quality: 6/10 (needs_improvement)
- Major Issues: 4 items (lack of quantitative basis, title issue, subjectivity, insufficient explanation of technical terms)
- However, the upstream AI that created it did not notice these issues
Why can't AI recognize its own mistakes? Understanding this problem is the first step towards establishing an appropriate quality assurance system.
- Upstream AI: AI responsible for initial stages such as requirements definition and design. Excels at structured planning but is weak in detailed implementation and feasibility verification.
- Downstream AI: AI responsible for concrete code implementation and detailed specification development. Receives designs from upstream AI and creates actual working deliverables.
- Confidence Bias: The tendency for AI to have high confidence in its own judgments and not to question them. This causes difficulty in noticing mistakes.
- Context Dependency: The characteristic of AI being able to make judgments only within the scope of the given information (prompts or conversation history), and being unable to consider external perspectives (reader's perspective, future impact, etc.).
2. Why AI Cannot Self-Check - 3 Reasons
Reason 1: Confidence Bias
AI tends to have high confidence in its own judgments. Once it decides "this implementation is fine," it cannot question that judgment.
Upstream AI used the phrase "quantitative evaluation" even though the evaluation was actually a subjective age conversion, and it never noticed the contradiction. Only feedback from an external AI (Gemini) revealed how much weight the word "quantitative" carries.
Reason 2: Context Dependency
Current general AI models can only make judgments within the given context. It is difficult for them to consider issues outside the context (reader's perspective, SEO, future impact, etc.).
Upstream AI titled the article "Quantitative Evaluation of 7 AI Oversights," but the title contained internal proper nouns that general readers cannot understand. This missing reader's perspective is a typical example of context dependency.
Reason 3: Lack of Self-Evaluation
Current general AI models lack objective evaluation criteria. They can judge that something is good, but they have no criteria for why it is good or how far it should be improved.
3. The Necessity of External AI Checks
For the three reasons above, checks from an independent perspective are extremely important for AI-generated deliverables. To address this, we introduced the Gemini QA Framework.
The Gemini QA Framework is a quality assurance system built independently at the user's initiative. It uses Google's Gemini 2.5 Flash API to evaluate articles from a critical perspective.
Development Background: To compensate for the limitations of upstream AI's judgment (junior high school level) revealed in Article 51, the user (a former software development engineer) proposed using the Gemini API and built the framework independently.
Evaluation Criteria: We implement a 3-stage status management system, evaluating on a 10-point scale: 8 points or more for approved, 6-7 points for conditional_approval, and 5 points or less for rejected.
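The 3-stage status mapping above can be sketched as a small function. The helper itself is hypothetical (not part of the framework's published code); only the thresholds come from the scheme described here:

```python
def qa_status(score: int) -> str:
    """Map a 10-point Gemini evaluation score to a QA status.

    Thresholds follow the 3-stage scheme described above:
    8+ -> approved, 6-7 -> conditional_approval, <=5 -> rejected.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score must be 0-10, got {score}")
    if score >= 8:
        return "approved"
    if score >= 6:
        return "conditional_approval"
    return "rejected"
```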
Demonstration in Article 51
- Initial Evaluation: 6/10 (4 Major Issues)
- 1st Revision: 7/10 (Major Issues reduced to 1)
- 2nd to 7th Revisions: Title change, removal of quantitative expressions, adjustment of ★ rating (5-star importance rating)
- Final Evaluation: 8/10 (approved, 0 Major Issues)
Improvement Result: 6/10 → 8/10 (2-point increase)
This result demonstrates the effectiveness of external AI checks and iterative improvement.
4. AI Development Quality Assurance: 3-Stage Process
In current AI development, quality is ensured through a 3-stage process:
Phase 1: Upstream AI - Initial Implementation
- Role: Requirements definition, design, initial implementation
- Typical Quality: 6/10 level (needs_improvement)
- Characteristic: Cannot recognize problems due to confidence bias
Phase 2: Downstream AI - Improvement Proposal
- Role: Code implementation, addition of specific examples, initial improvements
- Typical Quality: 7/10 level (conditional_approval)
- Characteristic: Receives deliverables from upstream AI and improves implementation aspects
Phase 3: External AI (Gemini) - Critical Evaluation
- Role: Critical evaluation from an independent perspective
- Target Quality: 8/10 or higher (approved)
- Characteristic: Consistency check based on conversation history, detection of logical contradictions, evaluation from a reader's perspective
- Phase 1 → Phase 2: Downstream AI receives the design document and initial implementation created by upstream AI and performs concrete code implementation and improvements.
- Phase 2 → Phase 3: Gemini critically evaluates the deliverables (articles or code) implemented by downstream AI and points out problems.
- Phase 3 → Phase 2 (Iteration): Based on Gemini's feedback, downstream AI makes corrections. This iterates until approval.
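The Phase 2 ⇄ Phase 3 loop can be sketched as follows. The `evaluate` and `revise` callables are hypothetical stand-ins for the actual Gemini and downstream-AI calls; only the control flow reflects the process described above:

```python
from typing import Callable, Tuple

def qa_loop(
    draft: str,
    evaluate: Callable[[str], Tuple[int, str]],  # Phase 3: returns (score, feedback)
    revise: Callable[[str, str], str],           # Phase 2: returns a revised draft
    target: int = 8,
    max_iterations: int = 10,
) -> Tuple[str, int]:
    """Iterate downstream-AI revision and external-AI evaluation until approval."""
    score = 0
    for _ in range(max_iterations):
        score, feedback = evaluate(draft)
        if score >= target:          # 8/10 or higher -> approved
            return draft, score
        draft = revise(draft, feedback)
    return draft, score  # best effort after max_iterations
```

With stub functions that score 6, then 7, then 8, this reproduces the Article 51 trajectory of improving until approval.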
Important: Together, these three stages compensate for the confidence bias and context dependency that no single AI can overcome on its own.
Each AI has its strengths:
- Upstream AI: Excels at structured design
- Downstream AI: Excels at concrete implementation
- Gemini: Excels at objective evaluation
5. Practical Application of Gemini QA Framework
Evaluation Criteria
The Gemini QA Framework evaluates articles from the following perspectives:
- Accuracy: Is there any exaggeration or false statement?
- Clarity: Are the title and headings appropriate?
- Logicality: Is there logical consistency?
- Completeness: Is there insufficient explanation or unexplained technical terms?
- Consistency with Conversation History: Does it follow user instructions?
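One way to encode these five criteria into an evaluation prompt is sketched below. This is an illustration only; the framework's actual prompt wording is not shown in this article:

```python
# The five evaluation perspectives listed above, as criterion -> check question
CRITERIA = {
    "accuracy": "Is there any exaggeration or false statement?",
    "clarity": "Are the title and headings appropriate?",
    "logicality": "Is the article logically consistent?",
    "completeness": "Are any explanations missing or technical terms undefined?",
    "history_consistency": "Does the article follow the user's instructions?",
}

def build_evaluation_prompt(article: str) -> str:
    """Assemble a critical-evaluation prompt covering all five criteria."""
    questions = "\n".join(f"- {name}: {q}" for name, q in CRITERIA.items())
    return (
        "Evaluate the following article critically on a 10-point scale.\n"
        f"Check each criterion:\n{questions}\n\n"
        f"Article:\n{article}"
    )
```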
Specific Feedback Examples from Article 51
- 4th Evaluation: Title change instruction violation (ignored instruction to change an internal proper noun to a general noun) → rejected
- 7th Evaluation: Logical contradiction in ★ rating (importance rating) (inconsistent number of ★ between list and individual evaluations) → rejected
- Correction: Updated the conversation history and explicitly stated the title change
- Correction: Reviewed all ★ ratings (importance ratings) to ensure logical consistency
- Result: Achieved 8/10 (approved)
6. Summary - Quality Assurance in the AI Era
In AI development, quality checks by external AI are extremely important. The reasons are:
- AI cannot self-check: Confidence bias, context dependency, lack of self-evaluation
- An independent perspective is necessary: To discover problems invisible to a single AI
- Iterative improvement is effective: In Article 51, 6/10 → 8/10 was achieved through 8 steps of evaluation and correction.
Future improvements under consideration:
- Automation of the Gemini QA Framework
- Further clarification of evaluation criteria
- Extension to other development phases
Known limitations:
- Gemini's own limitations: Gemini is not perfect either; model-specific biases and misjudgments remain a risk.
- Cost: Using a paid API costs approximately 12 JPY per article (assuming 8 iterations).
- Time: Iterative improvement takes time and is not suitable for urgent corrections.
- Ambiguity of criteria: What exactly constitutes "approval" has not yet been fully clarified.
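The per-article cost figure can be reproduced with simple arithmetic. The per-evaluation cost below is inferred from the stated totals, not from published API pricing:

```python
# Assumption: ~1.5 JPY per Gemini evaluation, inferred from the
# stated total of ~12 JPY per article over 8 iterations.
cost_per_evaluation_jpy = 1.5
iterations = 8
cost_per_article_jpy = cost_per_evaluation_jpy * iterations
print(cost_per_article_jpy)  # → 12.0
```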
Even as AI technology evolves, recognizing AI's limitations and establishing an appropriate checking system is one of the important factors in producing high-quality deliverables.