Gemini QA Framework Built with Genspark - AI Quality Assurance

1. Introduction: The Necessity of AI Quality Assurance

While developing web applications, we repeatedly faced situations where articles and specification documents required multiple rounds of revision. I, as the user, had to manually check the content generated by Eric (the AI responsible for upstream processes), point out issues, and request corrections. This process was time-consuming and inefficient.

As revealed in the Article 53-46 series, Eric's judgment had limitations (Article 51: Eric's Judgment Test), making third-party checks by the user indispensable. However, manually checking everything was not realistic.

Therefore, we built the Gemini QA Framework. This framework leverages the Gemini API (Google's generative AI model) to automatically check the quality of articles and specifications, point out issues, and send them back for revision. By having Gemini automate the checking tasks that I would originally perform, we significantly reduced development man-hours.

2. Development Background: Why the Gemini QA Framework Was Necessary

2-1. Issues Revealed in Article 53-46

The following issues became clear in the Article 53-46 series:

  • Eric's Judgment Limitations: Experiments in Article 51 revealed that Eric's judgment level was equivalent to that of a middle school student. As an AI responsible for upstream processes (design and requirements definition), this level of judgment was insufficient.
  • User Check Burden: I had to review each article and specification document generated by Eric and issue correction instructions. Sometimes, 4 to 5 revisions were necessary per article.
  • Quality Variability: Eric's output quality was inconsistent, sometimes containing incorrect information or logical contradictions.

2-2. Gemini's Quality Assurance Capability Demonstrated in Article 52

In Article 52 (The Importance of Quality Assurance in AI Development), it became clear that critical evaluation using the Gemini API was highly effective. Gemini:

  • Pointed out logical contradictions in Eric's analysis results
  • Identified insufficient evidence and conclusions based on speculation
  • Identified issues before implementation, preventing rework

Following this success, we decided to structure the Gemini API into a framework to automatically perform quality checks at all phases of development.

3. Overview of the Gemini QA Framework

💡 Purpose of the Gemini QA Framework
The primary goal of this framework is to reduce the user's man-hours spent on quality checks. Gemini automates the repeated checking and correction-instruction work that the user would otherwise have to perform. This allows the user to focus on the essential aspects of development.

3-1. Main Features

  • Automatic Quality Checks: Automatically evaluates deliverables such as articles, specification documents, and design documents.
  • Critical Evaluation Mode: Sets the Gemini API temperature to 0.3 for a cautious evaluation.
  • 6 Phase-Specific Quality Criteria: Evaluates based on quality criteria appropriate for each phase—requirements definition, specification creation, problem analysis, design, post-testing, and pre-deployment.
  • Automatic Phase Detection: Automatically determines the current phase based on keywords and applies appropriate evaluation criteria.
  • Automatic Conversation History Retrieval (implemented in v2): Automatically retrieves the entire current conversation session to perform context-aware evaluation.

3-2. Evaluation Criteria

The Gemini QA Framework evaluates deliverables from the following perspectives:

  1. Consistency with Conversation History (Most Important): Checks for contradictions with user instructions and past conversation content.
  2. Logical Consistency: Checks if the content is free of contradictions and logically consistent.
  3. Explanation of Technical Terms: Checks if new technical terms are adequately explained.
  4. Reader's Perspective: Checks if the content is understandable for beginners and if sufficient examples are provided.
  5. SEO Optimization: Checks if keywords are included in the title (meta descriptions are not evaluated).
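The article does not show the framework's actual evaluation prompt, but the five criteria above suggest its shape. The following is a hypothetical sketch of a prompt builder; the wording, function name, and file-based interface are all assumptions, not the framework's real implementation:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: embed the five evaluation criteria into a QA prompt.
# The exact prompt wording used by the framework is not shown in the article.
build_qa_prompt() {
  # $1: path to the deliverable; $2: path to the (compressed) conversation history
  cat <<EOF
You are a strict QA reviewer. Evaluate the deliverable below against:
1. Consistency with the conversation history (most important)
2. Logical consistency
3. Explanation of technical terms
4. Reader's perspective (beginner-friendly, enough examples)
5. SEO: keywords in the title (do not evaluate meta descriptions)
Respond only in the agreed JSON evaluation format.

--- CONVERSATION HISTORY ---
$(cat "$2")

--- DELIVERABLE ---
$(cat "$1")
EOF
}
```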

3-3. Output Format

Gemini outputs evaluation results in the following JSON format:

{
  "overall_quality": "excellent|good|acceptable|needs_improvement|poor",
  "quality_score": 1-10,
  "approval_status": "approved|conditional_approval|rejected",
  "critical_issues": [...],
  "major_issues": [...],
  "minor_issues": [...],
  "summary": "Summary of overall evaluation"
}

If approval_status is rejected, corrections are required; conditional_approval means approval on the condition that the flagged issues are fixed; approved indicates official approval.
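The article does not list the calling scripts in full, but a caller could gate the workflow on approval_status roughly as follows. This is a minimal sketch under stated assumptions: the helper names, the file path, and the use of sed instead of jq are mine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract approval_status from a saved Gemini QA result.
# A production script would more robustly parse the JSON with jq.
get_approval_status() {
  # $1: path to the JSON result file
  sed -n 's/.*"approval_status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Gate the pipeline on the verdict: rejected means send back for revision.
check_verdict() {
  case "$(get_approval_status "$1")" in
    approved)             echo "pass" ;;
    conditional_approval) echo "pass-with-fixes" ;;
    rejected)             echo "revise" ;;
    *)                    echo "unknown" ;;
  esac
}
```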

4. Implementation Details

4-1. Script Structure

The Gemini QA Framework is implemented as Bash scripts:

  • article_quality_check_critical_fixed.sh (Initial version): Manually specifies the conversation history file.
  • article_quality_check_auto_v2.sh (Improved version): Automatically retrieves the entire current conversation session (recommended).

4-2. Automatic Conversation History Retrieval (Important Improvement)

In the v2 script, we significantly improved how conversation history is handled:

  • User utterances are kept in full: User instructions and requests are preserved verbatim.
  • AI assistant utterances are summarized: Long responses and redundant parts are summarized to reduce token consumption for the Gemini API.
  • Inheritance from previous session: Previous conversation history files can optionally be specified.

This mechanism allows Gemini to understand the overall flow of article creation and perform evaluations that accurately reflect the user's latest instructions (e.g., "Please delete the description of Eric George").
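The v2 preprocessing step can be sketched as follows. The real script is not reproduced in this article, so the role-prefixed transcript format, the function name, and the use of plain truncation in place of real summarization are all assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of v2 history preprocessing: keep "USER:" lines
# verbatim, shorten "ASSISTANT:" lines to cut Gemini API token consumption.
# Assumes a simple transcript format with one utterance per line.
compress_history() {
  # $1: transcript file; $2: max characters kept per assistant utterance
  max="${2:-200}"
  while IFS= read -r line; do
    case "$line" in
      USER:*)
        # user instructions are preserved in full
        echo "$line" ;;
      ASSISTANT:*)
        if [ "${#line}" -gt "$max" ]; then
          # crude truncation stands in for the real summarization step
          echo "$(printf '%s' "$line" | cut -c1-"$max") ...(summarized)"
        else
          echo "$line"
        fi ;;
      *)
        echo "$line" ;;
    esac
  done < "$1"
}
```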

4-3. Utilization of the Gemini API

The Gemini QA Framework uses Gemini 2.5 Flash:

  • Model: gemini-2.5-flash
  • Temperature: 0.3 (set low for cautious evaluation)
  • Output format: JSON

Bash scripts are executed in Genspark's sandbox environment, calling the Gemini API with the curl command.
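A sketch of such a call is shown below. The endpoint and payload shape follow the public generateContent REST API for Gemini; the GEMINI_API_KEY variable, the helper name, and the prompt text are assumptions, and the prompt is not JSON-escaped in this simplified version:

```shell
#!/usr/bin/env bash
# Sketch of the API call as described: gemini-2.5-flash, temperature 0.3,
# JSON output, invoked with curl. This is illustrative, not the real script.
build_payload() {
  # $1: evaluation prompt (must not contain unescaped quotes in this sketch)
  cat <<EOF
{
  "contents": [{"parts": [{"text": "$1"}]}],
  "generationConfig": {"temperature": 0.3, "responseMimeType": "application/json"}
}
EOF
}

# Actual call (requires a valid GEMINI_API_KEY):
# curl -s -X POST \
#   "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=${GEMINI_API_KEY}" \
#   -H "Content-Type: application/json" \
#   -d "$(build_payload "Evaluate this deliverable critically: ...")"
```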

4-4. Phase Auto-Detection Logic

The script automatically determines the phase based on keywords:

Detection Examples:
  • “Test,” “test” → testing
  • “Function,” “class” → implementation
  • “Problem,” “root cause” → analysis
  • “Specification document,” “API specification” → specification
  • “Design,” “architecture” → design
  • “Request,” “requirement” → requirements

If the phase cannot be identified, it is evaluated using general quality criteria.
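The detection logic above can be sketched as a simple case statement. The keyword lists mirror the article's examples; the function name, the lowercasing step, and the first-match priority order are assumptions about the real script:

```shell
#!/usr/bin/env bash
# Sketch of keyword-based phase auto-detection, falling back to general
# quality criteria when no keyword matches.
detect_phase() {
  # $1: text of the deliverable or task description
  text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')  # case-insensitive match
  case "$text" in
    *test*)                                           echo "testing" ;;
    *function*|*class*)                               echo "implementation" ;;
    *problem*|*"root cause"*)                         echo "analysis" ;;
    *"specification document"*|*"api specification"*) echo "specification" ;;
    *design*|*architecture*)                          echo "design" ;;
    *request*|*requirement*)                          echo "requirements" ;;
    *)                                                echo "general" ;;
  esac
}
```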

5. Operation and Effects: Specific Rejection Cases

5-1. Fortune-telling App Development v2.20 Case Study (Pre-discovery of Insufficient Testing)

During the development of web app v2.20, "Blog Article Generation Optimization," Gemini pointed out the following issues:

Gemini QA Results:
  • Overall Quality: needs_improvement
  • Test Coverage Score: 4/10
  • Critical Issues: 3 items
    1. Unclear achievement of performance targets
    2. Error case testing not implemented (concerns about stable operation)
    3. Unclear intent behind changing article count from 90 to 30
  • Untested Items: 6 items
    • Measurement of CPU time during actual Cron execution
    • Verification of stable operation over a long period (1 week)
    • Error case testing
    • Load testing
    • Security testing
    • Quality evaluation of article generation

Following Gemini's feedback, a correction plan (v2.20_correction_plan.md) was created, and the following improvements were implemented:

  • CPU time reduction: 1500~3000ms → 500~800ms (approx. 67% reduction)
  • Success rate improvement: 50% → 90% or more
  • Clarification of change reason: Clearly stated that the change from 3 articles per day to 1 article per day was to circumvent the Cloudflare Workers Free plan's CPU time limit.

User Man-Hour Reduction Effect:
Gemini proactively identified insufficient testing and potential issues that I had not noticed. This prevented rework after implementation and issues in the production environment.

5-2. ERIC Analysis Quality Assurance Review (Blocked Implementation)

ERIC's initial analysis attributed the Gemini API's 503 error to a "temporary failure." However, the Gemini QA Framework pointed out the following issues:

  • Insufficient root cause identification (analysis based on speculation)
  • Lack of evidentiary data
  • Insufficient review of specifications and documentation

Gemini's final evaluation was "Unimplementable" (Overall Rating: ⭐ Immature). This feedback prevented implementation in a faulty state and revealed the need for further investigation and correction.

User Man-Hour Reduction Effect:
By Gemini detecting issues before implementation, rework costs after implementation were significantly reduced.

5-3. Case Study in Article Creation (Quality Improvement of This Blog Post)

The Gemini QA Framework is also active in creating articles for this blog, "Genspark Development Struggle." For Article 51: Genspark and Gemini API Selection - Practical AI Development Environment, Gemini withheld full approval four times (two rejections and two conditional approvals) across five checks, gradually improving quality:

| Check | Quality Score | Approval Status | Feedback Content |
|---|---|---|---|
| 1st | 4/10 | rejected | 2 Critical Issues (insufficient emphasis on Eric's limitations, insufficient explanation of contradictions); 2 Major Issues (insufficient explanation of technical terms, lack of logical consistency) |
| 2nd | 4/10 | rejected | Important recommendation box remained despite deletion instruction; deletion of Eric George's description incomplete |
| 3rd | 6/10 | conditional_approval | 1 Major Issue (insufficient explanation of Jupyter Notebook and prompt engineering) |
| 4th | 7/10 | conditional_approval | 1 Major Issue (insufficient explanation of Cloudflare Workers) |
| 5th | 8/10 | approved ✅ | Only 2 Minor Issues; approved |

User Man-Hour Reduction Effect:
Originally, I would have needed to read through the article four times, point out issues, and issue correction instructions. By Gemini automatically performing this task, my man-hours were significantly reduced.

6. Summary: Significance of the Gemini QA Framework

The Gemini QA Framework is a mechanism that automates quality assurance in AI development and significantly reduces user man-hours.

  • Fortune-telling App Development: Pre-discovery of insufficient testing, 67% reduction in CPU time, 50%→90% improvement in success rate.
  • ERIC Analysis: Reduced rework costs by detecting issues before implementation.
  • Article Creation: Quality score improved from 4/10 to 8/10 over five checks.
  • User Man-Hours: Gemini automates quality check tasks that the user would originally perform.

This framework was realized by combining Genspark's sandbox environment with the Gemini API. The choice of development environment, introduced in Article 51, made the construction of this framework possible.

Going forward, we will further leverage the Gemini QA Framework across more development phases to improve the quality and efficiency of AI development.