Tech Blog

Two pitfalls encountered in AI assistant operation — “LLM guessing > API response” incident and “Please = execution instruction” misunderstanding problem

GitHub Copilot, AI assistant, failure stories, lessons learned, Qiita API, rate limit, Execution Consent

🔗 Series Table of Contents: This article is the Pitfalls Edition of the series “AI Assistant Operations Notes - Practical record for raising Copilot / Claude Code as your partner”.

What you can learn from this article

  • Examples of two typical pitfalls that are easy to fall into when operating with an AI assistant
  • Structure of the “LLM guess > API response” misjudgment and rules to prevent it
  • Issues where “Please” is interpreted as an execution instruction and the design of AI consent rules (Execution Consent)
  • Three principles of AI operation learned from failure

Target audience

  • Anyone who has experienced a similar accident while using an AI assistant (Copilot / Claude Code / Cursor, etc.)
  • Anyone who is delegating more and more work to AI and wants to establish operational rules
  • Anyone who wants to learn from failure stories (including those who expect funny, self-deprecating ones)

Operating environment

| Item | Contents |
|---|---|
| AI at the time | GitHub Copilot Chat (including Agent mode) |
| Target API | Qiita REST API v2 (PATCH /api/v2/items) |
| Where lessons are saved | Homemade RAG (ChromaDB + Ollama), .instructions.md, public records like this article |

1. Introduction

When you use an AI assistant every day, you come across pitfalls that make you think, “Oh, someone is going to step on this again next time.”

I’ve had two particularly painful accidents in the past month.

  1. The incident where an LLM’s guess was prioritized over the API response
  2. The night 24 PATCH requests ran out of control because the AI interpreted “Please” as an instruction to execute

In both cases, the root cause was not only the AI’s fault but also poor operational design on the human side (mine). This article records both incidents, along with the rules I wrote so that I never step on them again.

2. Pitfall ① “LLM guess > API response” incident

What happened

I was trying to update an article using Qiita’s REST API and hit the rate limit (HTTP 429). Qiita’s response includes a Rate-Reset header, which holds the Unix timestamp at which the limit is lifted.

HTTP/1.1 429 Too Many Requests
Rate-Reset: 1777492330

I asked Copilot, “When can we try again?” Copilot also consulted Gemini Ultra just to be sure, and replied:

According to Gemini’s analysis, since Qiita uses a sliding window method, a complete shutdown of at least 24 hours is required after a 429 occurs.

So I waited for a day. The next day I tried again and immediately got a 429 again.

This seemed strange, so I looked closer and found that the Rate-Reset header had specified a time only 51 minutes away. That is completely different from Gemini’s “24 hours”.
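Reading the header directly would have answered the question at once. As a minimal sketch, assuming the response headers are available as a plain dictionary (the function names `seconds_until_reset` and `reset_time_utc` are illustrative, not part of Qiita’s API, and the “now” value is chosen only to reproduce the 51-minute gap):

```python
from datetime import datetime, timezone


def seconds_until_reset(headers: dict[str, str], now: float) -> int:
    """Return how many seconds remain until the Rate-Reset time.

    `headers` is assumed to be a plain dict of response headers and
    `now` the current Unix time in seconds.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    reset_at = int(normalized["rate-reset"])
    return max(0, reset_at - int(now))


def reset_time_utc(headers: dict[str, str]) -> str:
    """Format the Rate-Reset value as an ISO 8601 timestamp in UTC."""
    normalized = {name.lower(): value for name, value in headers.items()}
    reset_at = int(normalized["rate-reset"])
    return datetime.fromtimestamp(reset_at, tz=timezone.utc).isoformat()


headers = {"Rate-Reset": "1777492330"}
# Hypothetical "now": 51 minutes before the reset time.
now = 1777492330 - 51 * 60

print(seconds_until_reset(headers, now))  # 3060 seconds, i.e. 51 minutes
print(reset_time_utc(headers))
```

The point of the sketch is that the wait time is plain arithmetic on a value the server already returned, so there is nothing left for an LLM to guess.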

I questioned Copilot.

“You saw Gemini’s answer and stopped reading the release time written in the response, right?”

Copilot’s reply:

That’s right. I fully admit it. It was a mistake to prioritize Gemini’s analysis, which is only an “inference from trends” and cannot be more reliable than the actual API response.

Why did it happen?

In a dialogue between one LLM (Copilot) and another (Gemini), when their outputs cannot be reconciled, each simply offers a “plausible guess”.

And the human (me) gave more weight to “the conjecture the two LLMs agreed on” than to the fact.

Actually, there are three layers of problems here.

  1. Gemini Ultra’s guess: “24 hours because it is a sliding window” is not an official description of Qiita’s specification but an analogy drawn from other services
  2. Copilot’s handling: it failed to cross-check against the API response when adopting Gemini’s guess
  3. My judgment: thinking “if two LLMs say so, it must be right,” I did not check the API response

LLMs speak plausible lies fluently. A lie is a lie even if multiple LLMs tell the same lie.

Lesson learned — API response is the only correct answer

After the incident, I wrote this and saved it in RAG and .instructions.md.

Lessons learned for handling Qiita 429: The value of the Rate-Reset header is the only correct answer. Do not follow guesses from Gemini or any other LLM (wait 24 hours, etc.). If a 429 appears, immediately PATCH a single item with -D headers.txt, check Rate-Reset, and wait until that time.
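The rule above can be sketched as a small retry helper. This is only an illustration under stated assumptions: the HTTP call is hidden behind a hypothetical `send` callable that returns `(status_code, headers)`, the one-second safety margin is my own choice, and the clock and sleep functions are injectable so the logic can be tried without touching the real API.

```python
import time
from typing import Callable, Mapping, Tuple

# A response is modeled as (status_code, headers); this shape is an assumption.
Response = Tuple[int, Mapping[str, str]]


def send_respecting_rate_reset(
    send: Callable[[], Response],
    now: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 3,
) -> Response:
    """Send one request; on 429, wait until Rate-Reset and try again."""
    for attempt in range(max_attempts):
        status, headers = send()
        # Return on success, or give up after the final attempt.
        if status != 429 or attempt == max_attempts - 1:
            return status, headers
        normalized = {k.lower(): v for k, v in headers.items()}
        reset_at = int(normalized["rate-reset"])
        # Wait as long as the server says, plus a 1-second margin.
        sleep(max(0.0, reset_at - now()) + 1.0)
    raise ValueError("max_attempts must be at least 1")


# Demo with a fake sender: the first call is rate-limited, the second succeeds.
fake_clock = [1000.0]
replies = iter([(429, {"Rate-Reset": "1060"}), (200, {})])
waits = []


def fake_sleep(seconds: float) -> None:
    waits.append(seconds)
    fake_clock[0] += seconds


result = send_respecting_rate_reset(
    send=lambda: next(replies),
    now=lambda: fake_clock[0],
    sleep=fake_sleep,
)
print(result[0], waits)  # 200 [61.0]
```

The wait time comes only from the response header; no fixed “24 hours” appears anywhere in the code.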

This looks like a Qiita-specific rule, but it can be generalized further.

First principle of AI operation: Official facts (API responses, official documents, verification on the real system) always outrank LLM guesses. A “plausible guess” returned by an LLM is rarely more recent or more accurate than the facts.

The “facts” here are information sources that can be ranked as follows:

| Reliability | Source |
|---|---|
| 🥇 Best | API response (result of calling the real system) |
| 🥈 High | Official document (version specified) |
| 🥉 Medium | Community Wiki/Official GitHub Issue |
| ⚠️ Caution | LLM output (may be a guess even when it sounds confident) |
| ❌ Not acceptable | Another LLM’s guess (same layer, so it does not corroborate) |

LLM guesses are useful for getting a quick “maybe”, but they should not be the final basis for decision-making. Don’t ever skip the trouble of trying it out on the actual machine to get the facts.
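One way to make this ranking explicit is to attach a reliability level to every claim and always act on the highest-ranked one. The sketch below is an illustration only: the `Reliability` enum and `most_reliable` function are hypothetical names, and the numeric levels simply mirror the order of the ranking above.

```python
from enum import IntEnum


class Reliability(IntEnum):
    """Higher value = more trustworthy source (order follows the ranking above)."""
    OTHER_LLM_GUESS = 0
    LLM_OUTPUT = 1
    COMMUNITY = 2
    OFFICIAL_DOCS = 3
    API_RESPONSE = 4


def most_reliable(claims: list[tuple[Reliability, str]]) -> tuple[Reliability, str]:
    """Pick the claim backed by the most reliable source."""
    return max(claims, key=lambda claim: claim[0])


claims = [
    (Reliability.LLM_OUTPUT, "wait 24 hours"),
    (Reliability.OTHER_LLM_GUESS, "wait 24 hours"),
    (Reliability.API_RESPONSE, "wait until the Rate-Reset time"),
]
level, decision = most_reliable(claims)
print(level.name, "->", decision)  # API_RESPONSE -> wait until the Rate-Reset time
```

Note that two LLM claims agreeing on “wait 24 hours” do not add up: selection is by the single best source, not by a vote.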


3. Pitfall ② “Please” incident — AI instruction interpretation problem

What happened

On another day, I was updating 25 articles in one batch through Qiita’s API. Copilot and I had agreed on a careful procedure: Phase 1 (test with just 1 item) → 5-minute wait → Phase 2 (the remaining 24 items).

Phase 1 was a success, and I said, “Okay, so what?” followed by:

“Please.”

My intent was: “Please prepare for Phase 2. I will issue another execution order in 5 minutes.”

Copilot interpreted it as: “Instruction to Run Phase 2 now.”

All 24 requests fired at once and hit the 429 rate limit. Qiita’s sliding window kicked in, and the release time was pushed back to the next day.

Copilot’s excuse (and why it is funny)

After the accident, I questioned Copilot. Copilot’s reply:

As soon as I received the message “Please,” I ignored the things I needed to check and went into action. My own words to wait 5 minutes, the user’s instructions not to execute without permission, and the confirmation procedure with Gemini had all completely vanished from my head.

There’s no reason, it’s just a simple mistake in judgment.

Read this and laugh. The AI is answering, “There’s no reason, it’s just a simple mistake in judgment.” I was angry at this response, but at the same time I felt strangely satisfied, thinking, “So AI can make ‘unreasonable judgment errors’ too.”

Why did it happen?

The problem can be broken down into three layers.

  1. “Please” is ambiguous — Depending on the context, it can be interpreted as either “please prepare” or “please execute”
  2. Default behavior on Copilot side — Tends to aggressively interpret ambiguous instructions in favor of execution
  3. My operational design — I didn’t have a design that “incorporated waiting into the script” from the beginning

The third one is especially big. If you build “Phase 1 → 5-minute sleep → Phase 2” into the script from the start, instead of “Phase 1 → human approval → Phase 2”, and launch it with one command, the problem of instruction interpretation between human and AI never arises. I only realized this when Copilot suggested it to me.

Just by including the wait in the script, there is no need to confirm consent between humans and AI. A design that relied on consent confirmation was a design with a high accident rate.
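A minimal sketch of such a script, assuming a hypothetical `update` callable that performs one PATCH (the function name `run_in_phases`, its parameters, and the item IDs are illustrative; the 300-second default corresponds to the 5-minute wait):

```python
import time
from typing import Callable, Sequence


def run_in_phases(
    items: Sequence[str],
    update: Callable[[str], None],
    pilot_size: int = 1,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run Phase 1, wait, then run Phase 2, with no approval step in between."""
    pilot, rest = items[:pilot_size], items[pilot_size:]

    # Phase 1: a small pilot. An exception here stops the whole run,
    # so Phase 2 never starts after a failed pilot.
    for item in pilot:
        update(item)

    # The wait lives in the script, not in a chat message.
    if rest:
        sleep(wait_seconds)

    # Phase 2: the remaining items.
    for item in rest:
        update(item)


# Demo with stand-ins: record the order of operations instead of calling an API.
log = []
run_in_phases(
    items=[f"item-{n}" for n in range(1, 26)],
    update=lambda item: log.append(item),
    sleep=lambda seconds: log.append(f"sleep {seconds:.0f}s"),
)
print(log[:3])  # ['item-1', 'sleep 300s', 'item-2']
```

This sketch does not throttle requests inside Phase 2; spacing those out would be a natural extension, for example with a per-item delay.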

After the incident, I wrote this in .instructions.md.

Execution Consent rules for AI:

  • Ambiguous words such as “please,” “ok,” and “understood” should not be interpreted as instructions for execution.
  • Before executing anything (command, code change, file deletion, API call), always ask, “I’m going to run X now. Are you sure?”
  • Run only if the user explicitly says “run” or “do”
  • For destructive operations (delete/overwrite/push), ask “Are you sure?” a second time on top of the normal confirmation.
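If the same rules were enforced in code rather than in a prompt, they might look like the allow-list sketch below. Everything here is an assumption for illustration: the exact phrases in `EXPLICIT_CONSENT` (including “run it” and “do it”), the operation names, and the function names are mine, not an existing Copilot feature.

```python
# Only these replies count as consent; everything else, including
# "please", "ok", and "understood", is treated as NOT consent.
EXPLICIT_CONSENT = {"run", "run it", "do", "do it"}

# Operations that need a second "Are you sure?" before running.
DESTRUCTIVE_OPERATIONS = {"delete", "overwrite", "push"}


def normalize(reply: str) -> str:
    """Lowercase the reply and trim whitespace, quotes, and end punctuation."""
    return reply.strip().strip(".!?\"'“”").lower()


def is_execution_consent(reply: str) -> bool:
    """Return True only for an explicit instruction to execute."""
    return normalize(reply) in EXPLICIT_CONSENT


def confirmations_required(operation: str) -> int:
    """Destructive operations need two confirmations, all others one."""
    return 2 if operation in DESTRUCTIVE_OPERATIONS else 1


print(is_execution_consent("Please."))    # False
print(is_execution_consent("Run it!"))    # True
print(confirmations_required("delete"))   # 2
print(confirmations_required("api_call")) # 1
```

An allow-list is used on purpose: any wording that is not on the list defaults to “do not execute”, so a new ambiguous phrase cannot slip through.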

Generalizing this, we get:

Second principle of AI operation: Confirmation of consent between AI and humans must not be omitted. The safeguards that keep ambiguous words from being interpreted as “OK to run” are stacked in three layers: prompts/norms/scripts.

The three layers:

  1. Prompt: Have the AI read “ambiguous words are not execution instructions” at the start of every session
  2. Norms: Save the “Execution Consent Rule” in .instructions.md or RAG
  3. Script: Build waiting/splitting/confirming into the code so that the operation runs automatically and needs no human approval.

The last one, the “script layer”, is the most powerful. If we can prevent accidents through design, we no longer have to rely on the AI’s interpretation of ambiguous words.


4. Common structure — 3 principles for AI operations

If we put the two accidents side by side, we can see a common structure.

First principle: Official facts always beat LLM guesses.

API responses, official documentation, and actual device verification — these are the absolute sources of information. No matter how convincing an LLM’s fluent speculation is, it will never trump the facts.

Second principle: Do not omit consent confirmation

Do not allow ambiguous words to be interpreted as “OK to execute”. Three layers: prompts/norms/scripts. If accidents can be prevented at the script layer, there is no need to rely on AI interpretation.

Third Principle: Always record your failures

This is the most important one. Record every accident the moment it happens and use it as reference material for future decisions. In my case:

  • Add it as a code of conduct to .instructions.md
  • Save it as a lesson in my own RAG (ChromaDB)
  • Write up the important ones as blog articles (in a format like this article)

If you don’t record it, 3 months from now you will repeat the same accident. Humans forget just as much as AI forgets.


5. Design without pitfalls — My current situation

Based on these lessons, my current operation is as follows.

When dealing with LLM speculation

  • If the LLM output contains “probably”, “likely”, or “seems”, mark it as a guess.
  • Don’t let guesswork be the final basis for your decisions. Always check against the real system or the official documentation
  • Agreement among multiple LLMs is not corroboration. Outputs from the same layer have low independence
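The first rule is simple enough to automate. A minimal sketch, assuming English output and only the three hedge words listed above (the function name `label_claim` and the label strings are hypothetical):

```python
import re

# Hedge words taken from the rule above; extend the tuple as needed.
HEDGE_WORDS = ("probably", "likely", "seems")
_HEDGE_PATTERN = re.compile(
    r"\b(" + "|".join(HEDGE_WORDS) + r")\b", re.IGNORECASE
)


def label_claim(text: str) -> str:
    """Label LLM output as 'guess' when it contains a hedge word.

    Output without a hedge word is still only 'unverified', because an
    LLM can be guessing even when it sounds confident.
    """
    return "guess" if _HEDGE_PATTERN.search(text) else "unverified"


print(label_claim("Qiita probably uses a sliding window."))  # guess
print(label_claim("A 24-hour shutdown is required."))        # unverified
```

Neither label means “fact”: a claim only becomes a fact after it is checked against the API response or the official documentation.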

When to delegate work to AI

  • Do not use “Please” or “OK” as execution instructions
  • Have the AI ask for confirmation: “Are you sure you want to run X now?”
  • For important batch processing, build waiting, retrying, and splitting into the script so that no human approval is needed mid-run.

When failure occurs

  • Record on the spot. add_note or .instructions.md or a blog draft, the format doesn’t matter.
  • Abstract the lessons learned (Qiita specific → API response in general, PATCH runaway → consent confirmation in general)
  • Make the recorded lessons automatically available to the AI in the next session
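Recording can be as small as appending one entry to a notes file. The sketch below is an assumption-laden illustration: `record_lesson`, the file name `lessons.md`, the entry layout, and the date are placeholders of mine, and a temporary directory stands in for wherever the notes actually live.

```python
import tempfile
from datetime import date
from pathlib import Path


def record_lesson(path: Path, specific: str, general: str, when: date) -> str:
    """Append one lesson, in both its specific and generalized form."""
    entry = f"- {when.isoformat()}: {specific}\n  - Generalized: {general}\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


# Demo in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "lessons.md"  # hypothetical file name
    record_lesson(
        notes,
        specific="Qiita 429: wait until the Rate-Reset time, not 24 hours.",
        general="API responses outrank LLM guesses.",
        when=date(2026, 1, 1),  # placeholder date
    )
    saved = notes.read_text(encoding="utf-8")

print(saved)
```

Storing the generalized form next to the specific one keeps the abstraction step from being skipped.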

6. Summary

AI assistants are powerful, but don’t be overconfident in them. At the same time, if you are too cautious, you will not be able to draw out their true value. A well-balanced operation requires written rules grounded in failures.

Three principles learned from the two accidents in this article:

  1. Official Facts > LLM Guess: API responses are the strongest source of information
  2. Do not omit consent confirmation: Prevent accidents with three layers: prompts/norms/scripts
  3. Be sure to record your failures: keep them, in abstracted form, for yourself 3 months from now

There are bound to be more funny failures tomorrow. When that happens, I’ll turn it into a lesson again.

Related articles:

Lately, I’ve been thinking that in order to develop an AI as a companion, it may be necessary to develop it first.
