fix: cap embedding batch size to provider limit and sanitize lone surrogates #8928

Open
zhangli091011 wants to merge 1 commit into AstrBotDevs:master from zhangli091011:fix/kb-embedding-batch-and-surrogate

@zhangli091011 zhangli091011 commented Jun 20, 2026

Summary

Fixes two bugs that cause PDF document upload failures in the knowledge base.

Bug 1: Embedding batch size exceeds provider limit

The DashScope (Alibaba Cloud Bailian) embedding API rejects requests with a batch size greater than 10, but the default batch_size was 32. Uploads therefore failed with InternalError.Algo.InvalidParameter: batch size is invalid, it should not be larger than 10.

Fix: Added max_batch_size property to EmbeddingProvider base class (default 10), and get_embeddings_batch now automatically caps the batch size to this limit.
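
The capping behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the AstrBot implementation: `SketchEmbeddingProvider`, `embed_many`, and the fake `_embed` call are hypothetical stand-ins, and only the names `max_batch_size` and `batch_size` come from the PR description.

```python
import asyncio


class SketchEmbeddingProvider:
    """Hypothetical stand-in for an embedding provider base class."""

    def __init__(self) -> None:
        # Records the size of each simulated API call, for inspection.
        self.calls: list[int] = []

    @property
    def max_batch_size(self) -> int:
        # Default cap from the first revision of this PR; subclasses may override.
        return 10

    async def _embed(self, texts: list[str]) -> list[list[float]]:
        # Placeholder for a real API request: one fake 1-d vector per text.
        self.calls.append(len(texts))
        return [[float(len(t))] for t in texts]

    async def embed_many(
        self, texts: list[str], batch_size: int = 32
    ) -> list[list[float]]:
        # Clamp the requested batch size to the provider limit before splitting.
        effective = max(1, min(batch_size, self.max_batch_size))
        vectors: list[list[float]] = []
        for start in range(0, len(texts), effective):
            vectors.extend(await self._embed(texts[start:start + effective]))
        return vectors


provider = SketchEmbeddingProvider()
result = asyncio.run(provider.embed_many([f"chunk {i}" for i in range(25)]))
```

With the default `batch_size` of 32 and a cap of 10, the 25 inputs are sent as three calls of 10, 10, and 5 texts, so no single request exceeds the provider limit.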

Bug 2: Lone surrogate characters break UTF-8 encoding

PDF-parsed text can contain isolated UTF-16 surrogates (e.g., \ud83d from a broken emoji surrogate pair). These cannot be encoded as UTF-8, so sending the text to the embedding API fails with 'utf-8' codec can't encode character '\ud83d'.

Fix: Sanitize text chunks with encode('utf-8', errors='replace').decode('utf-8') before passing them to the embedding pipeline.
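
A small self-contained demonstration of the failure and of the round-trip named in the fix. The helper name `sanitize_for_utf8` and the sample string are assumptions made for illustration; the `encode(..., errors='replace').decode(...)` expression is the one quoted in the PR description.

```python
def sanitize_for_utf8(text: str) -> str:
    """Replace code points that UTF-8 cannot encode (lone surrogates) with '?'."""
    return text.encode("utf-8", errors="replace").decode("utf-8")


# A lone high surrogate, as might be left behind by a split emoji.
broken = "report \ud83d done"

try:
    broken.encode("utf-8")  # strict encoding rejects the lone surrogate
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

cleaned = sanitize_for_utf8(broken)
```

Note that when encoding, Python's `replace` error handler substitutes an ASCII `?` rather than U+FFFD, so `cleaned` is `"report ? done"`. Well-formed text, including complete emoji, passes through unchanged.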

Changes

| File | Change |
| --- | --- |
| astrbot/core/provider/provider.py | Add max_batch_size property + logger import; enforce cap in get_embeddings_batch |
| astrbot/core/knowledge_base/kb_helper.py | Sanitize lone surrogates from parsed text chunks |

Testing

  • Reproduced the original errors by uploading PDF files with the DashScope embedding provider
  • After the fix, both files uploaded successfully
  • Verified that other embedding providers (OpenAI-compatible) are unaffected — max_batch_size can be overridden by subclasses if needed

Summary by Sourcery

Cap embedding batch size to provider-specific limits and sanitize invalid text chunks to prevent knowledge base upload failures.

Bug Fixes:

  • Prevent embedding requests from exceeding provider-imposed batch size limits by capping the batch_size value per provider configuration.
  • Avoid UTF-8 encoding errors during document upload by sanitizing lone surrogate characters in extracted text before sending to the embedding pipeline.

Enhancements:

  • Introduce a configurable max_batch_size setting for embedding providers with sensible defaults across built-in configs, including DashScope/Alibaba Cloud Bailian.
  • Clarify document upload batch size behavior in the dashboard by updating the batch size field hint text to reference provider max_batch_size limits.

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 20, 2026

vercel Bot commented Jun 20, 2026

@zhangli091011 is attempting to deploy a commit to the soulter's projects Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot Bot added area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. feature:knowledge-base The bug / feature is about knowledge base labels Jun 20, 2026

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • Consider making the default max_batch_size less restrictive (e.g., matching current defaults) and overriding it only in providers like DashScope to avoid unintentionally reducing throughput for all existing providers.
  • The UTF-8 sanitization in upload_document silently replaces invalid characters; if this is a concern, you might want to centralize this into a helper that can optionally log or count replacements so issues with upstream text extraction can be detected.
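
One possible shape for the centralized helper suggested in the second point, offered as a sketch rather than as code from this PR. The function name `sanitize_and_count` and the logger name are hypothetical. Unlike the `encode(..., errors='replace')` round-trip in the PR, which yields `?`, this variant substitutes U+FFFD so that replacements remain distinguishable from real question marks.

```python
import logging

logger = logging.getLogger("kb_sanitize_sketch")  # hypothetical logger name


def sanitize_and_count(text: str) -> tuple[str, int]:
    """Replace lone surrogates with U+FFFD and report how many were replaced.

    In a Python str, any code point in U+D800..U+DFFF is a lone surrogate,
    because valid pairs are stored as a single code point above U+FFFF.
    """
    count = 0
    out: list[str] = []
    for ch in text:
        if 0xD800 <= ord(ch) <= 0xDFFF:
            out.append("\ufffd")
            count += 1
        else:
            out.append(ch)
    if count:
        logger.warning("Replaced %d lone surrogate(s) in extracted text", count)
    return "".join(out), count


cleaned, replaced = sanitize_and_count("a\ud83db\udc00")
```

Returning the count lets the caller aggregate replacements per document, which would make a misbehaving PDF parser visible in logs or metrics.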

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces UTF-8 sanitization for document uploads to handle lone surrogates and adds a default max_batch_size property (set to 10) to limit batch sizes in embedding API calls. The reviewer pointed out that setting the default limit to 10 in the base class could cause performance regressions for other providers (like OpenAI) and suggested making it configurable or using a higher default value.

Comment thread astrbot/core/provider/provider.py Outdated
Comment on lines +342 to +352
    @property
    def max_batch_size(self) -> int:
        """Maximum batch size per single embedding API call.

        Subclasses may override this when the backend enforces a more restrictive
        limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).

        Returns:
            The maximum number of texts per batch.
        """
        return 10

high

Capping the default max_batch_size to 10 in the base EmbeddingProvider class will cause a significant performance regression for all other embedding providers (such as OpenAI, Ollama, Gemini, etc.) that do not override this property. For instance, OpenAI supports up to 2048 texts per batch, but will now be restricted to 10, resulting in many more API requests and potential rate-limiting issues.

To resolve this, we can make max_batch_size configurable via the provider's configuration (e.g., self.provider_config) and default to a much more reasonable value like 100 or 2048.

Suggested change
-    @property
-    def max_batch_size(self) -> int:
-        """Maximum batch size per single embedding API call.
-        Subclasses may override this when the backend enforces a more restrictive
-        limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).
-        Returns:
-            The maximum number of texts per batch.
-        """
-        return 10
+    @property
+    def max_batch_size(self) -> int:
+        """Maximum batch size per single embedding API call.
+        Subclasses may override this when the backend enforces a more restrictive
+        limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).
+        Returns:
+            The maximum number of texts per batch.
+        """
+        return self.provider_config.get("max_batch_size", 100)
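
To make the throughput concern in this review concrete, the number of API requests for a given cap is a ceiling division. The sketch below is illustrative arithmetic only: the figure of 5000 chunks is a hypothetical document size, and the caps 10, 100, and 2048 are the values mentioned in the comment above.

```python
import math


def requests_needed(num_texts: int, max_batch_size: int) -> int:
    """Number of API calls needed to embed num_texts at a given batch cap."""
    return math.ceil(num_texts / max_batch_size)


chunks = 5000  # hypothetical document size, chosen only for illustration
per_cap = {cap: requests_needed(chunks, cap) for cap in (10, 100, 2048)}
```

For 5000 chunks this gives 500 requests at a cap of 10, 50 at a cap of 100, and 3 at a cap of 2048, which is the gap the reviewer is pointing at.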

@zhangli091011 zhangli091011 force-pushed the fix/kb-embedding-batch-and-surrogate branch from a5b22f6 to a4012e7 Compare June 20, 2026 17:48
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jun 20, 2026
…rogates

- Add max_batch_size property to EmbeddingProvider base class, reading
  from provider_config (default 100). Providers with stricter limits
  (e.g. DashScope = 10) set it in their config.
- get_embeddings_batch enforces this cap before splitting batches.
- Add max_batch_size to all embedding provider default templates, and
  its description/hint to the provider source config metadata schema.
- Update DocumentsTab.vue upload batch_size hint to mention provider cap.
- Sanitize lone surrogates from PDF-parsed text chunks that would
  otherwise cause UTF-8 encoding failures during embedding API calls.
@zhangli091011 zhangli091011 force-pushed the fix/kb-embedding-batch-and-surrogate branch from a4012e7 to 512eb5f Compare June 20, 2026 18:06
@zhangli091011

@sourcery-ai review

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In get_embeddings_batch, silently capping batch_size at max_batch_size and only logging at debug level may make it hard to detect misconfiguration in production; consider logging at a higher level or validating and clamping batch_size earlier (e.g., when reading the config or initializing the provider).
  • On the dashboard, the batch size field now mentions the provider max_batch_size limit in the hint, but still allows arbitrary numbers; consider constraining or auto-adjusting the UI value based on the current provider’s max_batch_size to avoid user confusion.
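
The first point, validating and clamping earlier with a more visible log level, might look something like the sketch below. It is not taken from the PR: `resolve_batch_size`, the logger name, and the choice to raise `ValueError` on non-positive values are assumptions. The `max_batch_size` config key and the default of 100 follow the revised commit message.

```python
import logging

logger = logging.getLogger("embedding_config_sketch")  # hypothetical logger name

DEFAULT_MAX_BATCH_SIZE = 100  # default named in the revised commit message


def resolve_batch_size(requested: int, provider_config: dict) -> int:
    """Validate and clamp a requested batch size once, at configuration time."""
    cap = int(provider_config.get("max_batch_size", DEFAULT_MAX_BATCH_SIZE))
    if cap < 1:
        raise ValueError(f"max_batch_size must be >= 1, got {cap}")
    if requested < 1:
        raise ValueError(f"batch_size must be >= 1, got {requested}")
    if requested > cap:
        # Warn rather than debug-log, so the clamp is visible in production.
        logger.warning(
            "batch_size %d exceeds max_batch_size %d; using %d",
            requested, cap, cap,
        )
        return cap
    return requested
```

Resolving the value once, for example when the provider is initialized, means every later batch call sees an already valid size and the warning is emitted a single time instead of on every upload.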
