fix: cap embedding batch size to provider limit and sanitize lone surrogates by zhangli091011 · Pull Request #8928 · AstrBotDevs/AstrBot

zhangli091011 · 2026-06-20T17:26:04Z

Summary

Fixes two bugs that cause PDF document upload failures in the knowledge base.

Bug 1: Embedding batch size exceeds provider limit

DashScope (Alibaba Cloud Bailian) embedding API rejects requests with batch size > 10, but the default batch_size was 32. This caused InternalError.Algo.InvalidParameter: batch size is invalid, it should not be larger than 10.

Fix: Added max_batch_size property to EmbeddingProvider base class (default 10), and get_embeddings_batch now automatically caps the batch size to this limit.

Bug 2: Lone surrogate characters break UTF-8 encoding

PDF-parsed text can contain isolated UTF-16 surrogates (e.g., \ud83d from broken emoji codepoints) which cannot be UTF-8 encoded, causing 'utf-8' codec can't encode character '\ud83d' when sending to the embedding API.

Fix: Sanitize text chunks with encode('utf-8', errors='replace').decode('utf-8') before passing them to the embedding pipeline.

Changes

File	Change
`astrbot/core/provider/provider.py`	Add `max_batch_size` property + `logger` import; enforce cap in `get_embeddings_batch`
`astrbot/core/knowledge_base/kb_helper.py`	Sanitize lone surrogates from parsed text chunks

Testing

Reproduced the original errors by uploading PDF files with the DashScope embedding provider
After the fix, both files uploaded successfully
Verified that other embedding providers (OpenAI-compatible) are unaffected — max_batch_size can be overridden by subclasses if needed

Summary by Sourcery

Cap embedding batch size to provider-specific limits and sanitize invalid text chunks to prevent knowledge base upload failures.

Bug Fixes:

Prevent embedding requests from exceeding provider-imposed batch size limits by capping the batch_size value per provider configuration.
Avoid UTF-8 encoding errors during document upload by sanitizing lone surrogate characters in extracted text before sending to the embedding pipeline.

Enhancements:

Introduce a configurable max_batch_size setting for embedding providers with sensible defaults across built-in configs, including DashScope/Alibaba Cloud Bailian.
Clarify document upload batch size behavior in the dashboard by updating the batch size field hint text to reference provider max_batch_size limits.

vercel · 2026-06-20T17:26:09Z

@zhangli091011 is attempting to deploy a commit to the soulter's projects Team on Vercel.

A member of the Team first needs to authorize it.

sourcery-ai

Hey - I've left some high level feedback:

Consider making the default max_batch_size less restrictive (e.g., matching current defaults) and overriding it only in providers like DashScope to avoid unintentionally reducing throughput for all existing providers.
The UTF-8 sanitization in upload_document silently replaces invalid characters; if this is a concern, you might want to centralize this into a helper that can optionally log or count replacements so issues with upstream text extraction can be detected.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Consider making the default `max_batch_size` less restrictive (e.g., matching current defaults) and overriding it only in providers like DashScope to avoid unintentionally reducing throughput for all existing providers.
- The UTF-8 sanitization in `upload_document` silently replaces invalid characters; if this is a concern, you might want to centralize this into a helper that can optionally log or count replacements so issues with upstream text extraction can be detected.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

gemini-code-assist

Code Review

This pull request introduces UTF-8 sanitization for document uploads to handle lone surrogates and adds a default max_batch_size property (set to 10) to limit batch sizes in embedding API calls. The reviewer pointed out that setting the default limit to 10 in the base class could cause performance regressions for other providers (like OpenAI) and suggested making it configurable or using a higher default value.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-20T17:27:29Z

+    @property
+    def max_batch_size(self) -> int:
+        """Maximum batch size per single embedding API call.
+
+        Subclasses may override this when the backend enforces a more restrictive
+        limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).
+
+        Returns:
+            The maximum number of texts per batch.
+        """
+        return 10


Capping the default max_batch_size to 10 in the base EmbeddingProvider class will cause a significant performance regression for all other embedding providers (such as OpenAI, Ollama, Gemini, etc.) that do not override this property. For instance, OpenAI supports up to 2048 texts per batch, but will now be restricted to 10, resulting in many more API requests and potential rate-limiting issues.

To resolve this, we can make max_batch_size configurable via the provider's configuration (e.g., self.provider_config) and default to a much more reasonable value like 100 or 2048.

Suggested change

@property

def max_batch_size(self) -> int:

"""Maximum batch size per single embedding API call.

Subclasses may override this when the backend enforces a more restrictive

limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).

Returns:

The maximum number of texts per batch.

"""

return 10

@property

def max_batch_size(self) -> int:

"""Maximum batch size per single embedding API call.

Subclasses may override this when the backend enforces a more restrictive

limit than the default (e.g., DashScope/Alibaba Cloud limits to 10).

Returns:

The maximum number of texts per batch.

"""

return self.provider_config.get("max_batch_size", 100)

…rogates - Add max_batch_size property to EmbeddingProvider base class, reading from provider_config (default 100). Providers with stricter limits (e.g. DashScope = 10) set it in their config. - get_embeddings_batch enforces this cap before splitting batches. - Add max_batch_size to all embedding provider default templates, and its description/hint to the provider source config metadata schema. - Update DocumentsTab.vue upload batch_size hint to mention provider cap. - Sanitize lone surrogates from PDF-parsed text chunks that would otherwise cause UTF-8 encoding failures during embedding API calls.

zhangli091011 · 2026-06-20T18:09:06Z

@sourcery-ai review

sourcery-ai

Hey - I've left some high level feedback:

In get_embeddings_batch, silently capping batch_size at max_batch_size and only logging at debug level may make it hard to detect misconfiguration in production; consider logging at a higher level or validating and clamping batch_size earlier (e.g., when reading the config or initializing the provider).
On the dashboard, the batch size field now mentions the provider max_batch_size limit in the hint, but still allows arbitrary numbers; consider constraining or auto-adjusting the UI value based on the current provider’s max_batch_size to avoid user confusion.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `get_embeddings_batch`, silently capping `batch_size` at `max_batch_size` and only logging at debug level may make it hard to detect misconfiguration in production; consider logging at a higher level or validating and clamping `batch_size` earlier (e.g., when reading the config or initializing the provider).
- On the dashboard, the batch size field now mentions the provider `max_batch_size` limit in the hint, but still allows arbitrary numbers; consider constraining or auto-adjusting the UI value based on the current provider’s `max_batch_size` to avoid user confusion.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 20, 2026

dosubot Bot added area:provider The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner. feature:knowledge-base The bug / feature is about knowledge base labels Jun 20, 2026

sourcery-ai Bot reviewed Jun 20, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 20, 2026

View reviewed changes

zhangli091011 force-pushed the fix/kb-embedding-batch-and-surrogate branch from a5b22f6 to a4012e7 Compare June 20, 2026 17:48

dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jun 20, 2026

zhangli091011 force-pushed the fix/kb-embedding-batch-and-surrogate branch from a4012e7 to 512eb5f Compare June 20, 2026 18:06

sourcery-ai Bot reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: cap embedding batch size to provider limit and sanitize lone surrogates#8928

fix: cap embedding batch size to provider limit and sanitize lone surrogates#8928
zhangli091011 wants to merge 1 commit into
AstrBotDevs:masterfrom
zhangli091011:fix/kb-embedding-batch-and-surrogate

zhangli091011 commented Jun 20, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

vercel Bot commented Jun 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

zhangli091011 commented Jun 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

zhangli091011 commented Jun 20, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug 1: Embedding batch size exceeds provider limit

Bug 2: Lone surrogate characters break UTF-8 encoding

Changes

Testing

Summary by Sourcery

Uh oh!

vercel Bot commented Jun 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

zhangli091011 commented Jun 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhangli091011 commented Jun 20, 2026 •

edited by sourcery-ai Bot

Loading