fix: cap embedding batch size to provider limit and sanitize lone surrogates#8928
Conversation
|
@zhangli091011 is attempting to deploy a commit to the soulter's projects Team on Vercel. A member of the Team first needs to authorize it. |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Consider making the default
max_batch_sizeless restrictive (e.g., matching current defaults) and overriding it only in providers like DashScope to avoid unintentionally reducing throughput for all existing providers. - The UTF-8 sanitization in
upload_documentsilently replaces invalid characters; if this is a concern, you might want to centralize this into a helper that can optionally log or count replacements so issues with upstream text extraction can be detected.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider making the default `max_batch_size` less restrictive (e.g., matching current defaults) and overriding it only in providers like DashScope to avoid unintentionally reducing throughput for all existing providers.
- The UTF-8 sanitization in `upload_document` silently replaces invalid characters; if this is a concern, you might want to centralize this into a helper that can optionally log or count replacements so issues with upstream text extraction can be detected.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request introduces UTF-8 sanitization for document uploads to handle lone surrogates and adds a default max_batch_size property (set to 10) to limit batch sizes in embedding API calls. The reviewer pointed out that setting the default limit to 10 in the base class could cause performance regressions for other providers (like OpenAI) and suggested making it configurable or using a higher default value.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| @property | ||
| def max_batch_size(self) -> int: | ||
| """Maximum batch size per single embedding API call. | ||
|
|
||
| Subclasses may override this when the backend enforces a more restrictive | ||
| limit than the default (e.g., DashScope/Alibaba Cloud limits to 10). | ||
|
|
||
| Returns: | ||
| The maximum number of texts per batch. | ||
| """ | ||
| return 10 |
There was a problem hiding this comment.
Capping the default max_batch_size to 10 in the base EmbeddingProvider class will cause a significant performance regression for all other embedding providers (such as OpenAI, Ollama, Gemini, etc.) that do not override this property. For instance, OpenAI supports up to 2048 texts per batch, but will now be restricted to 10, resulting in many more API requests and potential rate-limiting issues.
To resolve this, we can make max_batch_size configurable via the provider's configuration (e.g., self.provider_config) and default to a much more reasonable value like 100 or 2048.
| @property | |
| def max_batch_size(self) -> int: | |
| """Maximum batch size per single embedding API call. | |
| Subclasses may override this when the backend enforces a more restrictive | |
| limit than the default (e.g., DashScope/Alibaba Cloud limits to 10). | |
| Returns: | |
| The maximum number of texts per batch. | |
| """ | |
| return 10 | |
| @property | |
| def max_batch_size(self) -> int: | |
| """Maximum batch size per single embedding API call. | |
| Subclasses may override this when the backend enforces a more restrictive | |
| limit than the default (e.g., DashScope/Alibaba Cloud limits to 10). | |
| Returns: | |
| The maximum number of texts per batch. | |
| """ | |
| return self.provider_config.get("max_batch_size", 100) |
a5b22f6 to
a4012e7
Compare
…rogates - Add max_batch_size property to EmbeddingProvider base class, reading from provider_config (default 100). Providers with stricter limits (e.g. DashScope = 10) set it in their config. - get_embeddings_batch enforces this cap before splitting batches. - Add max_batch_size to all embedding provider default templates, and its description/hint to the provider source config metadata schema. - Update DocumentsTab.vue upload batch_size hint to mention provider cap. - Sanitize lone surrogates from PDF-parsed text chunks that would otherwise cause UTF-8 encoding failures during embedding API calls.
a4012e7 to
512eb5f
Compare
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
get_embeddings_batch, silently cappingbatch_sizeatmax_batch_sizeand only logging at debug level may make it hard to detect misconfiguration in production; consider logging at a higher level or validating and clampingbatch_sizeearlier (e.g., when reading the config or initializing the provider). - On the dashboard, the batch size field now mentions the provider
max_batch_sizelimit in the hint, but still allows arbitrary numbers; consider constraining or auto-adjusting the UI value based on the current provider’smax_batch_sizeto avoid user confusion.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `get_embeddings_batch`, silently capping `batch_size` at `max_batch_size` and only logging at debug level may make it hard to detect misconfiguration in production; consider logging at a higher level or validating and clamping `batch_size` earlier (e.g., when reading the config or initializing the provider).
- On the dashboard, the batch size field now mentions the provider `max_batch_size` limit in the hint, but still allows arbitrary numbers; consider constraining or auto-adjusting the UI value based on the current provider’s `max_batch_size` to avoid user confusion.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Summary
Fixes two bugs that cause PDF document upload failures in the knowledge base.
Bug 1: Embedding batch size exceeds provider limit
DashScope (Alibaba Cloud Bailian) embedding API rejects requests with batch size > 10, but the default
batch_sizewas 32. This causedInternalError.Algo.InvalidParameter: batch size is invalid, it should not be larger than 10.Fix: Added
max_batch_sizeproperty toEmbeddingProviderbase class (default 10), andget_embeddings_batchnow automatically caps the batch size to this limit.Bug 2: Lone surrogate characters break UTF-8 encoding
PDF-parsed text can contain isolated UTF-16 surrogates (e.g.,
\ud83dfrom broken emoji codepoints) which cannot be UTF-8 encoded, causing'utf-8' codec can't encode character '\ud83d'when sending to the embedding API.Fix: Sanitize text chunks with
encode('utf-8', errors='replace').decode('utf-8')before passing them to the embedding pipeline.Changes
astrbot/core/provider/provider.pymax_batch_sizeproperty +loggerimport; enforce cap inget_embeddings_batchastrbot/core/knowledge_base/kb_helper.pyTesting
max_batch_sizecan be overridden by subclasses if neededSummary by Sourcery
Cap embedding batch size to provider-specific limits and sanitize invalid text chunks to prevent knowledge base upload failures.
Bug Fixes:
Enhancements: