Feat/add table knowledge base with column-level preprocessing#8939
Feat/add table knowledge base with column-level preprocessing#8939first-dong wants to merge 5 commits into
Conversation
|
@first-dong is attempting to deploy a commit to the soulter's projects Team on Vercel. A member of the Team first needs to authorize it. |
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- The
_read_single_uploadhelper’s docstring and return type are misleading (it claims to return(file_name, file_content)but actually returns anUploadFile), which can cause misuse later; either return the documented tuple or update the signature/docstring to match the actual behavior. - In
import_table, the imported file is not checked against the knowledge base type (e.g.kb.kb_type === 'table'), so it may be worth enforcing that table imports only target table-type knowledge bases to avoid inconsistent data. - The temporary file save/read/unlink logic in
preview_tableandimport_tableis nearly identical; consider extracting this into a shared helper to reduce duplication and keep file-handling behavior consistent.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `_read_single_upload` helper’s docstring and return type are misleading (it claims to return `(file_name, file_content)` but actually returns an `UploadFile`), which can cause misuse later; either return the documented tuple or update the signature/docstring to match the actual behavior.
- In `import_table`, the imported file is not checked against the knowledge base type (e.g. `kb.kb_type === 'table'`), so it may be worth enforcing that table imports only target table-type knowledge bases to avoid inconsistent data.
- The temporary file save/read/unlink logic in `preview_table` and `import_table` is nearly identical; consider extracting this into a shared helper to reduce duplication and keep file-handling behavior consistent.
## Individual Comments
### Comment 1
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="611-620" />
<code_context>
raise
+ @staticmethod
+ def _format_row_text(columns: list[dict], row: dict[str, str]) -> str:
+ """Format selected columns of a row as ``name: value`` lines.
</code_context>
<issue_to_address>
**issue:** The helper `_read_single_upload`'s return type and docstring do not match its actual return value.
Downstream code treats the return value as an upload file object (using `.filename` and `.save`), so the current type/doc mismatch makes the API misleading and error-prone. Please either align the implementation with the documented `(file_name, file_content)` tuple and update call sites, or update the type hint/docstring (and ideally the helper name) to reflect that it returns the raw upload file object.
</issue_to_address>
### Comment 2
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="750-751" />
<code_context>
+ file_type = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
+ file_size = len(file_content)
+
+ kb_helper = await self.get_kb_manager().get_kb(kb_id)
+ if not kb_helper:
+ raise KnowledgeBaseServiceError("知识库不存在")
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Table import should validate that the target knowledge base is of type `table`.
Right now `import_table` can run for any `kb_id`, including those with `kb_type = 'text'`, which could let table data be written into a text KB and break assumptions elsewhere. After fetching `kb_helper`, add a `kb_helper.kb.kb_type == 'table'` check and raise a `KnowledgeBaseServiceError` if it isn’t, to keep the stored type consistent with the API/UI model.
</issue_to_address>
### Comment 3
<location path="dashboard/src/views/knowledge-base/components/TableImportDialog.vue" line_range="36-37" />
<code_context>
+ </div>
+ </div>
+
+ <v-text-field v-model.number="headerRow" :label="t('table.headerRow')" :hint="t('table.headerRowHint')"
+ persistent-hint type="number" variant="outlined" density="compact" class="mt-4" style="max-width: 240px" />
+ </div>
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Clarify or normalize the header row index between UI and backend to avoid off-by-one issues.
Backend `header_row` is 0-based and passed directly to pandas. In the dialog, `headerRow` is shown as a plain numeric field, so users are likely to enter 1-based row numbers.
To avoid off-by-one errors, either interpret the UI value as 1-based and subtract 1 before calling the API, or update the label/hint to explicitly explain that 0 is the first row, 1 the second, etc., so UI and backend semantics stay aligned.
Suggested implementation:
```
<v-text-field
v-model.number="headerRow"
:label="t('table.headerRow')"
:hint="t('table.headerRowHint')"
persistent-hint
type="number"
variant="outlined"
density="compact"
class="mt-4"
style="max-width: 240px"
min="1"
/>
```
To fully implement the normalization and avoid off-by-one issues, you should also:
1. Find where `headerRow` is sent to the backend (likely in a method that builds the import payload).
2. Change the payload field from using `headerRow` directly to `headerRow - 1`, e.g.:
- `header_row: this.headerRow` → `header_row: this.headerRow - 1`.
3. Optionally, update the i18n hint string (`table.headerRowHint`) to explicitly state that “1 is the first row, 2 the second, …” so the UI clearly documents the 1-based semantics.
This keeps the UI 1-based (natural for users) while keeping the backend/pandas call 0-based.
</issue_to_address>
### Comment 4
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="633" />
<code_context>
+ raise KnowledgeBaseServiceError("缺少文件")
+ return file_list[0]
+
+ async def preview_table(
+ self,
+ *,
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the new table endpoints to share file-reading and multipart-parsing helpers and to clarify `_read_single_upload`’s contract and typing.
The reviewer’s points are valid; the new endpoints duplicate logic and `_read_single_upload` has a misleading contract. You can reduce complexity without changing behavior by:
1. **Extracting shared file-reading logic**
2. **Extracting shared multipart/form parsing/validation**
3. **Fixing `_read_single_upload`’s typing/docstring mismatch**
### 1. Extract a helper to read file bytes
Both `preview_table` and `import_table` repeat the same temp-file logic. Moving it into a helper keeps behavior identical and makes both endpoints shorter and easier to follow:
```python
async def _read_file_bytes(self, file) -> tuple[str, bytes]:
file_name = file.filename
temp_file_path = (
Path(get_astrbot_temp_path()) / f"kb_table_{uuid.uuid4()}_{file_name}"
)
await file.save(temp_file_path)
try:
async with aiofiles.open(temp_file_path, "rb") as file_obj:
file_content = await file_obj.read()
finally:
temp_file_path.unlink(missing_ok=True)
return file_name, file_content
```
Then `preview_table` and `import_table` become:
```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)
```
`preview_table`:
```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)
from astrbot.core.knowledge_base.parsers.table_parser import (
is_table_file,
parse_table,
)
if not is_table_file(file_name):
raise KnowledgeBaseServiceError("不支持的表格格式,请上传 csv/xls/xlsx 文件")
try:
result = parse_table(file_content, file_name, header_row=header_row)
except Exception as exc:
raise KnowledgeBaseServiceError(f"解析表格失败: {exc!s}") from exc
```
`import_table`:
```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)
file_type = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
file_size = len(file_content)
```
This keeps all semantics (temp file, cleanup, same naming), but removes duplicated branches and file-handling logic.
### 2. Extract shared multipart/form validation
The multipart and parameter parsing logic is almost identical between `preview_table` and `import_table`. You can encapsulate this in a small helper that returns parsed primitives:
```python
from dataclasses import dataclass
@dataclass
class TableRequestParams:
kb_id: str
header_row: int
def _parse_table_common_params(self, content_type: str | None, form_data) -> TableRequestParams:
if content_type and "multipart/form-data" not in content_type:
raise KnowledgeBaseServiceError("Content-Type 须为 multipart/form-data")
kb_id = form_data.get("kb_id")
if not kb_id:
raise KnowledgeBaseServiceError("缺少参数 kb_id")
header_row = int(form_data.get("header_row", 0))
return TableRequestParams(kb_id=kb_id, header_row=header_row)
```
Then:
```python
params = self._parse_table_common_params(content_type, form_data)
# preview_table
header_row = params.header_row
# import_table
kb_id = params.kb_id
header_row = params.header_row
```
`import_table` can still keep its own extra fields:
```python
batch_size = int(form_data.get("batch_size", 32))
tasks_limit = int(form_data.get("tasks_limit", 3))
max_retries = int(form_data.get("max_retries", 3))
```
This removes duplicated Content-Type / kb_id / header_row handling and keeps the endpoints focused on their unique logic.
### 3. Fix `_read_single_upload` typing / contract
Right now `_read_single_upload` is annotated as returning `tuple[str, bytes]` and the docstring says it returns `(file_name, file_content)`, but it actually returns an `UploadFile` instance. That mismatch makes the call sites harder to reason about.
You can either:
**Option A – keep returning UploadFile and fix types/docstring:**
```python
from typing import Any
@staticmethod
def _read_single_upload(form_data: Any, files: Any):
"""Return the first UploadFile from the multipart request.
Raises:
KnowledgeBaseServiceError: If no file is present.
"""
file_list = []
for key in files.keys():
if key == "file" or key.startswith("file") or key == "files[]":
file_list.extend(files.getlist(key))
if not file_list:
raise KnowledgeBaseServiceError("缺少文件")
return file_list[0]
```
**Option B – make it actually return `(file_name, file_content)` and adjust callers.**
Since you already have `_read_file_bytes`, Option A plus using that helper (as above) keeps behavior unchanged with minimal refactor.
---
These small extractions make `preview_table` / `import_table` shorter and more linear, reduce duplication, and align the helper method contract with its actual behavior, without removing or altering any features.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request introduces a structured table knowledge base feature, allowing users to import CSV, XLS, and XLSX files where each row is treated as an independent knowledge unit. It includes a new table parser, database migrations for the new kb_type and table_schema fields, backend API endpoints for table preview and import, and corresponding frontend updates. The review feedback is highly constructive, pointing out several opportunities to enhance robustness. Key recommendations include fixing header de-duplication to prevent collisions, catching broader exceptions during CSV encoding detection, handling None values gracefully when formatting row text, refactoring duplicate file-saving logic into a shared helper, and adding defensive validation for numeric parameters and configuration schemas.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
当前知识库主要面向非结构化文档,表格类数据导入后无法保留清晰的行列结构,也无法区分“用于语义检索的字段”和“命中后需要返回的字段”。本次变更新增表格知识库能力,支持按行导入 csv/xls/xlsx 表格,并在导入前进行列级预处理配置,让结构化数据能够更精确地参与知识库检索。
Modifications / 改动点
kb_type。Screenshots or Test Results / 运行截图或测试结果
Summary by Sourcery
Add structured table knowledge base support with column-level preprocessing and import, including backend APIs, vector store enhancements, migrations, and Web UI flows.
New Features:
Enhancements:
Documentation: