Skip to content

Feat/add table knowledge base with column-level preprocessing#8939

Open
first-dong wants to merge 5 commits into
AstrBotDevs:masterfrom
first-dong:feat/knowledge-base-table
Open

Feat/add table knowledge base with column-level preprocessing#8939
first-dong wants to merge 5 commits into
AstrBotDevs:masterfrom
first-dong:feat/knowledge-base-table

Conversation

@first-dong

@first-dong first-dong commented Jun 21, 2026

Copy link
Copy Markdown

当前知识库主要面向非结构化文档,表格类数据导入后无法保留清晰的行列结构,也无法区分“用于语义检索的字段”和“命中后需要返回的字段”。本次变更新增表格知识库能力,支持按行导入 csv/xls/xlsx 表格,并在导入前进行列级预处理配置,让结构化数据能够更精确地参与知识库检索。

Modifications / 改动点

  • 新增表格知识库类型,支持创建文本知识库或表格知识库,并在数据库中持久化 kb_type
  • 新增表格文档解析器,支持 csv/xls/xlsx 文件解析、表头行配置、空表头补齐和重复表头去重。
  • 新增表格导入流程:每一行作为独立知识单元写入知识库,索引列用于向量化检索,返回列用于命中后展示。
  • 扩展向量库批量写入能力,支持将“向量化文本”和“实际存储内容”解耦。
  • 新增表格预览与表格导入接口,并更新 OpenAPI 定义和前端 API 客户端。
  • WebUI 新增表格知识库创建入口、知识库类型标识、表格导入弹窗、数据预览、索引列/返回列配置和后台导入进度展示。
  • 补充中英文和俄文国际化文案。
  • 新增数据库迁移,旧知识库默认保持为文本知识库,不影响已有数据。

Screenshots or Test Results / 运行截图或测试结果

0fcb580f-602e-4359-af84-35d6abfc33ab ffec7c4a-2d4c-4059-812b-0217a1a5093f 271cad3e-6ba2-47b1-84ed-8d74b0dcabe7

Summary by Sourcery

Add structured table knowledge base support with column-level preprocessing and import, including backend APIs, vector store enhancements, migrations, and Web UI flows.

New Features:

  • Introduce a table-type knowledge base alongside text, persisted via a new kb_type field.
  • Add backend support to parse csv/xls/xlsx files into row-based knowledge units with configurable index/return columns.
  • Expose new OpenAPI and dashboard API endpoints to preview tables and import them with column configuration.
  • Add Web UI workflows for creating table knowledge bases and importing tables with preview and column mapping.

Enhancements:

  • Extend vector database batch insertion to decouple embedding text from stored content for table rows.
  • Track table-specific metadata in knowledge base documents and surface table row details when listing chunks.
  • Add a new SQLite migration version to backfill kb_type and table_schema columns for existing databases.

Documentation:

  • Update OpenAPI schema and generated client types to describe table preview/import requests and kb_type configuration.
  • Add i18n strings for table knowledge base features in English, Chinese, and Russian.

@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 21, 2026
@vercel

vercel Bot commented Jun 21, 2026

Copy link
Copy Markdown

@first-dong is attempting to deploy a commit to the soulter's projects Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot Bot added area:webui The bug / feature is about webui(dashboard) of astrbot. feature:knowledge-base The bug / feature is about knowledge base labels Jun 21, 2026

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 4 issues, and left some high level feedback:

  • The _read_single_upload helper’s docstring and return type are misleading (it claims to return (file_name, file_content) but actually returns an UploadFile), which can cause misuse later; either return the documented tuple or update the signature/docstring to match the actual behavior.
  • In import_table, the imported file is not checked against the knowledge base type (e.g. kb.kb_type === 'table'), so it may be worth enforcing that table imports only target table-type knowledge bases to avoid inconsistent data.
  • The temporary file save/read/unlink logic in preview_table and import_table is nearly identical; consider extracting this into a shared helper to reduce duplication and keep file-handling behavior consistent.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `_read_single_upload` helper’s docstring and return type are misleading (it claims to return `(file_name, file_content)` but actually returns an `UploadFile`), which can cause misuse later; either return the documented tuple or update the signature/docstring to match the actual behavior.
- In `import_table`, the imported file is not checked against the knowledge base type (e.g. `kb.kb_type === 'table'`), so it may be worth enforcing that table imports only target table-type knowledge bases to avoid inconsistent data.
- The temporary file save/read/unlink logic in `preview_table` and `import_table` is nearly identical; consider extracting this into a shared helper to reduce duplication and keep file-handling behavior consistent.

## Individual Comments

### Comment 1
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="611-620" />
<code_context>

             raise

+    @staticmethod
+    def _format_row_text(columns: list[dict], row: dict[str, str]) -> str:
+        """Format selected columns of a row as ``name: value`` lines.
</code_context>
<issue_to_address>
**issue:** The helper `_read_single_upload`'s return type and docstring do not match its actual return value.

Downstream code treats the return value as an upload file object (using `.filename` and `.save`), so the current type/doc mismatch makes the API misleading and error-prone. Please either align the implementation with the documented `(file_name, file_content)` tuple and update call sites, or update the type hint/docstring (and ideally the helper name) to reflect that it returns the raw upload file object.
</issue_to_address>

### Comment 2
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="750-751" />
<code_context>
+        file_type = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
+        file_size = len(file_content)
+
+        kb_helper = await self.get_kb_manager().get_kb(kb_id)
+        if not kb_helper:
+            raise KnowledgeBaseServiceError("知识库不存在")
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Table import should validate that the target knowledge base is of type `table`.

Right now `import_table` can run for any `kb_id`, including those with `kb_type = 'text'`, which could let table data be written into a text KB and break assumptions elsewhere. After fetching `kb_helper`, add a `kb_helper.kb.kb_type == 'table'` check and raise a `KnowledgeBaseServiceError` if it isn’t, to keep the stored type consistent with the API/UI model.
</issue_to_address>

### Comment 3
<location path="dashboard/src/views/knowledge-base/components/TableImportDialog.vue" line_range="36-37" />
<code_context>
+            </div>
+          </div>
+
+          <v-text-field v-model.number="headerRow" :label="t('table.headerRow')" :hint="t('table.headerRowHint')"
+            persistent-hint type="number" variant="outlined" density="compact" class="mt-4" style="max-width: 240px" />
+        </div>
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Clarify or normalize the header row index between UI and backend to avoid off-by-one issues.

Backend `header_row` is 0-based and passed directly to pandas. In the dialog, `headerRow` is shown as a plain numeric field, so users are likely to enter 1-based row numbers.

To avoid off-by-one errors, either interpret the UI value as 1-based and subtract 1 before calling the API, or update the label/hint to explicitly explain that 0 is the first row, 1 the second, etc., so UI and backend semantics stay aligned.

Suggested implementation:

```
          <v-text-field
            v-model.number="headerRow"
            :label="t('table.headerRow')"
            :hint="t('table.headerRowHint')"
            persistent-hint
            type="number"
            variant="outlined"
            density="compact"
            class="mt-4"
            style="max-width: 240px"
            min="1"
          />

```

To fully implement the normalization and avoid off-by-one issues, you should also:
1. Find where `headerRow` is sent to the backend (likely in a method that builds the import payload).
2. Change the payload field from using `headerRow` directly to `headerRow - 1`, e.g.:
   - `header_row: this.headerRow``header_row: this.headerRow - 1`.
3. Optionally, update the i18n hint string (`table.headerRowHint`) to explicitly state that “1 is the first row, 2 the second, …” so the UI clearly documents the 1-based semantics.
This keeps the UI 1-based (natural for users) while keeping the backend/pandas call 0-based.
</issue_to_address>

### Comment 4
<location path="astrbot/dashboard/services/knowledge_base_service.py" line_range="633" />
<code_context>
+            raise KnowledgeBaseServiceError("缺少文件")
+        return file_list[0]
+
+    async def preview_table(
+        self,
+        *,
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the new table endpoints to share file-reading and multipart-parsing helpers and to clarify `_read_single_upload`’s contract and typing.

The reviewer’s points are valid; the new endpoints duplicate logic and `_read_single_upload` has a misleading contract. You can reduce complexity without changing behavior by:

1. **Extracting shared file-reading logic**
2. **Extracting shared multipart/form parsing/validation**
3. **Fixing `_read_single_upload`’s typing/docstring mismatch**

### 1. Extract a helper to read file bytes

Both `preview_table` and `import_table` repeat the same temp-file logic. Moving it into a helper keeps behavior identical and makes both endpoints shorter and easier to follow:

```python
async def _read_file_bytes(self, file) -> tuple[str, bytes]:
    file_name = file.filename
    temp_file_path = (
        Path(get_astrbot_temp_path()) / f"kb_table_{uuid.uuid4()}_{file_name}"
    )
    await file.save(temp_file_path)
    try:
        async with aiofiles.open(temp_file_path, "rb") as file_obj:
            file_content = await file_obj.read()
    finally:
        temp_file_path.unlink(missing_ok=True)
    return file_name, file_content
```

Then `preview_table` and `import_table` become:

```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)
```

`preview_table`:

```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)

from astrbot.core.knowledge_base.parsers.table_parser import (
    is_table_file,
    parse_table,
)

if not is_table_file(file_name):
    raise KnowledgeBaseServiceError("不支持的表格格式,请上传 csv/xls/xlsx 文件")

try:
    result = parse_table(file_content, file_name, header_row=header_row)
except Exception as exc:
    raise KnowledgeBaseServiceError(f"解析表格失败: {exc!s}") from exc
```

`import_table`:

```python
file = self._read_single_upload(form_data, files)
file_name, file_content = await self._read_file_bytes(file)
file_type = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
file_size = len(file_content)
```

This keeps all semantics (temp file, cleanup, same naming), but removes duplicated branches and file-handling logic.

### 2. Extract shared multipart/form validation

The multipart and parameter parsing logic is almost identical between `preview_table` and `import_table`. You can encapsulate this in a small helper that returns parsed primitives:

```python
from dataclasses import dataclass

@dataclass
class TableRequestParams:
    kb_id: str
    header_row: int

def _parse_table_common_params(self, content_type: str | None, form_data) -> TableRequestParams:
    if content_type and "multipart/form-data" not in content_type:
        raise KnowledgeBaseServiceError("Content-Type 须为 multipart/form-data")

    kb_id = form_data.get("kb_id")
    if not kb_id:
        raise KnowledgeBaseServiceError("缺少参数 kb_id")

    header_row = int(form_data.get("header_row", 0))
    return TableRequestParams(kb_id=kb_id, header_row=header_row)
```

Then:

```python
params = self._parse_table_common_params(content_type, form_data)
# preview_table
header_row = params.header_row
# import_table
kb_id = params.kb_id
header_row = params.header_row
```

`import_table` can still keep its own extra fields:

```python
batch_size = int(form_data.get("batch_size", 32))
tasks_limit = int(form_data.get("tasks_limit", 3))
max_retries = int(form_data.get("max_retries", 3))
```

This removes duplicated Content-Type / kb_id / header_row handling and keeps the endpoints focused on their unique logic.

### 3. Fix `_read_single_upload` typing / contract

Right now `_read_single_upload` is annotated as returning `tuple[str, bytes]` and the docstring says it returns `(file_name, file_content)`, but it actually returns an `UploadFile` instance. That mismatch makes the call sites harder to reason about.

You can either:

**Option A – keep returning UploadFile and fix types/docstring:**

```python
from typing import Any

@staticmethod
def _read_single_upload(form_data: Any, files: Any):
    """Return the first UploadFile from the multipart request.

    Raises:
        KnowledgeBaseServiceError: If no file is present.
    """
    file_list = []
    for key in files.keys():
        if key == "file" or key.startswith("file") or key == "files[]":
            file_list.extend(files.getlist(key))
    if not file_list:
        raise KnowledgeBaseServiceError("缺少文件")
    return file_list[0]
```

**Option B – make it actually return `(file_name, file_content)` and adjust callers.**  
Since you already have `_read_file_bytes`, Option A plus using that helper (as above) keeps behavior unchanged with minimal refactor.

---

These small extractions make `preview_table` / `import_table` shorter and more linear, reduce duplication, and align the helper method contract with its actual behavior, without removing or altering any features.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread astrbot/dashboard/services/knowledge_base_service.py Outdated
Comment thread astrbot/dashboard/services/knowledge_base_service.py
Comment thread dashboard/src/views/knowledge-base/components/TableImportDialog.vue
Comment thread astrbot/dashboard/services/knowledge_base_service.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a structured table knowledge base feature, allowing users to import CSV, XLS, and XLSX files where each row is treated as an independent knowledge unit. It includes a new table parser, database migrations for the new kb_type and table_schema fields, backend API endpoints for table preview and import, and corresponding frontend updates. The review feedback is highly constructive, pointing out several opportunities to enhance robustness. Key recommendations include fixing header de-duplication to prevent collisions, catching broader exceptions during CSV encoding detection, handling None values gracefully when formatting row text, refactoring duplicate file-saving logic into a shared helper, and adding defensive validation for numeric parameters and configuration schemas.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread astrbot/core/knowledge_base/parsers/table_parser.py
Comment thread astrbot/core/knowledge_base/parsers/table_parser.py Outdated
Comment thread astrbot/core/knowledge_base/kb_helper.py Outdated
Comment thread astrbot/dashboard/services/knowledge_base_service.py
Comment thread astrbot/dashboard/services/knowledge_base_service.py Outdated
Comment thread astrbot/dashboard/services/knowledge_base_service.py Outdated
@first-dong first-dong closed this Jun 21, 2026
@first-dong first-dong reopened this Jun 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:webui The bug / feature is about webui(dashboard) of astrbot. feature:knowledge-base The bug / feature is about knowledge base size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant