fix(cn_index): wrap literal HTML in StringIO for pd.read_html #2258

Open
he-yufeng wants to merge 1 commit into microsoft:main from he-yufeng:fix/cn-index-read-html-stringio

@he-yufeng

Description

#2047 wrapped the literal HTML passed to pd.read_html in a StringIO for the us_index collector to silence pandas' FutureWarning:

Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.

The parallel cn_index collector was missed. _parse_table passes the csindex response body (an HTML string, not a URL/path) straight to pd.read_html(content) at scripts/data_collector/cn_index/collector.py:185, so it still hits the same deprecation and will break once pandas drops literal-string support.

Change

Import StringIO and wrap the content, matching what #2047 did for us_index. No behavior change on supported pandas versions; StringIO has always been accepted by read_html.
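As a rough illustration of the pattern, here is a minimal sketch. `parse_html_tables` is a hypothetical stand-in and not the collector's actual `_parse_table`; it only shows the shape of the wrap, assuming `content` is an HTML string already fetched from the server and that an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd


def parse_html_tables(content: str) -> list:
    """Parse every <table> in an HTML string.

    Hypothetical helper for illustration; not the code in collector.py.
    """
    # Wrapping the string in StringIO hands read_html a file-like object,
    # so pandas does not have to guess whether `content` is markup or a path/URL.
    return pd.read_html(StringIO(content))
```

The call site stays a one-line change: only the argument passed to `pd.read_html` is different.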

Reproduced locally on pandas 2.3.3:

>>> import pandas as pd
>>> from io import StringIO
>>> pd.read_html("<table>...</table>")        # FutureWarning: Passing literal html ...
>>> pd.read_html(StringIO("<table>...</table>"))  # no warning
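The same check can be scripted rather than eyeballed in a REPL. The sketch below is an assumption-laden example, not part of this PR: `literal_html_warnings` and `SAMPLE_HTML` are hypothetical names, and the filter simply matches the "literal html" wording quoted in the description.

```python
import warnings
from io import StringIO

import pandas as pd

# Hypothetical sample markup, used only for this check.
SAMPLE_HTML = "<table><tr><th>code</th></tr><tr><td>600000</td></tr></table>"


def literal_html_warnings(source) -> list:
    """Return the 'literal html' FutureWarning messages raised while parsing `source`."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown earlier in the process.
        warnings.simplefilter("always")
        pd.read_html(source)
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, FutureWarning) and "literal html" in str(w.message)
    ]
```

Calling `literal_html_warnings(StringIO(SAMPLE_HTML))` should return an empty list. Passing `SAMPLE_HTML` directly is expected to return one message on pandas versions that still accept literal strings with the deprecation in place.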

Note

I scoped this to the read_html fix so it stays a clean parallel of #2047. The same file also calls DataFrame.applymap (line 174), which pandas 2.1 deprecates in favour of DataFrame.map. However, DataFrame.map only exists from 2.1 onward, while setup.py still supports pandas>=1.1, so swapping it would need version handling and is left out here.
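For reference, the version handling could be done by feature detection instead of parsing the pandas version string. This is only a sketch of one option and is not proposed for this PR; `elementwise` is a hypothetical helper name.

```python
import pandas as pd


def elementwise(df: pd.DataFrame, func):
    """Apply `func` to every cell, preferring DataFrame.map when it exists.

    Hypothetical compatibility helper; not part of the collector.
    """
    # DataFrame.map was added in pandas 2.1; older versions only offer applymap.
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)
```

For example, `elementwise(df, str.strip)` strips whitespace from every cell of an all-string frame on either side of the 2.1 boundary.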
