[fix](statistics) full analyze not collect hot value by default#63625

Open
yujun777 wants to merge 2 commits into apache:master from yujun777:fix-full-analyze-hot-value

@yujun777 (Contributor) commented May 25, 2026

#62435 made full analyze always collect hot values, but on big tables the execution may exceed the statistics SQL memory limit (2 GB by default).

Keep hot value collection for sample analyze unchanged, while making manual full analyze require an explicit WITH HOT VALUE. Auto full analyze continues to skip hot values and auto sample analyze still collects them, so automatic behavior does not change.

usage:

analyze table t with sync with hot value
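
The resulting behavior can be summarized as a small decision table. The following Python sketch is illustrative only: `collects_hot_value` and its parameters are hypothetical names chosen for this example, not identifiers from the Doris code base, and it merely restates the rules described above.

```python
def collects_hot_value(auto: bool, sample: bool, with_hot_value: bool = False) -> bool:
    """Return whether an analyze job collects hot values (illustrative sketch).

    Restates the rules from this PR description; not actual Doris FE logic.
    """
    if sample:
        # Sample analyze (manual or auto) always collects hot values.
        return True
    if auto:
        # Auto full analyze keeps skipping hot values.
        return False
    # Manual full analyze collects hot values only with an explicit WITH HOT VALUE.
    return with_hot_value
```

Under these assumptions, `collects_hot_value(auto=False, sample=False)` is false, and passing `with_hot_value=True` for the same manual full analyze turns collection on.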

Tests:

  • FE UT: AnalyzeTableCommandTest, OlapAnalysisTaskTest, AnalysisManagerTest
  • Regression: test_hot_value, test_full_analyze_hot_value

Docs PR: apache/doris-website#3769

yujun777 added 2 commits May 25, 2026 17:10
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Full analyze can spend excessive memory collecting hot values for high-cardinality columns. This change adds a collect.hot.value analyze property so manual full analyze skips hot value collection by default, manual sample analyze keeps collecting hot values by default, and explicit properties can override both. Automatic analyze keeps the previous nullable internal setting so existing behavior is not changed. The no-hot-value analyze SQL templates directly return null hot_value instead of relying on optimizer simplification.
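
As a rough illustration of the default-and-override rule in this commit message, the sketch below resolves the effective setting for a manual analyze. The function names and the string parsing are assumptions made for the example; only the property name `collect.hot.value` and the two defaults come from the text above. Automatic analyze is left out because its internal setting is not detailed here.

```python
from typing import Mapping, Optional

PROPERTY = "collect.hot.value"  # property name taken from the commit message


def parse_collect_hot_value(properties: Mapping[str, str]) -> Optional[bool]:
    """Return the explicit setting, or None when the property is absent."""
    raw = properties.get(PROPERTY)
    if raw is None:
        return None
    value = raw.strip().lower()
    if value not in ("true", "false"):
        # Hypothetical validation; the real error handling is not shown in the PR.
        raise ValueError(f"invalid value for {PROPERTY}: {raw!r}")
    return value == "true"


def effective_collect_hot_value(is_sample: bool, properties: Mapping[str, str]) -> bool:
    """Manual analyze: an explicit property wins; otherwise sample collects and full skips."""
    explicit = parse_collect_hot_value(properties)
    if explicit is not None:
        return explicit
    return is_sample
```

Keeping the parsed value as `Optional[bool]` separates "not specified" from an explicit `false`, which is what lets the default differ between full and sample analyze.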

### Release note

Manual ANALYZE supports PROPERTIES("collect.hot.value"="true/false") to control hot value collection.

### Check List (For Author)

- Test: Unit Test and Regression test
    - ./build.sh --fe
    - bash ./run-fe-ut.sh --run OlapAnalysisTaskTest,HMSAnalysisTaskTest,AnalyzeTableCommandTest,AnalysisManagerTest
    - bash ./run-fe-ut.sh --run StatisticsUtilTest
    - sh run-regression-test.sh --run -d statistics -s test_full_analyze_hot_value,test_hot_value
    - sh run-regression-test.sh --run -d mv_p0/ssb/q_4_1_r1 -s q_4_1_r1
    - sh run-regression-test.sh --run -d nereids_rules_p0/distinct_split -s distinct_split
- Behavior changed: Yes. Manual full analyze no longer collects hot value by default; manual sample analyze and automatic analyze keep previous defaults unless the new property is explicitly set.
- Does this need documentation: Yes

Keep sample analyze hot value collection unchanged while making manual full analyze require an explicit WITH HOT VALUE. Auto full analyze still skips hot values, and auto sample analyze continues to collect them.

Key changes:
- Add WITH HOT VALUE parsing for analyze statements.
- Reject WITH HOT VALUE on sample analyze because sample always collects hot values.
- Remove sample no-hot-value SQL templates and keep full no-hot-value SQL on the old lightweight path.
- Set auto sample jobs to collect hot values explicitly.
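
The rejection rule in the list above could look roughly like the following check. This is a hedged sketch: `validate_hot_value_option` and the error message are invented for illustration, and the real validation in the Doris FE may differ.

```python
def validate_hot_value_option(is_sample: bool, with_hot_value: bool) -> None:
    """Reject WITH HOT VALUE on sample analyze (illustrative, hypothetical names)."""
    if is_sample and with_hot_value:
        # Sample analyze always collects hot values, so the clause is redundant there.
        raise ValueError("WITH HOT VALUE is only supported for full analyze")
```

Failing early at statement validation keeps the option from being silently ignored on sample jobs.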

Unit Test:
- AnalyzeTableCommandTest
- OlapAnalysisTaskTest
- AnalysisManagerTest
- Regression: test_hot_value,test_full_analyze_hot_value

@yujun777 (Contributor, Author)

run buildall

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777 changed the title from "[fix](fe) Refine hot value collection for analyze" to "[fix](statistics) full analyze not collect hot value by default" May 25, 2026
@yujun777 marked this pull request as draft May 25, 2026 12:08
@yujun777 marked this pull request as ready for review May 25, 2026 12:08
@yujun777 marked this pull request as draft May 25, 2026 12:23
@yujun777 marked this pull request as ready for review May 25, 2026 12:23

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31661 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0f10248e0745886036d80bd0c0681da5f15a0b63, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17749	4094	4107	4094
q2	q3	10793	1413	812	812
q4	4685	474	349	349
q5	7797	2281	2093	2093
q6	376	179	140	140
q7	959	774	652	652
q8	9343	1745	1654	1654
q9	7088	5018	4999	4999
q10	6482	2232	1914	1914
q11	443	267	251	251
q12	695	427	307	307
q13	18210	3436	2832	2832
q14	267	253	235	235
q15	q16	823	773	704	704
q17	997	927	977	927
q18	6924	5687	5506	5506
q19	1226	1310	1137	1137
q20	511	393	257	257
q21	5808	2628	2490	2490
q22	450	363	308	308
Total cold run time: 101626 ms
Total hot run time: 31661 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4437	4431	4403	4403
q2	q3	4575	4971	4422	4422
q4	2170	2304	1438	1438
q5	4493	4409	5172	4409
q6	250	196	148	148
q7	1996	1871	1703	1703
q8	2583	2281	2180	2180
q9	7923	8108	8058	8058
q10	4912	4792	4282	4282
q11	575	437	411	411
q12	787	760	559	559
q13	3382	3693	3032	3032
q14	296	316	295	295
q15	q16	739	766	651	651
q17	1430	1338	1371	1338
q18	8004	7437	7017	7017
q19	1106	1092	1114	1092
q20	2245	2224	1948	1948
q21	5418	4693	4680	4680
q22	532	507	446	446
Total cold run time: 57853 ms
Total hot run time: 52512 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 173783 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0f10248e0745886036d80bd0c0681da5f15a0b63, data reload: false

query5	4329	667	536	536
query6	325	224	196	196
query7	4222	571	302	302
query8	331	241	233	233
query9	8861	4251	4255	4251
query10	453	365	314	314
query11	5834	2570	2267	2267
query12	189	135	131	131
query13	1311	613	476	476
query14	6172	5545	5244	5244
query14_1	4545	4566	4581	4566
query15	216	208	186	186
query16	1090	472	445	445
query17	1172	746	629	629
query18	2653	503	371	371
query19	227	216	172	172
query20	144	138	143	138
query21	219	143	125	125
query22	13686	13538	13343	13343
query23	17594	16620	16404	16404
query23_1	16527	16534	16476	16476
query24	8085	1802	1332	1332
query24_1	1338	1328	1388	1328
query25	576	514	452	452
query26	1219	344	186	186
query27	3900	533	353	353
query28	4625	2098	2075	2075
query29	1030	673	524	524
query30	323	244	209	209
query31	1159	1102	981	981
query32	98	80	81	80
query33	553	383	315	315
query34	1209	1192	715	715
query35	768	798	684	684
query36	1388	1445	1259	1259
query37	151	112	91	91
query38	3271	3162	3099	3099
query39	927	920	908	908
query39_1	890	896	888	888
query40	224	143	128	128
query41	65	64	62	62
query42	108	109	108	108
query43	332	344	317	317
query44	
query45	213	206	198	198
query46	1083	1205	751	751
query47	2388	2422	2280	2280
query48	400	423	297	297
query49	625	491	389	389
query50	1041	370	248	248
query51	4433	4377	4321	4321
query52	107	106	96	96
query53	259	291	206	206
query54	313	269	273	269
query55	101	92	85	85
query56	308	308	305	305
query57	1450	1452	1334	1334
query58	295	267	274	267
query59	1622	1678	1481	1481
query60	324	343	310	310
query61	161	157	151	151
query62	699	656	584	584
query63	249	207	225	207
query64	2216	802	626	626
query65	
query66	1640	479	366	366
query67	30228	30272	29989	29989
query68	
query69	477	344	307	307
query70	1049	994	1003	994
query71	306	273	269	269
query72	2978	2726	2405	2405
query73	870	758	408	408
query74	5118	5013	4789	4789
query75	2710	2630	2296	2296
query76	2303	1167	764	764
query77	443	427	348	348
query78	12434	12549	11827	11827
query79	1444	1075	786	786
query80	656	542	455	455
query81	456	282	245	245
query82	1322	164	130	130
query83	352	287	254	254
query84	263	148	117	117
query85	897	544	451	451
query86	405	354	363	354
query87	3454	3431	3287	3287
query88	3637	2741	2707	2707
query89	442	393	347	347
query90	1990	183	181	181
query91	187	171	139	139
query92	83	76	74	74
query93	1581	1434	867	867
query94	521	334	305	305
query95	686	479	357	357
query96	1051	797	364	364
query97	2763	2721	2637	2637
query98	242	231	237	231
query99	1176	1166	1048	1048
Total cold run time: 257776 ms
Total hot run time: 173783 ms

@hello-stephen (Contributor)

FE Regression Coverage Report

Increment line coverage 90.00% (27/30) 🎉
Increment coverage report
Complete coverage report
