[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer by nishant94 · Pull Request #64667 · apache/doris

nishant94 · 2026-06-22T05:53:45Z

What problem does this PR solve?

Issue Number: #64646

Related PR: None

Problem Summary:
Doris has no Japanese-aware tokenizer for the inverted index. Japanese text has no spaces between words, so the existing parsers can't segment it and MATCH / MATCH_PHRASE on Japanese columns end up with poor recall and precision.

This PR adds a built-in kuromoji parser for Japanese, in the same style as the existing Chinese IK analyzer. It's opt-in per column:

 INDEX content_idx (`content`) USING INVERTED
 PROPERTIES("parser" = "kuromoji", "parser_mode" = "search");

After indexing, MATCH, MATCH_PHRASE and TOKENIZE() run against the segmented Japanese terms.

How it works:

- Native C++ under be/src/storage/index/inverted/analyzer/kuromoji/, so there's no JVM on the indexing path. KuromojiAnalyzer / KuromojiTokenizer mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the IPADIC connection-cost matrix.
- The dictionary is a process-wide singleton loaded once from ${inverted_index_dict_path}/kuromoji. An offline converter compiles raw IPADIC into a compact C++ runtime format (double-array trie + cost matrix + char/unknown tables) at build time, so no binary blob is committed.
- search (default), normal and extended modes are supported. No thrift/proto changes — parser and mode ride as strings in the index properties.
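For readers unfamiliar with Viterbi cost-model segmentation, the following is a minimal sketch of the idea, not the code in this PR. It assumes a toy ASCII lexicon scanned linearly (the PR uses a double-array trie over IPADIC) and a made-up 3x3 connection-cost matrix. `ToyEntry`, `toy_segment`, and every cost value are hypothetical names and numbers chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy lexicon entry: surface form, context id, word cost.
struct ToyEntry {
    std::string_view surface;
    int ctx;        // used as both left and right context id in this toy
    int word_cost;  // lower means more likely
};

// Returns the lowest-cost segmentation of `text`, or an empty vector when the
// toy lexicon cannot cover the whole input.
std::vector<std::string> toy_segment(std::string_view text) {
    static const ToyEntry kDict[] = {
            {"ab", 1, 10}, {"a", 1, 8}, {"b", 2, 8}, {"c", 2, 5}, {"bc", 2, 6}};
    // kConn[right ctx of previous word][left ctx of next word]; ctx 0 is BOS/EOS.
    static const int kConn[3][3] = {{0, 0, 0}, {0, 5, 1}, {0, 1, 5}};
    constexpr std::int64_t kInf = std::numeric_limits<std::int64_t>::max() / 2;

    struct Node {
        std::size_t start, end;
        int ctx;
        std::int64_t total;  // best accumulated cost of any path ending here
        int prev;            // back-pointer into `nodes`
    };
    std::vector<Node> nodes;
    nodes.push_back(Node {0, 0, 0, 0, -1});  // BOS node
    std::vector<std::vector<int>> ending_at(text.size() + 1);
    ending_at[0].push_back(0);

    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        if (ending_at[pos].empty()) continue;  // no path reaches this position
        for (const ToyEntry& e : kDict) {
            if (text.substr(pos, e.surface.size()) != e.surface) continue;
            // Relax the new node against every node ending at `pos`.
            std::int64_t best = kInf;
            int best_prev = -1;
            for (int p : ending_at[pos]) {
                const Node& pv = nodes[static_cast<std::size_t>(p)];
                const std::int64_t c = pv.total + kConn[pv.ctx][e.ctx];
                if (c < best) {
                    best = c;
                    best_prev = p;
                }
            }
            const std::size_t end = pos + e.surface.size();
            nodes.push_back(Node {pos, end, e.ctx, best + e.word_cost, best_prev});
            ending_at[end].push_back(static_cast<int>(nodes.size()) - 1);
        }
    }

    // EOS: pick the cheapest node that ends exactly at the end of the text.
    std::int64_t best = kInf;
    int cur = -1;
    for (int p : ending_at[text.size()]) {
        const Node& pv = nodes[static_cast<std::size_t>(p)];
        const std::int64_t c = pv.total + kConn[pv.ctx][0];
        if (c < best) {
            best = c;
            cur = p;
        }
    }

    // Follow back-pointers to BOS (index 0), then restore left-to-right order.
    std::vector<std::string> out;
    for (; cur > 0; cur = nodes[static_cast<std::size_t>(cur)].prev) {
        const Node& nd = nodes[static_cast<std::size_t>(cur)];
        out.emplace_back(text.substr(nd.start, nd.end - nd.start));
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

With these toy costs, `toy_segment("abc")` picks `a` + `bc` (total 15) over `ab` + `c` (total 16), because the cheaper word cost of `bc` outweighs the alternatives once connection costs are added.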

Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the same lexicon Lucene kuromoji uses).

Release note

Support Japanese text tokenization in the inverted index via a new kuromoji parser (PROPERTIES("parser"="kuromoji")), with search/normal/extended modes.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)

  CREATE TABLE test_jp (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED
      PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
  ) ENGINE=OLAP
  DUPLICATE KEY(id)
  DISTRIBUTED BY HASH(id) BUCKETS 1
  PROPERTIES("replication_num" = "1");

  INSERT INTO test_jp VALUES
    (1, '東京都に住んでいます'),
    (2, '日本語の形態素解析エンジン');

  -- search-mode decompounding: 東京都 also matches 東京
  SELECT id FROM test_jp WHERE content MATCH '東京';          -- expect: 1
  SELECT id FROM test_jp WHERE content MATCH_PHRASE '形態素解析'; -- expect: 2

  -- inspect segmentation directly
  SELECT TOKENIZE('東京都に住んでいます', '"parser"="kuromoji","parser_mode"="search"');

Behavior changed:
- No.
- Yes. It adds a new opt-in kuromoji parser. Existing parsers and their output are unchanged; the new behavior only applies to indexes that explicitly set parser="kuromoji".
Does this need documentation?
- No.
- Yes. PR Link to Doris-Website.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-06-22T05:53:50Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

nishant94 · 2026-06-22T06:23:33Z

run buildall

yiguolei · 2026-06-22T06:27:35Z

@nishant94 have you tried icu analyzer? because I think icu could handle many different languages.

nishant94 · 2026-06-22T06:31:13Z

> @nishant94 have you tried icu analyzer? because I think icu could handle many different languages.

@yiguolei The ICU analyzer is not as good as Kuromoji for Japanese. There is a huge difference between ICU and Kuromoji in how they handle the morphology of Japanese words, so I think it is worth adding this new parser.

BiteTheDDDDt · 2026-06-22T07:02:18Z

Is the code under be/src/storage/index/inverted/analyzer/kuromoji entirely original or derived from other projects? Perhaps we need to clarify the situation regarding this part.

nishant94 · 2026-06-22T07:29:43Z

> Is the code under be/src/storage/index/inverted/analyzer/kuromoji entirely original or derived from other projects? Perhaps we need to clarify the situation regarding this part.

This is original code but it is modeled on Apache Lucene's kuromoji.

hello-stephen · 2026-06-22T07:56:03Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

nishant94 · 2026-06-22T09:58:03Z

run buildall

hello-stephen · 2026-06-22T12:00:26Z

BE UT Coverage Report

Increment line coverage 82.40% (791/960) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.51% (21439/39327)
Line Coverage	38.17% (205347/537919)
Region Coverage	34.16% (161044/471416)
Branch Coverage	35.14% (70517/200651)

hello-stephen · 2026-06-22T15:49:21Z

BE UT Coverage Report

Increment line coverage 84.10% (836/994) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.50% (21433/39329)
Line Coverage	38.13% (205092/537920)
Region Coverage	34.11% (160793/471446)
Branch Coverage	35.11% (70468/200678)

hello-stephen · 2026-06-22T16:58:18Z

BE Regression && UT Coverage Report

Increment line coverage 83.85% (462/551) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	74.11% (28441/38375)
Line Coverage	58.02% (309954/534209)
Region Coverage	54.69% (258833/473301)
Branch Coverage	56.10% (112608/200725)

hello-stephen · 2026-06-22T17:05:28Z

FE Regression Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

nishant94 · 2026-06-23T06:16:44Z

run buildall

hello-stephen · 2026-06-23T11:38:51Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-06-23T11:55:51Z

FE Regression Coverage Report

Increment line coverage 35.29% (6/17) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-06-23T14:39:39Z

BE UT Coverage Report

Increment line coverage 84.10% (836/994) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.72% (21523/39332)
Line Coverage	38.18% (205493/538169)
Region Coverage	34.17% (161179/471738)
Branch Coverage	35.13% (70561/200832)

hello-stephen · 2026-06-23T14:44:04Z

BE Regression && UT Coverage Report

Increment line coverage 83.85% (462/551) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	74.19% (28466/38371)
Line Coverage	58.03% (310151/534436)
Region Coverage	54.77% (259367/473580)
Branch Coverage	56.13% (112755/200875)

nishant94 · 2026-06-24T05:18:18Z

run buildall

hello-stephen · 2026-06-24T06:46:43Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-06-24T10:23:58Z

BE UT Coverage Report

Increment line coverage 84.20% (842/1000) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.58% (21476/39348)
Line Coverage	38.09% (205045/538313)
Region Coverage	34.07% (160754/471838)
Branch Coverage	35.04% (70387/200890)

hello-stephen · 2026-06-24T14:28:27Z

BE Regression && UT Coverage Report

Increment line coverage 84.02% (468/557) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	74.22% (28491/38387)
Line Coverage	58.10% (310621/534591)
Region Coverage	55.02% (260603/473686)
Branch Coverage	56.29% (113100/200937)

nishant94 · 2026-06-27T06:27:48Z

run buildall

hello-stephen · 2026-06-27T08:00:55Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2026-06-27T12:22:49Z

BE Regression && UT Coverage Report

Increment line coverage 84.02% (468/557) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	74.20% (28536/38456)
Line Coverage	58.09% (310843/535090)
Region Coverage	54.89% (260112/473920)
Branch Coverage	56.16% (112901/201051)

hello-stephen · 2026-06-27T12:34:13Z

FE Regression Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

Ryan19929 · 2026-06-29T06:36:19Z

Since this is modeled after Lucene Kuromoji, have you checked how the Doris implementation performs in practice? A small benchmark for indexing throughput would be helpful.

airborne12

Requesting changes. The main issue is that parser=kuromoji is exposed as a production analyzer, but the default package path does not guarantee the runtime dictionary exists, and the runtime silently falls back to per-codepoint tokenization. That can build/query inverted indexes with semantics different from the documented Kuromoji morphology. I also found dictionary loader validation gaps and FE/BE/TOKENIZE default/validation inconsistencies.

airborne12 · 2026-06-29T07:32:07Z

+install(DIRECTORY
+    ${BASE_DIR}/dict/kuromoji
+    DESTINATION ${OUTPUT_DIR}/dict
+    OPTIONAL)


[Blocking] This install rule does not guarantee that the runtime Kuromoji dictionary is present. The generated dict/kuromoji/*.bin files are ignored/not committed, kuromoji_build_dict is EXCLUDE_FROM_ALL, and kuromoji_dict is only a manual target. A default package can therefore ship only dict/kuromoji/README.md, while the BE later loads ${inverted_index_dict_path}/kuromoji.

Please make package/install depend on dictionary generation and fail if system.bin, matrix.bin, chardef.bin, and unkdict.bin are missing. OPTIONAL should not hide a missing required analyzer artifact.

Agreed, shipping a package that only contains the README would be a trap.

I made two changes to handle this:

kuromoji_dict now builds as part of ALL, so a normal build generates system.bin/matrix.bin/chardef.bin/unkdict.bin from the staged mecab-ipadic.

Dropped OPTIONAL and added an install-time check that FATAL_ERRORs if any of the four files are missing.

airborne12 · 2026-06-29T07:32:07Z

+    // Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it is
+    // unavailable the tokenizer degrades to a per-codepoint split (logged), rather
+    // than failing index/query.
+    void initDict(const std::string& dictPath) override {


[Blocking] Missing or corrupt dictionary should not silently fall back for the production kuromoji parser. If indexing runs with dict_ == nullptr, segments are written with per-codepoint tokens; after the dictionary is installed/reloaded, query analyzers can produce real Kuromoji tokens, so old and new segments have different tokenization semantics.

Please fail analyzer creation/index/query with a clear error when the Kuromoji dictionary cannot be loaded. If fallback tokenization is desired, expose it as an explicit parser/mode persisted in index metadata.

Nice catch, you're right that the silent fallback is dangerous. I removed the fallback entirely: initDict now throws when the dictionary can't be loaded, and the tokenizer throws if it's ever handed a null dict.
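As an illustration of the fail-fast behavior described in this reply, here is a minimal sketch that refuses to proceed when any of the four dictionary files is absent. It is not the PR's `initDict`; `require_kuromoji_dict` is a hypothetical helper, and only the four file names come from the review thread.

```cpp
#include <array>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical helper: throws instead of degrading to per-codepoint tokens
// when a required dictionary artifact is missing from `dict_dir`.
void require_kuromoji_dict(const std::filesystem::path& dict_dir) {
    static constexpr std::array<const char*, 4> kRequired = {
            "system.bin", "matrix.bin", "chardef.bin", "unkdict.bin"};
    for (const char* name : kRequired) {
        const std::filesystem::path file = dict_dir / name;
        std::error_code ec;  // non-throwing overload; any error counts as "missing"
        if (!std::filesystem::is_regular_file(file, ec)) {
            throw std::runtime_error("kuromoji dictionary file missing: " + file.string());
        }
    }
}
```

Failing at analyzer creation keeps index-time and query-time tokenization consistent: a segment is either written with real morphological tokens or not written at all.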

airborne12 · 2026-06-29T07:32:07Z

+    const uint8_t* p = _system_map.data();
+    RETURN_IF_ERROR(check_header(p, _system_map.size(), KMJ_KIND_SYSTEM));
+    KmjSystemHeader s {};
+    std::memcpy(&s, p + sizeof(KmjFileHeader), sizeof(s));


[Blocking] check_header() only validates the common header, but the loader then trusts all sub-header offsets/counts and installs mmap pointers from them. A header-valid but truncated/corrupt *.bin can make these memcpy/pointer assignments and later accessors read past the mmap.

Please validate each sub-header before exposing pointers: sizeof(header + subheader), every offset + count * sizeof(T) range, integer overflow, trie byte alignment, class_count == CAT_CLASS_COUNT, matrix dimensions/cell count, run ranges into entry arrays, feature offsets, and word left/right IDs against matrix bounds. Return Status::Corruption for invalid artifacts.

Fixed. The loader now validates before exposing any pointer.
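The kind of check requested above (every `offset + count * sizeof(T)` range validated without integer overflow before a pointer is exposed) can be sketched as follows. This is an assumption about the shape of such a check, not the loader's actual code; `range_in_bounds` and `is_aligned` are hypothetical helpers.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true only if the byte range
// [offset, offset + count * elem_size) lies inside a mapping of `map_size`
// bytes. Written so that no intermediate computation can overflow.
constexpr bool range_in_bounds(std::uint64_t map_size, std::uint64_t offset,
                               std::uint64_t count, std::uint64_t elem_size) {
    if (elem_size != 0 && count > std::numeric_limits<std::uint64_t>::max() / elem_size) {
        return false;  // count * elem_size would overflow
    }
    const std::uint64_t bytes = count * elem_size;
    if (offset > map_size) {
        return false;  // start is already past the end of the mapping
    }
    return bytes <= map_size - offset;  // subtraction cannot underflow here
}

// Hypothetical helper: alignment check for typed views into the mapping.
constexpr bool is_aligned(std::uint64_t offset, std::uint64_t alignment) {
    return alignment != 0 && offset % alignment == 0;
}
```

A loader built on these helpers would return a corruption status as soon as either check fails, before any typed pointer into the mmap is stored.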

airborne12 · 2026-06-29T07:32:07Z

+            return INVERTED_INDEX_PARSER_SMART;
+        }
+        if (parser_it->second == INVERTED_INDEX_PARSER_KUROMOJI) {
+            return INVERTED_INDEX_PARSER_KUROMOJI_SEARCH;


[Major] BE now defaults omitted parser_mode for parser=kuromoji to search, but FE InvertedIndexProperties.getInvertedIndexParserMode() still only special-cases IK and otherwise returns coarse_grained. Match predicate thrift serialization uses the FE helper, so FE and BE disagree on the effective default.

Please update the FE default helper to return search for Kuromoji and add FE coverage for a Kuromoji index/query without an explicit parser_mode.

Good catch!

Updated InvertedIndexProperties.getInvertedIndexParserMode() to return search for kuromoji so it matches the BE default, and added FE tests for the kuromoji default (and explicit) mode.

airborne12 · 2026-06-29T07:32:07Z

+    if (mode == "extended") {
+        return KuromojiMode::Extended;
+    }
+    return KuromojiMode::Search; // default (matches OpenSearch/Lucene)


[Major] This silently maps any unknown Kuromoji mode to Search. That makes TOKENIZE(..., '"parser"="kuromoji","parser_mode"="bogus"') accepted and executed as search, while index DDL rejects the same value.

Please make mode parsing return an error/status for unknown values, and apply the same Kuromoji property validation to TOKENIZE as DDL/index creation.

Fixed. kuromoji_mode_from_string now rejects unknown values with an error (empty/search/normal/extended only). Since TOKENIZE builds its analyzer through the same create_builtin_analyzer path, it now rejects a bogus parser_mode just like DDL does. Also added a BE test for that.
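A strict mode parser along these lines might look like the sketch below. The accepted values (empty, search, normal, extended) come from the reply above; the `SketchMode` enum, the function name, and the use of `std::optional` to signal rejection are assumptions for illustration, since the real `kuromoji_mode_from_string` signature is not shown in this thread.

```cpp
#include <optional>
#include <string_view>

// Hypothetical stand-in for the PR's KuromojiMode enumeration.
enum class SketchMode { Search, Normal, Extended };

// Returns the parsed mode, or std::nullopt for any unrecognized value so the
// caller can surface an error instead of silently defaulting to search.
std::optional<SketchMode> parse_kuromoji_mode_sketch(std::string_view mode) {
    if (mode.empty() || mode == "search") {
        return SketchMode::Search;  // omitted parser_mode defaults to search
    }
    if (mode == "normal") {
        return SketchMode::Normal;
    }
    if (mode == "extended") {
        return SketchMode::Extended;
    }
    return std::nullopt;  // e.g. "bogus"
}
```

Routing both DDL validation and TOKENIZE through one parser of this kind is what keeps the two entry points from disagreeing on which values are legal.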

airborne12 · 2026-06-29T07:32:07Z

+        sql """ INSERT INTO ${tableName} VALUES (3, "Apache Doris は高速です"); """
+        sql "sync"
+
+        // The kuromoji dictionary is not shipped in the p0 package, so the


[Blocking] This regression is proving the fallback path, not the feature added by this PR. Because the p0 package does not ship the dictionary, CI can pass while the real Kuromoji/IPADIC morphology path is never exercised.

Please add a required CI/regression path that generates/ships the dictionary and asserts real morphology behavior, e.g. search-mode decompound (東京都 matching 東京), base-form/POS behavior, and extended-mode unknown-word splitting. Keep fallback coverage separate if fallback remains explicit.

Right. With the dictionary now shipped and the fallback gone, I rewrote test_japanese_analyzer to assert real morphology instead of unigrams.

Add a built-in `kuromoji` inverted-index parser that segments Japanese text into morphemes, mirroring the existing Chinese IK analyzer.

- Added `darts.h` to `.clang-format-ignore` and `.licenserc.yaml`.
- Improved code formatting in various Kuromoji source files for better readability.
- Updated test files to include necessary headers.

…mposition

- Added support for search mode in the Kuromoji Viterbi segmenter, applying penalties for long all-kanji and other tokens to enhance search recall.
- Updated the KuromojiMode enumeration to reflect the new search and extended modes.
- Modified the KuromojiTokenizer to utilize the new mode functionality.
- Added unit tests to validate the behavior of the search mode, ensuring correct segmentation of compounds.
- Updated NOTICE.txt to include Apache Lucene as a dependency for the kuromoji analyzer.

…wn words

- Implemented functionality in the Kuromoji Viterbi segmenter to decompose unknown (out-of-vocabulary) words into per-character unigrams when in extended mode, aligning with Lucene's JapaneseTokenizer behavior.
- Added unit tests to validate the correct segmentation of unknown words in both normal and extended modes, ensuring expected outputs for various input scenarios.
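The per-character unigram split mentioned for extended mode can be sketched as below. This covers only the splitting step for a word already classified as unknown, and it is an illustration rather than the PR's code; `utf8_len` and `to_unigrams` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Byte length of a UTF-8 sequence given its lead byte. Invalid lead bytes are
// treated as length 1 so that the scan always makes progress.
inline std::size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Splits an out-of-vocabulary word into one token per code point.
std::vector<std::string> to_unigrams(std::string_view word) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < word.size();) {
        std::size_t len = utf8_len(static_cast<unsigned char>(word[i]));
        if (len > word.size() - i) {
            len = word.size() - i;  // truncated trailing sequence
        }
        out.emplace_back(word.substr(i, len));
        i += len;
    }
    return out;
}
```

In normal mode the same unknown word would stay a single token; the split only applies when the index explicitly opts into extended mode.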

- Modified error messages to include 'kuromoji' parser in the parser mode validation.
- Enhanced tests for the Japanese analyzer to assert expected tokenization results.

- Introduced a new configuration option `enable_kuromoji_analyzer` to toggle the Kuromoji analyzer functionality.
- Updated unit tests to validate the behavior of the Kuromoji analyzer when enabled and disabled.
- Modified tests to enable the Kuromoji analyzer for specific test cases.

- Updated the namespace for Kuromoji components from `doris::segment_v2::kuromoji` to `doris::segment_v2::inverted_index::kuromoji` across multiple files for better organization and clarity.

…ictionary

- Updated the CMake configuration to ensure the required Kuromoji dictionary files are present at build time, failing the build if any are missing.
- Modified the KuromojiAnalyzer and KuromojiTokenizer to throw exceptions when the dictionary is not loaded, preventing silent fallbacks to per-codepoint tokenization.
- Improved error handling and validation in the dictionary loading process to ensure robust operation.
- Updated unit tests to validate the new behavior, ensuring that missing dictionaries trigger appropriate errors.

nishant94 · 2026-07-01T09:28:27Z

run buildall

hello-stephen · 2026-07-01T10:46:13Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

Ryan19929 · 2026-07-02T07:08:45Z

Since this is modeled after Lucene Kuromoji, have you checked how the Doris implementation performs in practice? A small benchmark for indexing throughput would be helpful.

[Non-blocking] Viterbi hot path allocates per byte position; +16% measured with a small change

here is what I measured (Release build, single pipeline task, sql cache off, 50×1MB natural text, sum(length(TOKENIZE(...))), best of 4):

parser	corpus	50MB time	throughput
kuromoji (this PR)	ja	19.3 s	2.6 MB/s
kuromoji (prototype below)	ja	16.2 s	3.1 MB/s (+16%)
icu	ja	11.2 s	4.5 MB/s
ik	zh	11.0 s	4.5 MB/s
chinese	zh	6.7 s	7.5 MB/s

To be fair, the other rows are not apples-to-apples baselines: icu does much lighter work on Japanese than a full lattice/Viterbi morphological analysis, and ik/chinese run on a Chinese corpus, so some gap is expected and inherent to what kuromoji does. I'm only including them as a rough sense of scale — kuromoji is the slowest builtin analyzer but stays within the same order of magnitude, so I don't see this as blocking.

That said, a profile shows a good chunk of the time goes to avoidable heap allocations, so there are two cheap wins in KuromojiViterbi::segment():

- std::vector<std::vector<int>> ending_at(n + 1) (kuromoji_viterbi.cpp:124): one vector object per document byte (~1M constructions for a 1MB doc) plus one heap allocation per reachable position.
- matches declared inside the per-position loop (kuromoji_viterbi.cpp:170): one malloc/free per position; common_prefix_search() already clears it, so it can be hoisted.

Prototype of exactly these two changes: 19.3s → 16.2s, byte-identical tokenizer output on the 50MB corpus. The < → <= flip preserves the original tie-break (chain iterates newest-first, original vector oldest-first).

--- a/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
+++ b/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
@@ -121,18 +121,25 @@ void KuromojiViterbi::segment(std::string_view text, std::vector<KuromojiMorphem
     }
 
     std::vector<VNode> nodes;
-    std::vector<std::vector<int>> ending_at(n + 1); // node indices ending at each byte position
+    // Intrusive per-end-position chain: end_head[e] is the most recent node index
+    // ending at byte position e, end_next[i] links to the previous one. This avoids
+    // allocating n+1 std::vector objects per document.
+    std::vector<int32_t> end_head(n + 1, -1);
+    std::vector<int32_t> end_next;
 
     // BOS (index 0): ends at position 0, context id 0, zero cost.
     nodes.push_back(VNode {0, 0, 0, 0, 0, false, 0, 0, -1});
-    ending_at[0].push_back(0);
+    end_next.push_back(-1);
+    end_head[0] = 0;
 
     // Add a node and relax it against all nodes ending at its start position.
     auto add_node = [&](uint32_t s, uint32_t e, int16_t lid, int16_t rid, int16_t wcost, bool known,
                         uint32_t wid) {
         int64_t best = KMJ_INF;
         int best_prev = -1;
-        for (int pe : ending_at[s]) {
+        // Chain is iterated newest-first; "<=" keeps the oldest node on cost ties,
+        // matching the original insertion-order "<" selection exactly.
+        for (int pe = end_head[s]; pe >= 0; pe = end_next[pe]) {
             const VNode& pv = nodes[static_cast<std::size_t>(pe)];
             if (pv.total_cost >= KMJ_INF) {
                 continue;
@@ -140,7 +147,7 @@ void KuromojiViterbi::segment(std::string_view text, std::vector<KuromojiMorphem
             const int64_t c =
                     pv.total_cost + _dict.connection_cost(static_cast<uint32_t>(pv.right_id),
                                                           static_cast<uint32_t>(lid));
-            if (c < best) {
+            if (c <= best) {
                 best = c;
                 best_prev = pe;
             }
@@ -154,12 +161,15 @@ void KuromojiViterbi::segment(std::string_view text, std::vector<KuromojiMorphem
         const auto idx = static_cast<int>(nodes.size());
         nodes.push_back(
                 VNode {s, e, lid, rid, wcost, known, wid, best + wcost + penalty, best_prev});
-        ending_at[e].push_back(idx);
+        end_next.push_back(end_head[e]);
+        end_head[e] = idx;
     };
 
     uint32_t pos = 0;
+    // Reused across positions; common_prefix_search clears it on entry.
+    std::vector<KuromojiDictionary::PrefixMatch> matches;
     while (pos < n) {
-        if (ending_at[pos].empty()) {
+        if (end_head[pos] < 0) {
             pos += decode_utf8(text, pos).len; // unreachable boundary; skip
             continue;
         }
@@ -167,7 +177,6 @@ void KuromojiViterbi::segment(std::string_view text, std::vector<KuromojiMorphem
         const auto before = nodes.size();
 
         // System-dictionary words (common-prefix search).
-        std::vector<KuromojiDictionary::PrefixMatch> matches;
         _dict.common_prefix_search(text.data() + pos, n - pos, &matches);
         bool any_known = false;
         for (const auto& mt : matches) {
@@ -219,14 +228,14 @@ void KuromojiViterbi::segment(std::string_view text, std::vector<KuromojiMorphem
     // EOS: best node ending at n connected to the EOS context (id 0).
     int64_t best = KMJ_INF;
     int best_prev = -1;
-    for (int pe : ending_at[n]) {
+    for (int pe = end_head[n]; pe >= 0; pe = end_next[pe]) {
         const VNode& pv = nodes[static_cast<std::size_t>(pe)];
         if (pv.total_cost >= KMJ_INF) {
             continue;
         }
         const int64_t c =
                 pv.total_cost + _dict.connection_cost(static_cast<uint32_t>(pv.right_id), 0);
-        if (c < best) {
+        if (c <= best) {
             best = c;
             best_prev = pe;
         }
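The tie-break argument behind the `<` to `<=` flip can be checked in isolation. The sketch below is a standalone illustration, not part of the patch: it compares an oldest-first scan using `<` with a newest-first intrusive chain using `<=`, under the assumption that every candidate cost is strictly below the sentinel `kInf` (the patch skips nodes at or above `KMJ_INF`). The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

constexpr std::int64_t kInf = std::numeric_limits<std::int64_t>::max();

// Original scheme: visit candidates oldest-first, keep the first strict minimum.
int pick_oldest_first(const std::vector<std::int64_t>& costs) {
    std::int64_t best = kInf;
    int best_idx = -1;
    for (std::size_t i = 0; i < costs.size(); ++i) {
        if (costs[i] < best) {
            best = costs[i];
            best_idx = static_cast<int>(i);
        }
    }
    return best_idx;
}

// Prototype scheme: candidates live on an intrusive chain (head = newest,
// next[i] = previously inserted), visited newest-first with "<=" so that the
// oldest candidate still wins on equal cost.
int pick_chain_newest_first(const std::vector<std::int64_t>& costs) {
    std::int32_t head = -1;
    std::vector<std::int32_t> next;
    for (std::size_t i = 0; i < costs.size(); ++i) {
        next.push_back(head);  // link to the previous head
        head = static_cast<std::int32_t>(i);
    }
    std::int64_t best = kInf;
    int best_idx = -1;
    for (std::int32_t i = head; i >= 0; i = next[static_cast<std::size_t>(i)]) {
        if (costs[static_cast<std::size_t>(i)] <= best) {
            best = costs[static_cast<std::size_t>(i)];
            best_idx = i;
        }
    }
    return best_idx;
}
```

For costs `{7, 3, 9, 3, 5}` both functions return index 1: the chain visits index 3 first, then overwrites it with index 1 because of `<=`, which is the same winner the oldest-first `<` scan keeps.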

- Replaced the `ending_at` vector with `end_head` and `end_next` for better memory management and performance during node processing.
- Updated node addition and traversal logic to utilize the new data structures, enhancing the segmenter's efficiency in handling word segmentation.

nishant94 · 2026-07-02T07:39:28Z

> [Non-blocking] Viterbi hot path allocates per byte position; +16% measured with a small change

@Ryan19929 I am glad to see you spent time benchmarking the analyzers, and I love seeing these numbers. Your optimization suggestion is also very helpful. I took it as a reference and swapped the ending_at vector-of-vectors for the intrusive end_head/end_next chain, and hoisted matches out of the per-position loop so common_prefix_search() reuses one buffer.

Thank you Ryan !!

nishant94 requested review from BiteTheDDDDt, airborne12 and zclllyybb as code owners June 22, 2026 05:53

nishant94 force-pushed the feat/kuromoji-japanese-analyzer branch from 389fcfb to b79db3c Compare June 22, 2026 09:57

morningman self-assigned this Jun 23, 2026

yiguolei reviewed Jun 24, 2026

Comment thread be/src/storage/index/inverted/analyzer/analyzer.cpp Outdated

nishant94 force-pushed the feat/kuromoji-japanese-analyzer branch from db0ee69 to 06b4ef6 Compare June 24, 2026 03:59

nishant94 requested a review from yiguolei June 24, 2026 05:18

yiguolei reviewed Jun 26, 2026

Comment thread be/dict/kuromoji/README.md

nishant94 requested a review from yiguolei June 27, 2026 06:43

airborne12 requested changes Jun 29, 2026

nishant94 force-pushed the feat/kuromoji-japanese-analyzer branch from f334a15 to 698dfb5 Compare July 1, 2026 09:19

nishant94 added 11 commits July 1, 2026 14:56

[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer

b511ba0

Add a built-in `kuromoji` inverted-index parser that segments Japanese text into morphemes, mirroring the existing Chinese IK analyzer.

add empty line on the end of .gitignore

dcf3029

Update Kuromoji analyzer files and formatting

498fa8f

- Added `darts.h` to `.clang-format-ignore` and `.licenserc.yaml`.
- Improved code formatting in various Kuromoji source files for better readability.
- Updated test files to include necessary headers.

Enhance Japanese analyzer tests

da3b0d9

- Modified error messages to include 'kuromoji' parser in the parser mode validation.
- Enhanced tests for the Japanese analyzer to assert expected tokenization results.

fix indentation issues

f2c84fb

Refactor Kuromoji namespace to inverted_index

fa4ccab

- Updated the namespace for Kuromoji components from `doris::segment_v2::kuromoji` to `doris::segment_v2::inverted_index::kuromoji` across multiple files for better organization and clarity.

Update README.md to include Apache License information for Kuromoji d…

e658c6a

…ictionary

nishant94 force-pushed the feat/kuromoji-japanese-analyzer branch from 698dfb5 to cb16636 Compare July 1, 2026 09:26

nishant94 requested a review from airborne12 July 1, 2026 09:28

Fix comment formatting

be6cb04

yiguolei previously approved these changes Jul 1, 2026

Merge branch 'master' into feat/kuromoji-japanese-analyzer

237e7c3

nishant94 dismissed yiguolei’s stale review via 237e7c3 July 2, 2026 03:54
