# Text _Path: en/lua/text/text_ ## Table of Contents - Text Processing ## Content # Text Processing Regular expressions, text diffing, and semantic text splitting. ## Loading ```lua local text = require("text") ``` ### Compile ```lua local re, err = text.regexp.compile("[0-9]+") ``` | Parameter | Type | Description | |-----------|------|-------------| | `pattern` | string | RE2 compatible regex pattern | **Returns:** `Regexp, error` ### Match ```lua local ok = re:match_string("abc123") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to match | **Returns:** `boolean` ### Find ```lua local match = re:find_string("abc123def") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `string | nil` ### Find All ```lua local matches = re:find_all_string("a1b2c3") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `string[]` ### Find with Groups ```lua local match = re:find_string_submatch("user@example.com") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `string[] | nil` (full match + capture groups) ### Find All with Groups ```lua local matches = re:find_all_string_submatch("a=1 b=2") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `string[][]` ### Find Index ```lua local pos = re:find_string_index("abc123") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `table | nil` ({start, end}, 1-based) ### Find All Index ```lua local positions = re:find_all_string_index("a1b2c3") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to search | **Returns:** `table[]` ### Replace ```lua local result = re:replace_all_string("a1b2", "X") ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | Input string | | `repl` | string | Replacement string | **Returns:** `string` ### Split ```lua local parts = re:split("a,b,c", -1) ``` | Parameter | Type | Description | |-----------|------|-------------| | `s` | string | String to split | | `n` | integer | Max parts, -1 for all | **Returns:** `string[]` ### Subexpression Count ```lua local count = re:num_subexp() ``` **Returns:** `number` ### Subexpression Names ```lua local names = re:subexp_names() ``` **Returns:** `string[]` ### Pattern String ```lua local pattern = re:string() ``` **Returns:** `string` ## Text Diffing Compare text versions and generate patches. Based on [go-diff](https://github.com/sergi/go-diff) (Google's diff-match-patch). ### Create Differ ```lua local diff, err = text.diff.new() local diff, err = text.diff.new(options) ``` **Returns:** `Differ, error` #### Options {id="diff-options"} | Field | Type | Default | Description | |-------|------|---------|-------------| | `diff_timeout` | number | 1.0 | Timeout in seconds | | `diff_edit_cost` | integer | 4 | Cost of an empty edit | | `match_threshold` | number | 0.5 | Match tolerance 0-1 | | `match_distance` | integer | 1000 | Distance to search for match | | `patch_delete_threshold` | number | 0.5 | Delete threshold | | `patch_margin` | integer | 4 | Context margin | ### Compare Find differences between two texts. Returns an array of operations describing how to transform text1 into text2. ```lua local diff, _ = text.diff.new() local diffs, err = diff:compare("hello world", "hello there") -- diffs contains: -- {operation = "equal", text = "hello "} -- {operation = "delete", text = "world"} -- {operation = "insert", text = "there"} ``` | Parameter | Type | Description | |-----------|------|-------------| | `text1` | string | Original text | | `text2` | string | Modified text | **Returns:** `table, error` (array of {operation, text}) Operations: `"equal"`, `"delete"`, `"insert"` ### Summarize Count characters changed between versions. ```lua local diffs, _ = diff:compare("hello world", "hello there") local summary = diff:summarize(diffs) -- summary.equals = 6 (characters unchanged) -- summary.deletions = 5 (characters removed) -- summary.insertions = 5 (characters added) ``` | Parameter | Type | Description | |-----------|------|-------------| | `diffs` | table | Diff array from compare | **Returns:** `table` ({insertions, deletions, equals}) ### Pretty Text Format diff with ANSI colors for terminal display. ```lua local formatted, err = diff:pretty_text(diffs) print(formatted) ``` | Parameter | Type | Description | |-----------|------|-------------| | `diffs` | table | Diff array from compare | **Returns:** `string, error` ### Pretty HTML Format diff as HTML with `` and `` tags. ```lua local html, err = diff:pretty_html(diffs) -- Returns: "hello worldthere" ``` | Parameter | Type | Description | |-----------|------|-------------| | `diffs` | table | Diff array from compare | **Returns:** `string, error` ### Create Patches Generate patches that can be applied to transform one text into another. Patches can be serialized and applied later. ```lua local text1 = "The quick brown fox jumps over the lazy dog" local text2 = "The quick red fox jumps over the lazy cat" local patches, err = diff:patch_make(text1, text2) ``` | Parameter | Type | Description | |-----------|------|-------------| | `text1` | string | Original text | | `text2` | string | Modified text | **Returns:** `table, error` ### Apply Patches Apply patches to transform text. Returns the result and whether all patches applied successfully. ```lua local result, success = diff:patch_apply(patches, text1) -- result = "The quick red fox jumps over the lazy cat" -- success = true ``` | Parameter | Type | Description | |-----------|------|-------------| | `patches` | table | Patches from patch_make | | `text` | string | Text to apply patches to | **Returns:** `string, boolean` ## Text Splitting Split large documents into smaller chunks while preserving semantic boundaries. Based on [langchaingo](https://github.com/tmc/langchaingo) text splitter. ### Recursive Splitter Splits text using a hierarchy of separators. First tries to split on double newlines (paragraphs), then single newlines, then spaces, then characters. Falls back to smaller separators when chunks exceed the size limit. ```lua local splitter, err = text.splitter.recursive({ chunk_size = 1000, chunk_overlap = 100 }) local long_text = "This is a long text that needs splitting..." local chunks, err = splitter:split_text(long_text) -- chunks = {"This is a long...", "...text that needs...", "...splitting..."} ``` **Returns:** `Splitter, error` #### Options {id="recursive-splitter-options"} | Field | Type | Default | Description | |-------|------|---------|-------------| | `chunk_size` | integer | 4000 | Max characters per chunk | | `chunk_overlap` | integer | 200 | Characters repeated between adjacent chunks | | `keep_separator` | boolean | false | Keep separators in output | | `separators` | string[] | nil | Custom separator list | ### Markdown Splitter Splits markdown documents while respecting structure. Tries to keep headings with their content, code blocks intact, and table rows together. ```lua local splitter, err = text.splitter.markdown({ chunk_size = 2000, code_blocks = true, heading_hierarchy = true }) local readme = fs.read("README.md") local chunks, err = splitter:split_text(readme) ``` **Returns:** `Splitter, error` #### Options {id="markdown-splitter-options"} | Field | Type | Default | Description | |-------|------|---------|-------------| | `chunk_size` | integer | 4000 | Max characters per chunk | | `chunk_overlap` | integer | 200 | Characters repeated between adjacent chunks | | `code_blocks` | boolean | false | Keep code blocks together | | `reference_links` | boolean | false | Preserve reference links | | `heading_hierarchy` | boolean | false | Respect heading levels | | `join_table_rows` | boolean | false | Keep table rows together | ### Split Text Split a single document into an array of chunks. ```lua local chunks, err = splitter:split_text(document) for i, chunk in ipairs(chunks) do -- Process each chunk (e.g., create embedding, send to LLM) process(chunk) end ``` | Parameter | Type | Description | |-----------|------|-------------| | `text` | string | Text to split | **Returns:** `string[], error` ### Split Batch Split multiple documents while preserving their metadata. Each input document can produce multiple output chunks. All chunks inherit the metadata from their source document. ```lua -- Input: pages from a PDF with page numbers local pages = { {content = "First page content...", metadata = {page = 1}}, {content = "Second page content...", metadata = {page = 2}} } local chunks, err = splitter:split_batch(pages) -- Output: each chunk knows which page it came from for _, chunk in ipairs(chunks) do print("Page " .. chunk.metadata.page .. ": " .. chunk.content:sub(1, 50)) end ``` | Parameter | Type | Description | |-----------|------|-------------| | `pages` | table | Array of {content, metadata} | **Returns:** `table, error` (array of {content, metadata}) ## Errors | Condition | Kind | Retryable | |-----------|------|-----------| | Invalid pattern syntax | `errors.INVALID` | no | | Internal error | `errors.INTERNAL` | no | See [Error Handling](lua/core/errors.md) for working with errors. ## Navigation Previous: TTY (lua/system/tty) Next: Template (lua/text/template)