Splitting markdown-formatted outlines in an odd way #14

@nick-youngblut

My markdown doc is structured as:

# header1

## header2

Some text

## header2 

Some more text


### Step 0: this is pre-planning step

* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

etc...

My code:

import semchunk

# `text` holds the markdown document shown above
chunker = semchunk.chunkerify('gpt-4', chunk_size=2000)
chunks = chunker(text)

I would expect the chunker to split at headers where possible, so that each chunk starts with a header. Instead, the chunks generally END with a header, namely the header of the following section.

An example chunk:

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

...instead of:

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list
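
For illustration, the expected grouping (each chunk starting at its own header) can be sketched with a plain regex pre-split. This is only a minimal sketch of the desired behaviour, not semchunk's API or its splitting logic; the helper name `split_before_headers` is hypothetical and uses only the standard library `re` module.

```python
import re

# Zero-width match at the start of any line that begins a markdown ATX header
# (1 to 6 '#' characters followed by a space). Assumption: '#' lines inside
# fenced code blocks are not special-cased and would also be treated as headers.
HEADER_START = re.compile(r"^(?=#{1,6} )", flags=re.MULTILINE)


def split_before_headers(text: str) -> list[str]:
    """Split markdown text so that every section begins with its header.

    Hypothetical helper for illustration; requires Python 3.7+ because it
    relies on re.split accepting empty (zero-width) matches.
    """
    sections = HEADER_START.split(text)
    # Drop empty pieces (e.g. the one before a header at position 0)
    # and trim surrounding whitespace from each section.
    return [section.strip() for section in sections if section.strip()]
```

Each resulting section keeps its header as the first line, followed by the body up to (but not including) the next header. Sections that are still too long could then be passed individually to the `chunker` from the snippet above, though whether that matches the intended use of the library is an assumption.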

Any idea why this is happening?
