feat(backend): add generic options support and HTML image handling modes #2011
base: main
…ove HTML image handling
…are set correctly
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
|
❌ DCO Check Failed

Hi @Leg0shii, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Leg0shii <dragonsaremyfavourite@gmail.com>
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 17013de368e5aff7b2f4c24abdc438710a8ce088
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: c07b89be5d3b1f059c147e8c6de89f8b46ca1c99
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: fe6557e149579ceced84f7ee47a06aecdbbfc152
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: db974f855c57d7b05e4db625effe0c1c10e9ca29
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 6355b20b2e911181a07f50644152ad6cd82fb55e
    I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 125b33988a5d48a522b7f781a155174347b3845b"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
…I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 17013de368e5aff7b2f4c24abdc438710a8ce088
I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: c07b89be5d3b1f059c147e8c6de89f8b46ca1c99
I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: fe6557e149579ceced84f7ee47a06aecdbbfc152
I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: db974f855c57d7b05e4db625effe0c1c10e9ca29
I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 6355b20b2e911181a07f50644152ad6cd82fb55e
I, Leg0shii <dragonsaremyfavourite@gmail.com>, hereby add my Signed-off-by to this commit: 125b339
Signed-off-by: Leg0shii <dragonsaremyfavourite@gmail.com>
|
Are documentation and examples necessary? If so, where should I add them?
|
Hello @Leg0shii In terms of design:
Other technical aspects:
Further improvements, out of the scope of this task, and besides those that you already listed:
|
Please see the comment above.
Besides docstrings, no further documentation is needed in this PR.
|
Hello, thank you for the feedback, I really appreciate it!
@Leg0shii I will take it from here this week. Wish you a good recovery!
|
I'm sorry for the late reply. I have invited you, @ceberam.
|
Waiting for this PR to be merged, @ceberam @dolfim-ibm.
@irajank Thanks for your interest in this feature. We had to do an extensive refactoring of the initial implementation in this PR, but we are in the testing phase at the moment, so hopefully we will release it within this week.
Hey, thanks for the speedy response. Looking forward to the merge as soon as possible.
|
@ceberam @dolfim-ibm Any idea when we can expect this PR to be merged?
@punit1108 We expect to have it merged by the end of today.
Description
This PR enhances the DeclarativeDocumentBackend to support configurable backend options and significantly improves HTML image handling capabilities in Docling.
Key Changes:
Made DeclarativeDocumentBackend generic and configurable:
- … TBackendOptions to support backend-specific options
- … FormatOption with automatic defaults
Introduced configurable image handling for HTML:
Enhanced image source support:
- … //example.com)
Improved test infrastructure:
Usage Example:
Breaking Changes:
None - backward compatible with optional backend options.
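
The backward-compatibility claim can be illustrated with a minimal, self-contained sketch. It does not use Docling's actual classes: apart from the type-variable name TBackendOptions, which appears in the description above, every class, field, and value here (BaseBackendOptions, HtmlOptionsSketch, image_mode, DeclarativeBackendSketch, HtmlBackendSketch) is a hypothetical stand-in chosen for illustration. The idea shown is only that a backend generic over its options type can build defaults itself, so callers that pass no options keep working.

```python
from dataclasses import dataclass
from typing import Generic, Optional, Type, TypeVar


@dataclass
class BaseBackendOptions:
    """Hypothetical base class for backend-specific options."""


@dataclass
class HtmlOptionsSketch(BaseBackendOptions):
    """Hypothetical HTML options; the field name and values are illustrative."""

    image_mode: str = "placeholder"


TBackendOptions = TypeVar("TBackendOptions", bound=BaseBackendOptions)


class DeclarativeBackendSketch(Generic[TBackendOptions]):
    """Stand-in for a declarative backend that accepts optional typed options."""

    # Subclasses name the options class used to build defaults.
    options_type: Type[BaseBackendOptions] = BaseBackendOptions

    def __init__(self, source: str, options: Optional[TBackendOptions] = None) -> None:
        self.source = source
        # Callers that pass no options keep working: defaults are created here.
        self.options = options if options is not None else self.options_type()


class HtmlBackendSketch(DeclarativeBackendSketch[HtmlOptionsSketch]):
    options_type = HtmlOptionsSketch


# Existing call style, no options supplied.
legacy = HtmlBackendSketch("page.html")
# New call style, options supplied explicitly.
custom = HtmlBackendSketch("page.html", HtmlOptionsSketch(image_mode="embedded"))
```

In this sketch the options argument is optional by construction, which is one way to keep older call sites valid while letting new ones opt in to backend-specific settings.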
Remaining Issues
- resolve_source_to_stream returns only the end portion of the URL (e.g., "about") from full URLs like https://www.website.com/section/about. This prevents the HTML backend from properly resolving relative image paths, since the full base URL is needed for correct image downloading.
- <!-- 🖼️❌ Image not available. Please use PdfPipelineOptions(generate_picture_images=True) --> (This might occur for relative-path images, SVG images, or other cases where an image can't be opened.)
- 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy

I believe that these issues fall outside the scope of this PR and should be handled in future PRs.
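
To make the first remaining issue concrete, here is a small sketch using only Python's standard library. It is not the HTML backend's code: the helper names resolve_image_url and is_downloadable and the image paths are hypothetical. It shows how a relative or protocol-relative image source resolves against the full page URL, and how the same source stays unresolvable when only the trailing "about" is available as the base.

```python
from urllib.parse import urljoin, urlparse


def resolve_image_url(base_url: str, src: str) -> str:
    """Resolve an <img src> value against the URL of the page it came from."""
    return urljoin(base_url, src)


def is_downloadable(url: str) -> bool:
    """True when the URL is absolute http(s) and can be fetched directly."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


full_base = "https://www.website.com/section/about"
truncated_base = "about"  # what remains when only the end portion of the URL is kept

# With the full base URL, relative and protocol-relative sources become absolute.
relative = resolve_image_url(full_base, "images/logo.png")
protocol_relative = resolve_image_url(full_base, "//example.com/pic.png")

# With the truncated base, the result has no scheme or host and cannot be fetched.
broken = resolve_image_url(truncated_base, "images/logo.png")
```

Under these assumptions, a backend that receives only the truncated name has no way to recover the host, which is consistent with the behavior described in the first item.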
Issue resolved by this Pull Request:
Resolves #1963
Checklist: