Hi,
I'm seeing a strange encoding issue that started with libxml2 2.11.1+ (released a week ago, https://gitlab.gnome.org/GNOME/libxml2/-/tags) when using the libxml Rust crate 0.3.2.
My sample:
- I have the following HTML document:
<data>café</data>
- I evaluate the following XPath expression:
normalize-space(//data)
Sample code:
use std::ffi::CStr;
use std::os::raw;
use libxml::parser::{Parser, ParserOptions};
use libxml::xpath::Context;
fn main() {
    let parser = Parser::default_html();
    let options = ParserOptions { encoding: Some("utf-8"), ..Default::default() };
    let data = "<data>café</data>";
    let doc = parser.parse_string_with_options(data, options).unwrap();
    let context = Context::new(&doc).unwrap();
    let result = context.evaluate("normalize-space(//data)").unwrap();
    assert_eq!(unsafe { *result.ptr }.type_, libxml::bindings::xmlXPathObjectType_XPATH_STRING);
    let value = unsafe { *result.ptr }.stringval;
    let value = value as *const raw::c_char;
    let value = unsafe { CStr::from_ptr(value) };
    let value = value.to_string_lossy();
    println!("{value}")
}
With libxml 2.11.0, the value printed is café, with libxml 2.11.1 the value printed is café:
- With libxml 2.11.0:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café
- With libxml 2.11.3:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café
I have the impression that the encoding value of ParserOptions is not evaluated correctly by the crate (note: to reproduce the bug, you have to use Parser::default_html() and not Parser::default()).
To confirm this, I've tested the "equivalent" code in plain C with libxml 2.11.3:
#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
int main(void) {
    xmlDocPtr doc = NULL;
    xmlXPathContextPtr context = NULL;
    xmlXPathObjectPtr result = NULL;
    // <data>café</data> in UTF-8. The array has no trailing NUL,
    // so its length is taken with sizeof rather than strlen.
    const char data[] = {0x3c, 0x64, 0x61, 0x74, 0x61, 0x3e, 0x63, 0x61, 0x66, (char) 0xc3, (char) 0xa9, 0x3c, 0x2f, 0x64, 0x61,
                         0x74, 0x61, 0x3e};
    doc = htmlReadMemory(data, (int) sizeof(data), NULL, "utf-8",
                         HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    // Evaluate the XPath expression
    context = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((const xmlChar *) "normalize-space(//data)", context);
    if (result != NULL && result->type == XPATH_STRING) {
        printf("%s\n", result->stringval);
    }
    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return 0;
}
- With libxml 2.11.0:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib -l xml2 test.c
$ ./a.out
$ café
- With libxml 2.11.3:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib -l xml2 test.c
$ ./a.out
$ café
My suspicion is in
Line 292 in a10a5a6
pub fn parse_string_with_options<Bytes: AsRef<[u8]>>(
When I debug the following code:
// Process encoding.
let encoding_cstring: Option<CString> =
    parser_options.encoding.map(|v| CString::new(v).unwrap());
let encoding_ptr = match encoding_cstring {
    Some(v) => v.as_ptr(),
    None => DEFAULT_ENCODING,
};
// Process url.
let url_ptr = DEFAULT_URL;
If the parser encoding is initialized with Some("utf-8"), encoding_ptr is not valid just before // Process url (it points to a null char).
So the call to the htmlReadMemory binding is made with no encoding... The unsafe part of the code is at the limit of my Rust understanding, so I'm unable to tell whether something is wrong here. I hope my issue is clear and, as I should have said at the start, thank you for your work on this crate!
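A possible explanation for the null char, offered as a sketch rather than a confirmed diagnosis: matching on encoding_cstring by value moves the CString into v, and v is dropped at the end of the match arm, so the pointer returned by as_ptr() dangles. In current standard library versions, dropping a CString also overwrites its first byte with NUL, which would match what the debugger shows. The snippet below uses only the standard library; encoding_as_ptr is a hypothetical helper, and the null pointer stands in for DEFAULT_ENCODING (an assumption about the crate's default).

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper: borrows the optional CString instead of consuming it,
/// so the returned pointer stays valid for as long as `owner` is alive.
fn encoding_as_ptr(owner: &Option<CString>) -> *const c_char {
    match owner {
        // `v` is a reference here, so nothing is dropped at the end of the arm.
        Some(v) => v.as_ptr(),
        // Assumption: a null pointer plays the role of DEFAULT_ENCODING.
        None => ptr::null(),
    }
}

fn main() {
    let encoding: Option<&str> = Some("utf-8");
    // The CString owner must outlive every use of the raw pointer.
    let encoding_cstring: Option<CString> = encoding.map(|v| CString::new(v).unwrap());
    let encoding_ptr = encoding_as_ptr(&encoding_cstring);

    // The owner is still in scope, so reading through the pointer is sound.
    let seen = unsafe { CStr::from_ptr(encoding_ptr) };
    println!("{}", seen.to_str().unwrap());
}
```

In the crate, the same idea would amount to matching on &encoding_cstring (or calling as_ref() on it) so that the CString lives until after the call to htmlReadMemory.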
Regards,
Jc