Encoding issue with libxml 2.11.1, 2.11.2, 2.11.3 (OK with libxml 2.11.0) #111

@jcamiel

Hi,

I'm seeing a strange encoding issue that started with libxml 2.11.1 (released a week ago, https://gitlab.gnome.org/GNOME/libxml2/-/tags) when using the libxml Rust crate 0.3.2.

My sample:

  • I have the following HTML document: <data>café</data>
  • I evaluate the following XPath expression: normalize-space(//data)

Sample code:

use std::ffi::CStr;
use std::os::raw;
use libxml::parser::{Parser, ParserOptions};
use libxml::xpath::Context;

fn main() {
    let parser = Parser::default_html();
    let options = ParserOptions { encoding: Some("utf-8"), ..Default::default()};
    let data = "<data>café</data>";
    let doc = parser.parse_string_with_options(data, options).unwrap();

    let context = Context::new(&doc).unwrap();
    let result = context.evaluate("normalize-space(//data)").unwrap();

    assert_eq!(unsafe { *result.ptr }.type_, libxml::bindings::xmlXPathObjectType_XPATH_STRING);
    let value = unsafe { *result.ptr }.stringval;
    let value = value as *const raw::c_char;
    let value = unsafe { CStr::from_ptr(value) };
    let value = value.to_string_lossy();
    println!("{value}")
}

With libxml 2.11.0, the value printed is café; with libxml 2.11.1 and later, the value printed is café:

  • With libxml 2.11.0:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café
  • With libxml 2.11.3:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café

My impression is that the encoding value of ParserOptions is not passed correctly through the crate. (Note: to reproduce the bug, you have to use Parser::default_html(), not Parser::default().)

To confirm this, I tested the "equivalent" code in plain C, which still prints café with libxml 2.11.3:

#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main() {
    xmlDocPtr doc = NULL;
    xmlXPathContextPtr context = NULL;
    xmlXPathObjectPtr result = NULL;

    // <data>café</data> in UTF-8 (0xc3 0xa9 encodes "é"). A string literal is
    // NUL-terminated, so strlen() is well defined here.
    const char data[] = "<data>caf\xc3\xa9</data>";
    doc = htmlReadMemory(data, (int) strlen(data), NULL, "utf-8",
                         HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Creating result request
    context = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((const unsigned char *) "normalize-space(//data)", context);
    if (result->type == XPATH_STRING) {
        printf("%s\n", result->stringval);
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return 0;
}
  • With libxml 2.11.0:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib -l xml2 test.c
$ ./a.out
$ café
  • With libxml 2.11.3:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib -l xml2 test.c
$ ./a.out
$ café

My suspicion is that the problem is in

pub fn parse_string_with_options<Bytes: AsRef<[u8]>>(

When I debug the following code:

   // Process encoding.
    let encoding_cstring: Option<CString> =
      parser_options.encoding.map(|v| CString::new(v).unwrap());
    let encoding_ptr = match encoding_cstring {
      Some(v) => v.as_ptr(),
      None => DEFAULT_ENCODING,
    };

    // Process url.
    let url_ptr = DEFAULT_URL;

If the parser encoding is initialized with Some("utf-8"), encoding_ptr is no longer valid just before // Process url (it points to a null char).
So the binding htmlReadMemory is called with no encoding. The unsafe part of the code is at the limit of my Rust understanding, so I can't tell whether something is wrong here. I hope my issue is clear and, as I should have said at the start, thank you for your work on this crate!
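
One possible explanation, offered as an assumption rather than a confirmed diagnosis: match encoding_cstring moves the CString into the Some(v) arm, so v is dropped when the arm ends and the pointer returned by as_ptr() is left dangling. The documentation of CString::as_ptr warns about this pattern. A minimal sketch of a variant that borrows the CString so it outlives the pointer follows. It uses only the standard library. The names encoding_ptr and read_back are hypothetical, and a null pointer stands in for the crate's DEFAULT_ENCODING, whose actual value is not shown above.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper: borrow the CString instead of moving it, so the
/// returned pointer stays valid for as long as `encoding` is alive.
/// Assumption: a null pointer plays the role of the crate's DEFAULT_ENCODING.
fn encoding_ptr(encoding: &Option<CString>) -> *const c_char {
    match encoding {
        Some(v) => v.as_ptr(), // `v` is a reference here; nothing is dropped
        None => ptr::null(),
    }
}

/// Hypothetical check: build the CString, take the pointer, and read the
/// string back through it while the owner is still in scope.
fn read_back(encoding: Option<&str>) -> Option<String> {
    // The owner lives until the end of this function.
    let encoding_cstring: Option<CString> =
        encoding.map(|v| CString::new(v).unwrap());
    let raw = encoding_ptr(&encoding_cstring);
    if raw.is_null() {
        return None;
    }
    // SAFETY: `raw` points into `encoding_cstring`, which is still alive.
    let s = unsafe { CStr::from_ptr(raw) };
    Some(s.to_string_lossy().into_owned())
}
```

With this shape, read_back(Some("utf-8")) reads the same string back through the pointer. In the crate, the equivalent change would be to match on a reference to encoding_cstring (or use as_ref()) so the CString lives until after the call to htmlReadMemory.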

Regards,

Jc
