Spring WebClient response with Polish characters

Question

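I am using Spring WebClient to fetch HTML. The response contains Polish characters such as ą, ę, ż.

After calling the service, I expect the response to look like this: <div>plan zajęć</div>

But the actual response looks like this: <div>plan zaj�ć</div> (this symbol replaces all the Polish characters).

Here is the WebClient bean configuration: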

@Bean
WebClient webClient() {
    return WebClient.builder()
            .build();
}

This is how I use it:

Optional<String> resp = webClient.get()
        .uri(uri)
        .retrieve()
        .bodyToMono(String.class)
        .blockOptional();

This is the link to the page I am trying to scrape: https://plan.polsl.pl/plan.php?winW=1000&winH=1000&type=0&id=343126158

I don't know what to change in the WebClient configuration to get the expected result, so I am asking for help.

Answer 1

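Please show how you are using WebClient. I'm not familiar with Polish characters, but most likely your problem is related to the encoding of the response.

You can try specifying UTF_8 as the accepted charset and see if that helps: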

WebClient webClient = WebClient.create();
Mono<String> response = webClient.get()
    .uri(uri)
    .acceptCharset(StandardCharsets.UTF_8)
    .retrieve()
    .bodyToMono(String.class);
    
String responseString = response.block();

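== Update, January 2, 2023 ==

Note that Java strings are Unicode text (stored internally as UTF-16), so the response bytes have to be decoded with the charset they were actually written in. By default WebClient decodes a String body as UTF-8 when the response does not declare a charset, which is why we tried asking the web server to return a UTF-8 document. Unfortunately, the web server you specified returns ISO-8859-2 even when WebClient requests UTF-8. You have to decode the response body as ISO-8859-2 yourself. Here is sample code that does this. I tested it against your web server.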

WebClient webClient = WebClient.create();
Mono<ByteArrayResource> responseBody = webClient.get()
    .uri(uri)
    .retrieve()
    .bodyToMono(ByteArrayResource.class);

String responseString = new String(responseBody.block().getByteArray(), Charset.forName("ISO-8859-2"));
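
To see why the replacement character shows up, here is a small sketch that needs no network access. It assumes the server sends its bytes in ISO-8859-2, as the page above does; the class name CharsetMismatchDemo and the decode helper are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when ISO-8859-2 bytes are decoded as UTF-8.
class CharsetMismatchDemo {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // new String(bytes, charset) replaces malformed input with U+FFFD instead of throwing.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "plan zajęć" written with Unicode escapes so the source file encoding does not matter.
        byte[] body = "<div>plan zaj\u0119\u0107</div>".getBytes(LATIN2);

        // Wrong charset: the Polish letters are not valid UTF-8 sequences.
        System.out.println(decode(body, StandardCharsets.UTF_8));
        // Right charset: the original text comes back intact.
        System.out.println(decode(body, LATIN2));
    }
}
```

The ASCII part of the markup survives either way because ISO-8859-2 and UTF-8 agree on the ASCII range; only the bytes above 0x7F are misread.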

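If you are building a general-purpose web crawler, rather than hard-coding ISO-8859-2 as above, you will want to read the charset from the Content-Type header. Most web servers report the media type together with the charset in Content-Type. You can then pass the correct charset instead of the hard-coded ISO-8859-2 in the code above. Here is sample code for finding the charset.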

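import java.nio.charset.Charset;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

WebClient webClient = WebClient.create();

// exchange() is deprecated since Spring Framework 5.3, so this uses exchangeToMono()
Mono<String> charsetName = webClient
    .get()
    .uri("http://example.com")
    .exchangeToMono(res -> {
        // contentType() returns Optional<MediaType>; getCharset() is null
        // when the Content-Type header carries no charset parameter
        String charset = res.headers().contentType()
                .map(MediaType::getCharset)
                .map(Charset::name)
                .orElse(null);

        // a Mono cannot emit null, so "no charset declared" becomes an empty Mono
        return Mono.justOrEmpty(charset);
    });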

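Unfortunately, the web server you specified does not report the charset in the Content-Type header either. In that case you may need to look elsewhere in the response to determine the character encoding.

One place you can check is the charset attribute of the <meta> element in the HTML document. Some pages include a <meta> element whose charset attribute specifies the document's character encoding. This is how I found out that the document you specified uses ISO-8859-2.

WebClient has no simple way to extract the charset from the <meta> tag, but you can use a regular expression to extract it. Here is sample code: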

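WebClient webClient = WebClient.create();

Mono<String> responseBody = webClient
    .get()
    .uri("http://example.com")
    .retrieve()
    .bodyToMono(String.class);

// flatMap + justOrEmpty: returning null from map() would fail with a NullPointerException
Mono<String> charsetName = responseBody.flatMap(html -> {
    String charset = null;

    // use a regular expression to extract the charset attribute from the <meta> element;
    // the capture stops at quotes, whitespace, '/', ';' and '>' so that
    // <meta charset=utf-8 /> yields "utf-8"
    Matcher m = Pattern.compile("<meta[^>]+charset\\s*=\\s*[\"']?([^\"'>\\s/;]+)",
            Pattern.CASE_INSENSITIVE).matcher(html);
    if (m.find()) {
        charset = m.group(1);
    }

    return Mono.justOrEmpty(charset);
});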
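
Putting the pieces together, a crawler could try the Content-Type header first, then the <meta> tag, then a default. The following is only a sketch using plain Java with no Spring dependency. HtmlCharsetSniffer, its method names and the 4096-byte scan limit are all assumptions made for illustration, and it works only for ASCII-compatible encodings. The body bytes would come from something like the ByteArrayResource example above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring: picks a charset for an HTML body.
class HtmlCharsetSniffer {
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]+charset\\s*=\\s*[\"']?([^\"'>\\s/;]+)",
                    Pattern.CASE_INSENSITIVE);

    // Returns the first charset named in the text, if the JVM supports it.
    static Optional<Charset> find(Pattern pattern, String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // illegal or unsupported charset name: treat it as "not declared"
            return Optional.empty();
        }
    }

    // Order of preference: Content-Type header, then <meta>, then the fallback.
    static Charset detect(byte[] body, String contentTypeHeader, Charset fallback) {
        Optional<Charset> fromHeader = find(HEADER_CHARSET, contentTypeHeader);
        if (fromHeader.isPresent()) {
            return fromHeader.get();
        }
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives intact.
        int scan = Math.min(body.length, 4096); // assumed limit: only look at the start
        String head = new String(body, 0, scan, StandardCharsets.ISO_8859_1);
        return find(META_CHARSET, head).orElse(fallback);
    }

    static String decode(byte[] body, String contentTypeHeader, Charset fallback) {
        return new String(body, detect(body, contentTypeHeader, fallback));
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-2\"></head>"
                + "<body><div>plan zaj\u0119\u0107</div></body></html>";
        byte[] body = html.getBytes(latin2);

        // The header has no charset parameter, so the <meta> tag decides.
        System.out.println(detect(body, "text/html", StandardCharsets.UTF_8));
        System.out.println(decode(body, "text/html", StandardCharsets.UTF_8).equals(html));
    }
}
```

Reading the bytes first and decoding afterwards avoids decoding the body twice, which is what happens if you fetch it as a String only to discover the charset.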
