
Parsing floats in Rust from Fortran formats

I'm rewriting a C++ parser in Rust for a legacy ASCII data format. Real number values in this format may be stored in any format that Fortran recognizes. Unfortunately, Fortran recognizes some formats that Rust (and most other languages) does not. For example, the value 101.01 might be represented as

  • 101.01
  • 1.0101E2
  • 101.01e0
  • 101.01D0
  • 101.01d0
  • 101.01+0
  • 1010.1-1

The first three are all natively recognized by Rust. The remaining four pose a challenge. In C++, we use the following routine to parse these values:

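#include <cctype>
#include <cmath>
#include <cstdlib>
#include <string>

double parse(const std::string& s) {
  char* p;
  // strtod stops at the first character it cannot consume, e.g. 'D' or a bare sign.
  const double significand = std::strtod(s.c_str(), &p);
  // Skip a single exponent letter if present, then read the exponent in base 10.
  const long exponent = (*p == '\0') ?
                          0 : std::isalpha(static_cast<unsigned char>(*p)) ?
                            std::strtol(p + 1, nullptr, 10) :
                              std::strtol(p, nullptr, 10);
  return significand * std::pow(10, exponent);
}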

After reviewing the Rust documentation, it doesn't appear that the standard library offers partial string parsing in the vein of strtod and strtol. For performance reasons, I'd like to avoid taking multiple passes over the string or using regular expressions.

This would have been a comment on Veedrac's answer, but it got a bit long for a comment.

As Veedrac explains, parsing floats accurately is hard. The implementation in the standard library is completely accurate and reasonably well optimized. In particular, it's not much slower than the naive inaccurate algorithm for most inputs where the naive algorithm works. You should use it. Full disclosure: I wrote it.

Where I disagree with Veedrac is how to proceed if you want to reuse that code. Ripping it out of the standard library is a bad idea. It's huge, about 2.5k lines of code, and it still changes and is improved occasionally, although rarely and mostly in very minor ways. But one day I'll find the time to rewrite the slow path to be better and faster, promised. If you rip out this code, you would have to take the core::num::dec2flt module and modify the parse submodule to recognize other exponents. Of course, then you won't automatically benefit from future improvements, which is a shame if you're interested in performance.

The sanest way would be to translate the other formats to the format supported by Rust. If it's a d, D or a bare +, you can simply replace it with an e and pass the resulting string on to the standard parser. Only in a case like 1010.1-1 will you need to insert an e and shift the exponent part of the string. This should not cost much performance. Float strings are short (at most 20 or so bytes, often much less) and the actual conversion does a good chunk of work per byte. This is true for your C++ code as well, because strtod is accurate in glibc too. Or at least it's trying to be; it can't fix the ad hoc algorithm built around it.
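The translation described above could be sketched roughly as follows. This is only an illustration, not code from the standard library: the name parse_fortran_float is hypothetical, and the sketch assumes the exponent marker is a d/D or a sign that is neither the leading sign nor directly preceded by an exponent letter. For simplicity it inserts an e in front of a bare sign in both the + and - cases rather than replacing the +.

```rust
use std::num::ParseFloatError;

/// Hypothetical helper: rewrite a Fortran-style real into the `e` notation
/// that Rust understands, then hand it to the standard parser.
fn parse_fortran_float(s: &str) -> Result<f64, ParseFloatError> {
    let s = s.trim();
    let bytes = s.as_bytes();

    // Look for the exponent marker: a 'd'/'D', or a sign that is not the
    // leading sign and does not directly follow an exponent letter.
    let marker = bytes.iter().enumerate().skip(1).find(|&(i, &b)| match b {
        b'd' | b'D' => true,
        b'+' | b'-' => !matches!(bytes[i - 1], b'e' | b'E' | b'd' | b'D'),
        _ => false,
    });

    match marker {
        // Already in a form Rust accepts (or invalid): no copy needed.
        None => s.parse(),
        Some((i, &b)) => {
            // Drop the marker only when it is a letter; a bare sign stays in
            // place and becomes the sign of the exponent.
            let rest = if b == b'd' || b == b'D' { &s[i + 1..] } else { &s[i..] };
            let mut normalized = String::with_capacity(s.len() + 1);
            normalized.push_str(&s[..i]);
            normalized.push('e');
            normalized.push_str(rest);
            normalized.parse()
        }
    }
}
```

With this sketch, all seven spellings from the question should parse to the same value as the literal 101.01, since the actual conversion is still done by the standard library. The one short allocation per irregular input could be replaced by a fixed-size stack buffer if it ever shows up in a profile.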

Another possibility is to use FFI to call C's strtod. Use the libc crate and call libc::strtod. This requires some contortions to translate from &str to raw pointers to c_char, and it will handle interior 0 bytes badly, but the code you show is not terribly robust anyway. This would allow you to translate your algorithm to Rust with identical performance, semantics and (in)accuracy.
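A minimal sketch of the FFI route might look like this. To keep it free of dependencies, strtod is declared by hand here instead of coming from the libc crate, which is an assumption of the sketch; with the crate you would call libc::strtod instead. The helper names strtod_prefix and parse_via_strtod are hypothetical. The unsafe extern syntax needs Rust 1.82 or newer; on older toolchains, drop the unsafe keyword in front of extern.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

// Hand-written declaration of C's strtod (assumption: the platform C library
// is linked, as it is for ordinary Rust programs on Unix-like systems).
unsafe extern "C" {
    fn strtod(s: *const c_char, endp: *mut *mut c_char) -> f64;
}

/// Hypothetical helper: parse the longest numeric prefix with strtod and
/// return the value together with the number of bytes consumed.
fn strtod_prefix(s: &str) -> Option<(f64, usize)> {
    // CString::new fails on interior 0 bytes, so those inputs are rejected.
    let c = CString::new(s).ok()?;
    let start = c.as_ptr();
    let mut end: *mut c_char = std::ptr::null_mut();
    // SAFETY: `start` points to a NUL-terminated buffer that outlives the
    // call, and `end` is a valid place for strtod to store its end pointer.
    let value = unsafe { strtod(start, &mut end) };
    let consumed = end as usize - start as usize;
    if consumed == 0 { None } else { Some((value, consumed)) }
}

/// Hypothetical port of the C++ routine, with the same built-in inaccuracy
/// from scaling by a power of ten after the fact.
fn parse_via_strtod(s: &str) -> Option<f64> {
    let (significand, consumed) = strtod_prefix(s)?;
    let rest = &s[consumed..];
    if rest.is_empty() {
        return Some(significand);
    }
    // Skip a single exponent letter such as 'D' or 'd', if present.
    let rest = match rest.chars().next() {
        Some(ch) if ch.is_ascii_alphabetic() => &rest[1..],
        _ => rest,
    };
    let exponent: i32 = rest.parse().ok()?;
    Some(significand * 10f64.powi(exponent))
}
```

Unlike strtol in the C++ version, the integer parse here rejects trailing garbage instead of silently ignoring it, so this sketch is slightly stricter than the original.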

Your example in C++ does not give perfectly accurate results, but Rust's float parsing is intended to be perfectly accurate, and as such is slower than you might need.

If you implement approximate parsing manually, it will likely come out faster than any other technique available. A quick test I did locally suggests you can easily get a factor of 5 over the performance of the standard library's parse method.
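For illustration, a hand-written approximate parser could look something like the sketch below. It is not the code used for the quick test mentioned above; the name parse_approx is hypothetical, and the sketch assumes significands of at most 19 or so digits (longer ones overflow the integer accumulator and are rejected). Results may be off in the last bits, and very small values may flush to zero.

```rust
/// Hypothetical single-pass parser: accumulate the digits as an integer,
/// track the decimal scale, and apply one power of ten at the end.
fn parse_approx(s: &str) -> Option<f64> {
    let b = s.trim().as_bytes();
    let mut i = 0;

    // Optional leading sign.
    let negative = b.first() == Some(&b'-');
    if matches!(b.first(), Some(&b'+') | Some(&b'-')) {
        i += 1;
    }

    // Significand digits, with at most one decimal point.
    let mut mantissa: u64 = 0;
    let mut scale: i32 = 0;
    let mut seen_point = false;
    let mut digits = 0;
    while i < b.len() {
        match b[i] {
            c @ b'0'..=b'9' => {
                mantissa = mantissa.checked_mul(10)?.checked_add(u64::from(c - b'0'))?;
                if seen_point {
                    scale -= 1;
                }
                digits += 1;
            }
            b'.' if !seen_point => seen_point = true,
            _ => break,
        }
        i += 1;
    }
    if digits == 0 {
        return None;
    }

    // Exponent: optional e/E/d/D, optional sign, then at least one digit.
    let mut exponent: i32 = 0;
    if i < b.len() {
        if matches!(b[i], b'e' | b'E' | b'd' | b'D') {
            i += 1;
        }
        let exp_negative = b.get(i) == Some(&b'-');
        if matches!(b.get(i), Some(&b'+') | Some(&b'-')) {
            i += 1;
        }
        if i == b.len() {
            return None;
        }
        for &c in &b[i..] {
            if !c.is_ascii_digit() {
                return None;
            }
            exponent = exponent.checked_mul(10)?.checked_add(i32::from(c - b'0'))?;
        }
        if exp_negative {
            exponent = -exponent;
        }
    }

    // Clamp so that absurd exponents become 0 or infinity instead of overflowing.
    let total = (i64::from(exponent) + i64::from(scale)).clamp(-400, 400) as i32;
    let value = if mantissa == 0 {
        0.0
    } else if total < 0 {
        mantissa as f64 / 10f64.powi(-total)
    } else {
        mantissa as f64 * 10f64.powi(total)
    };
    Some(if negative { -value } else { value })
}
```

Dividing by a positive power of ten for negative exponents, rather than multiplying by a negative power, tends to keep common short inputs closer to the correctly rounded result, but this remains an approximation and is no substitute for the standard library when exact results matter.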

If you would rather have exact parsing, your C++ code is insufficient. A pre-parse (e.g. with Regex) is likely the easiest way to do this, but alternatively you can rip out the code from the standard library and modify that.
