
PySpark, read multiline file (.sdf)

What is the most efficient way to read a collection of sdf files? sdf is a chemical table file format that contains both 3D information about a molecule and properties of that molecule. All of this information is stored in a multiline (gzipped) ASCII file. What I am struggling with is defining a custom file reader function that can interpret the custom subsection of each molecular entry. At this point I doubt whether this is even the right approach.

<Molecular-ID>
  -OEChem-10272110393D
 Schrodinger Suite 2021-1.
 32 34  0     0  0  0  0  0  0999 V2000
   31.1383   33.3647   21.1400 C   0  0  0  0  0  0  0  0  0  0  0  0
   30.7977   33.9390   19.9173 C   0  0  0  0  0  0  0  0  0  0  0  0
....
M  END
> <ShapeTanimoto>
0.6969

> <ColorTanimoto>
0.7854

> <TanimotoCombo>
1.7854

$$$$

In my opinion the most 'efficient' way is to use someone else's code: an existing library.

The CDK can read SDF files, and collections thereof. https://cdk.github.io/

The Rosetta Wiki gives examples of calling the CDK from Python. https://ctr.fandom.com/wiki/Chemistry_Toolkit_Rosetta_Wiki
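
If pulling in a toolkit is not an option, the record layout shown in the question (a molblock ending in `M  END`, then `> <Name>` data items, then a `$$$$` terminator) can also be split by hand. The following is only a minimal sketch under that assumption, not a full SDF parser; the helper names `split_sdf` and `parse_sdf_record` are made up for illustration and do not come from any library.

```python
import re

# Data-item header such as "> <ShapeTanimoto>"; captures the field name.
_FIELD_HEADER = re.compile(r"^>.*?<([^>]+)>")


def split_sdf(text):
    """Yield each record of an SDF text; records end with a '$$$$' line."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(record)
            record = []
        else:
            record.append(line)
    # Keep a trailing record only if it has content (missing terminator).
    if any(line.strip() for line in record):
        yield "\n".join(record)


def parse_sdf_record(record):
    """Return the title, raw molblock and data fields of one SDF record."""
    lines = record.splitlines()
    end = next((i for i, l in enumerate(lines) if l.rstrip() == "M  END"), None)
    if end is None:
        raise ValueError("record has no 'M  END' line")

    fields = {}
    name = None
    for line in lines[end + 1:]:
        match = _FIELD_HEADER.match(line)
        if match:
            # Start collecting the value lines of a new data item.
            name = match.group(1)
            fields[name] = []
        elif not line.strip():
            # A blank line closes the current data item.
            name = None
        elif name is not None:
            fields[name].append(line)

    return {
        "title": lines[0].strip(),
        "molblock": "\n".join(lines[: end + 1]),
        "fields": {key: "\n".join(value) for key, value in fields.items()},
    }
```

In PySpark, one way to apply this (untested here, and assuming each file fits in executor memory) would be `sc.wholeTextFiles(path).flatMap(lambda kv: split_sdf(kv[1])).map(parse_sdf_record)`. Note that gzipped files are not splittable, so each `.gz` file is processed by a single task regardless of the reader used.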

