
Need help parsing a file in Java

I am currently doing a small data structures project, and I am trying to get data on universities across the country and then do some data manipulation with it. I have found this data here: http://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data

BUT, the problem with this data is (and I quote from the website): "It is a LISP readable file with a few relevant functions at the end of the data file." I plan on taking this data and saving it as a .txt file.

The file looks a bit like:

(def-instance Adelphi
      (state newyork)
      (control private)
      (no-of-students thous:5-10)
      (male:female ratio:30:70)
      (student:faculty ratio:15:1)
      (sat verbal 500)
      (sat math 475)
      (expenses thous$:7-10)
      (percent-financial-aid 60)
      (no-applicants thous:4-7)
      (percent-admittance 70)
      (percent-enrolled 40)
      (academics scale:1-5 2)
      (social scale:1-5 2)
      (quality-of-life scale:1-5 2)
      (academic-emphasis business-administration)
      (academic-emphasis biology))
(def-instance Arizona-State
      (state arizona)
      (control state)
      (no-of-students thous:20+)
      (male:female ratio:50:50)
      (student:faculty ratio:20:1)
      (sat verbal 450)
      (sat math 500)
      (expenses thous$:4-7)
      (percent-financial-aid 50)
      (no-applicants thous:17+)
      (percent-admittance 80)
      (percent-enrolled 60)
      (academics scale:1-5 3)
      (social scale:1-5 4)
      (quality-of-life scale:1-5 5)
      (academic-emphasis business-education)
      (academic-emphasis engineering)
      (academic-emphasis accounting)
      (academic-emphasis fine-arts))

      ......

The End Of the File:

(dfx def-instance (l)
  (tlet (instance (car l) f-list (cdr l))
    (cond ((or (null instance) (consp instance))
           (msg t instance " is not a valid instance name (must be an atom)"))
          (t (make:event instance)
             (push instance !instances)
             (:= (get instance 'features)
                 (tfor (f in f-list)
                   (when (cond ((or (atom f) (null (cdr f)))
                                (msg t f " is not a valid feature "
                                       "(must be a 2 or 3 item list)") nil)
                               ((consp (car f))
                                (msg t (car f) " is not a valid feature "
                                     "name (must be an atom)") nil)
                               ((and (cddr f) (consp (cadr f)))
                                (msg t (cadr f) " is not a valid feature "
                                     "role (must be an atom)") nil)
                               (t t)))
                   (save (cond ((equal (length f) 3)
                                (make:feature (car f) (cadr f) (caddr f)))
                               (t (make:feature (car f) 'value (cadr f)))))))
             instance))))

(set-if !instances nil)



(dex run-uniq-colleges (l n)
  (tfor (sc in l)
    (when (cond ((ge (length *events-added*) n))
                ((not (get sc 'duplicate))
                 (run-instance sc)
~                 (remprop sc 'features)
                 nil)
                (t (remprop sc 'features) nil)))
    (stop)))

The data I am mostly interested in is the number of students, the academic emphases, and the school name.

Any help is greatly appreciated.

You can work on/use a Lisp file parser, or you can ignore the language it's written in and focus on the data. You mentioned you need:

  • School name
  • Number of students
  • Academic emphases

You can grep for the relevant keywords (def-instance, no-of-students, academic-emphasis), which would leave you with (based on your example):

(def-instance Adelphi
      (no-of-students thous:5-10)
      (academic-emphasis business-administration)
      (academic-emphasis biology))
(def-instance Arizona-State
      (no-of-students thous:20+)
      (academic-emphasis business-education)
      (academic-emphasis engineering)
      (academic-emphasis accounting)
      (academic-emphasis fine-arts))

This simplifies writing a specific parser: def-instance is followed by the name, and every academic-emphasis and no-of-students entry before the next def-instance refers to that name.
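A minimal sketch of such a keyword-driven parser in Java follows. The class names University and UniversityScanner are made up for illustration, and the code assumes that each def-instance, no-of-students, and academic-emphasis entry sits on its own line, as in the excerpt from the question. Values are kept as raw strings (for example thous:5-10); interpreting them is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the three fields of interest.
final class University {
    final String name;
    String noOfStudents;                              // raw token, e.g. "thous:5-10"
    final List<String> emphases = new ArrayList<>();

    University(String name) {
        this.name = name;
    }
}

// Hypothetical line-based scanner; assumes one entry per line.
final class UniversityScanner {

    /** Keeps only def-instance, no-of-students and academic-emphasis lines. */
    static List<University> scan(List<String> lines) {
        List<University> result = new ArrayList<>();
        University current = null;
        for (String line : lines) {
            // Drop parentheses, then split on whitespace:
            // "(no-of-students thous:5-10)" -> [no-of-students, thous:5-10]
            String[] parts = line.replace('(', ' ').replace(')', ' ').trim().split("\\s+");
            if (parts.length < 2) {
                continue;                             // blank line or a single token
            }
            switch (parts[0]) {
                case "def-instance":
                    current = new University(parts[1]);
                    result.add(current);
                    break;
                case "no-of-students":
                    if (current != null) current.noOfStudents = parts[1];
                    break;
                case "academic-emphasis":
                    if (current != null) current.emphases.add(parts[1]);
                    break;
                default:
                    break;                            // any other line is ignored
            }
        }
        return result;
    }
}
```

The lines could come from something like Files.readAllLines(Paths.get("university.txt")), where the file name is only an example. Because only lines that start with one of the three keywords are used, the Lisp functions at the end of the file shown in the question are skipped without any special handling.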

Have you thought about running that Lisp file in a Lisp interpreter for the Java VM?

As an example, Armed Bear Common Lisp, which is compatible with JSR-223, would happily parse your file.

And using JSR-223, you'll be able to access script-defined variables (like Adelphi and the other ones), as the examples show.

EDIT: From a comment request, some complementary explanations (although it seems quite straightforward to me).

So, suppose you have Armed Bear Common Lisp in your classpath, and file is the absolute file name of your script (this example is heavily inspired by/borrowed from the JSR-223 example).

First, install the script engine:

ScriptEngineManager scriptManager = new ScriptEngineManager();
scriptManager.registerEngineExtension("lisp", new AbclScriptEngineFactory());
ScriptEngine lispEngine = scriptManager.getEngineByExtension("lisp");

Then, load your script in the script engine:

Object eval = lispEngine.eval(new FileReader(file));

Now, armed with a little debugger, go see what's in there (I'm not courageous enough to install the whole environment to do the job for you).

If you're going to parse Lisp, you need to be aware of 'the stack'.

When you encounter a (, you push onto the stack. You're now in a new scope, one level above where you were before.

Similarly, when you encounter a ), you pop off the stack: finish that layer and go down a level.

So in this case, you start in the empty state. The first thing you encounter is the (, so now you're in the "define" state. (I just made that up. Call it whatever you want.) You encounter the def-instance token, and then the name of the university. You keep reading and you encounter another (. (Ignore whitespace, just parse tokens.) This puts you in the "properties" state. (I made that up too.) Since you're jumping from define to properties, it's okay to make your object now. Something like UnivData data = new UnivData(parsedToken) (where parsedToken evaluates to "Adelphi").

Okay, back to properties: you've read that first (, then you read "state" and "newyork", and then a ). So, you can assign the state variable of the current UnivData to newyork.

You repeat this behavior for all the properties, but then you encounter an additional ) after the last academic-emphasis. This is your cue to close the current object and start looking for another one.

At first, I was tempted to say use a Map. The fact that there are multiple academic-emphasis tokens indicates you should use a better data structure, perhaps a Map<String, List<String>>. It may even be better to roll your own Property class that holds a String and, if it acquires multiple values, switches to a list of strings.
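The stack-based walk described above can be sketched roughly as follows. This is an illustration under assumptions, not a full Lisp reader: the class name DefInstanceParser and its methods are invented here, every property value is stored as a plain string, and a map from property name to a list of values stands in for the UnivData object so that repeated properties such as academic-emphasis keep all their values. Entries with the same instance name are merged.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser: names and structure are illustrative only.
final class DefInstanceParser {

    /** Splits the text into "(", ")" and atoms; a double-quoted string is one atom. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(' || c == ')') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '"') {
                int end = text.indexOf('"', i + 1);
                if (end < 0) end = text.length() - 1;   // unterminated string: take the rest
                tokens.add(text.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))
                        && text.charAt(i) != '(' && text.charAt(i) != ')') {
                    i++;
                }
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    /** Returns instance name -> (property name -> values), in file order. */
    static Map<String, Map<String, List<String>>> parse(String text) {
        Map<String, Map<String, List<String>>> result = new LinkedHashMap<>();
        Deque<List<String>> stack = new ArrayDeque<>();  // one entry per open "("
        for (String token : tokenize(text)) {
            if (token.equals("(")) {
                stack.push(new ArrayList<>());           // enter a new scope
            } else if (token.equals(")")) {
                if (stack.isEmpty()) continue;           // stray ")", ignore it
                List<String> closed = stack.pop();       // leave the scope
                if (stack.isEmpty()) {
                    // A top-level form ended; register it even if it had no properties.
                    if (isDefInstance(closed)) {
                        result.computeIfAbsent(closed.get(1), k -> new LinkedHashMap<>());
                    }
                } else if (stack.size() == 1 && isDefInstance(stack.peek()) && !closed.isEmpty()) {
                    // A property list directly inside a def-instance ended.
                    String value = String.join(" ", closed.subList(1, closed.size()));
                    result.computeIfAbsent(stack.peek().get(1), k -> new LinkedHashMap<>())
                          .computeIfAbsent(closed.get(0), k -> new ArrayList<>())
                          .add(value);
                }
            } else if (!stack.isEmpty()) {
                stack.peek().add(token);                 // atom inside the current scope
            }
        }
        return result;
    }

    private static boolean isDefInstance(List<String> form) {
        return form.size() >= 2 && form.get(0).equals("def-instance");
    }
}
```

With this layout, parse(text).get("Adelphi").get("academic-emphasis") would hold both emphases from the excerpt in the question. Top-level forms whose first atom is not def-instance, such as the function definitions at the end of the file, are read and discarded.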
