
Understand the word sense disambiguation data set format

I am trying to evaluate a WSD model on well-known WSD data sets (SemEval, Senseval), but I don't understand the format of the gold key text file.

senseval3.gold.key.txt

d000.s000.t000 man%1:18:00::
d000.s000.t001 say%2:32:01::
d000.s001.t000 peer%2:39:00::
d000.s001.t001 companion%1:18:00::
d000.s001.t002 bleary%5:00:00:indistinct:00
d000.s001.t003 eye%1:08:00::
d000.s002.t000 have%2:40:00::
d000.s002.t001 ready%5:00:01:available:00
d000.s002.t002 answer%1:04:00::
d000.s002.t003 much%3:00:00::
d000.s002.t004 surprise%1:12:00::
d000.s002.t005 fit%1:26:00::
d000.s002.t006 coughing%1:26:00::
d000.s003.t000 man%1:18:00::
d000.s003.t001 drunk%3:00:00::
d000.s003.t002 crazy%5:00:00:insane:00
d000.s004.t000 newfound%5:00:00:new:00

From looking at the data file, I know that d000.s000.t000 in the first line refers to document #0, sentence #0, token #0.

senseval3.data.xml

<sentence id="d000.s000">
    <wf lemma="that" pos="DET">That</wf>
    <wf lemma="&apos;" pos="VERB">&apos;s</wf>
    <wf lemma="what" pos="PRON">what</wf>
    <wf lemma="the" pos="DET">the</wf>
    <instance id="d000.s000.t000" lemma="man" pos="NOUN">man</instance>
    <wf lemma="have" pos="VERB">had</wf>
    <instance id="d000.s000.t001" lemma="say" pos="VERB">said</instance>
    <wf lemma="." pos=".">.</wf>
</sentence>

But I don't know what the part after % means, for example 1:18:00:: for the lemma man.

This answer is based on a comment on this SO post.

Each entry such as man%1:18:00:: is a WordNet sense key of the form lemma%lex_sense. The sequence after % is the lex_sense field (not a "lex_index"), and it is composed as follows.

ss_type:lex_filenum:lex_id:head_word:head_id

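For man%1:18:00::, ss_type 1 means noun (2 is verb, 3 adjective, 4 adverb, 5 adjective satellite), lex_filenum 18 is the lexicographer file noun.person, and lex_id 00 distinguishes this sense within that file. head_word and head_id are only filled in for adjective satellites, as in bleary%5:00:00:indistinct:00; otherwise they are left empty, which is why most keys end in ::.

More information is in the WordNet documentation.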
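To make the field layout concrete, here is a minimal Python sketch that splits a gold key line into its instance id and parsed sense keys. The names `SenseKey`, `parse_sense_key` and `parse_gold_line` are made up for this illustration and are not part of WordNet or any evaluation toolkit. It also assumes a line is an instance id followed by one or more whitespace-separated sense keys.

```python
from typing import List, NamedTuple, Optional, Tuple

# ss_type codes as described in the WordNet sense key documentation.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    """Hypothetical container for the parts of a sense key (lemma%lex_sense)."""
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: Optional[str]  # only set for adjective satellites
    head_id: Optional[int]    # only set for adjective satellites


def parse_sense_key(key: str) -> SenseKey:
    # Split "man%1:18:00::" into the lemma and the lex_sense part.
    lemma, lex_sense = key.split("%", 1)
    # lex_sense always has five colon-separated fields; the last two may be empty.
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(
        lemma,
        int(ss_type),
        int(lex_filenum),
        int(lex_id),
        head_word or None,
        int(head_id) if head_id else None,
    )


def parse_gold_line(line: str) -> Tuple[str, List[SenseKey]]:
    # Assumption: first token is the instance id, the rest are sense keys.
    instance_id, *keys = line.split()
    return instance_id, [parse_sense_key(k) for k in keys]


if __name__ == "__main__":
    inst, senses = parse_gold_line("d000.s000.t000 man%1:18:00::")
    print(inst, senses[0].lemma, SS_TYPES[senses[0].ss_type], senses[0].lex_filenum)
    # prints: d000.s000.t000 man noun 18
```

The instance id returned here is the same string as the `id` attribute of the `<instance>` elements in senseval3.data.xml, so it can be used to join the gold senses to the tokens.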
