Data and Tools
Overview
For this task, there will be three data sets: training, development, and test. On November 4, 2013, a sample (of some 190 dependency graphs) from the training data was published as trial data, demonstrating key characteristics of the task. Since December 13, some 750,000 tokens of annotated text have been available as training data; please subscribe to the task mailing list for access information. Participants are free to use the training and development data in system development as they see fit, i.e. splitting off part of the data released in mid-December as a development set is no more than a suggested best practice; in particular, it will be legitimate to train the final system, for submission to evaluation once the test data is released, on both the training and development parts.
Data Format
All data provided for this task will be in a format similar to the one used at the 2009 Shared Task of the Conference on Computational Natural Language Learning (CoNLL), though with some simplifications. In a nutshell, our files are pre-tokenized, with one token per line. All sentences are terminated by an empty line (i.e. two consecutive newlines, including after the last sentence in each file). Each line comprises at least six tab-separated fields, i.e. annotations on the current token.
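To make the low-level file structure concrete, the following minimal sketch (Python, purely illustrative and not part of any official task tooling; the function name is our own) iterates over the sentence blocks of such a file:

```python
def read_blocks(path):
    """Yield one list of (non-empty) lines per sentence, splitting on blank lines."""
    block = []
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.rstrip("\n")
            if line:
                block.append(line)
            elif block:
                yield block
                block = []
    if block:  # tolerate a file without a trailing blank line
        yield block
```

Each yielded line can then be split into its fields with `line.split("\t")`.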
For ease of reference, each sentence is prefixed by a line that is not in tab-separated form and starts with the character # (ASCII number sign; U+0023), followed by a unique eight-digit identifier. Our sentence identifiers use the scheme 2SSDDIII, with a constant leading 2, two-digit section code, two-digit document code (within each section), and three-digit item number (within each document). For example, identifier 20200002 denotes the second sentence in the first file of Section 02, the classic Ms. Haag plays Elianti.
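As a worked example of the identifier scheme, a throwaway helper (hypothetical, not part of the task tooling) might decode such identifiers as follows:

```python
def decode_sentence_id(identifier):
    """Split a 2SSDDIII identifier into (section, document, item)."""
    assert len(identifier) == 8 and identifier.startswith("2")
    return identifier[1:3], identifier[3:5], identifier[5:8]

print(decode_sentence_id("20200002"))
# ('02', '00', '002'): Section 02, first document (00), second sentence (002)
```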
With one exception, our fields (i.e. columns in the tab-separated matrix) are a subset of the CoNLL 2009 inventory: (1) id, (2) form, (3) lemma, and (4) pos characterize the current token, with token identifiers starting from 1 within each sentence. In the closed track of our task, there is no explicit syntactic analysis beyond this lemma and part-of-speech information. Across the three annotation formats in the task, fields (1) and (2) are aligned and uniform, i.e. all formats annotate exactly the same sentences. Fields (3) and (4), on the other hand, are format-specific, i.e. there are different conventions for lemmatization, and part-of-speech assignments can vary (though all formats use the same PTB inventory of PoS tags).
The bi-lexical semantic dependency graph over tokens is represented by two or more columns, starting with the obligatory fields (5) top and (6) pred. Both fields are binary-valued, i.e. possible values are ‘+’ (ASCII plus; U+002B) and ‘-’ (ASCII minus; U+002D). A positive value in the top column indicates that the node corresponding to this token is either a (semantic) head or a (structural) root of the graph; the exact linguistic interpretation of this property differs across our three formats, but note that top nodes can have incoming dependency edges. The pred column is a simplification of the corresponding field in earlier CoNLL tasks, indicating whether or not this token represents a predicate, i.e. a node with outgoing dependency edges. With these minor differences from the CoNLL tradition, our format can represent general, directed graphs with designated top nodes. For example, there can be singleton nodes not connected to other parts of the graph (representing semantically vacuous tokens). In principle, there can be multiple tops, or a non-predicate top node, although in our actual task data we anticipate that there will typically be one unique top.
To designate predicate–argument relations, there are as many additional columns as there are predicates in the graph (i.e. the number of tokens marked ‘+’ in the pred column); we will call these additional columns (7) arg1, (8) arg2, etc. These columns contain argument roles relative to the i-th predicate, i.e. a non-empty value in column arg1 indicates that the current token is an argument of the (linearly) first predicate in the sentence. In this format, graph reentrancies lead to one token receiving argument roles for multiple predicates (i.e. non-empty argi values in the same row). By convention, empty values are represented as ‘_’ (ASCII underscore; U+005F), indicating that there is no argument relation between the current token and the i-th predicate. Thus, all tokens of the same sentence must always have all argument columns filled in, even for non-predicate tokens; in other words, all lines making up one block of tokens will have the same number n of fields, but n can differ across sentences, depending on the number of predicates in each sentence.
Following is an example for the sentence Ms. Haag plays Elianti.
| id | form | lemma | pos | top | pred | arg1 | arg2 |
|----|------|-------|-----|-----|------|------|------|
| #20200002 | | | | | | | |
| 1 | Ms. | Ms. | NNP | - | + | _ | _ |
| 2 | Haag | Haag | NNP | - | - | compound | ARG1 |
| 3 | plays | play | VBZ | + | + | _ | _ |
| 4 | Elianti | Elianti | NNP | - | - | _ | ARG2 |
| 5 | . | . | . | - | - | _ | _ |
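To spell out the column semantics, the following sketch (illustrative Python over the example rows; variable names are our own, not part of the task tooling) recovers the top nodes, predicates, and labeled edges, where column argi holds the roles of the i-th predicate in linear order:

```python
rows = [
    ["1", "Ms.",     "Ms.",     "NNP", "-", "+", "_",        "_"],
    ["2", "Haag",    "Haag",    "NNP", "-", "-", "compound", "ARG1"],
    ["3", "plays",   "play",    "VBZ", "+", "+", "_",        "_"],
    ["4", "Elianti", "Elianti", "NNP", "-", "-", "_",        "ARG2"],
    ["5", ".",       ".",       ".",   "-", "-", "_",        "_"],
]

tops = [int(r[0]) for r in rows if r[4] == "+"]        # [3], i.e. 'plays'
predicates = [int(r[0]) for r in rows if r[5] == "+"]  # [1, 3], in linear order

# column index 6 + i holds the roles of the (i+1)-th predicate
edges = [(predicates[i], int(r[0]), r[6 + i])
         for r in rows
         for i in range(len(predicates))
         if r[6 + i] != "_"]
# [(1, 2, 'compound'), (3, 2, 'ARG1'), (3, 4, 'ARG2')]
```

Note how token 2 (Haag) illustrates a reentrancy: it is an argument of both predicates, hence the non-empty values in both argument columns of its row.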
In the training and development data, all columns are provided. In the test data, only columns (1) to (4) are pre-filled. Participating systems will be asked to add columns (5) and upwards, and to submit their results for scoring.
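Conversely, a participating system has to serialize its predictions back into this column layout. Here is a minimal sketch of that step (the helper and its parameter names are our own, not the official scorer's interface), assuming the system predicts a set of top nodes and a list of labeled edges over token identifiers:

```python
def to_columns(rows, tops, edges):
    """Extend pre-filled rows (columns 1-4) with top, pred, and argument columns.

    rows:  token rows (id, form, lemma, pos) as provided in the test data;
    tops:  token ids predicted as top nodes;
    edges: (predicate, argument, role) triples over token ids.
    """
    predicates = sorted({predicate for predicate, _, _ in edges})
    role = {(predicate, argument): label for predicate, argument, label in edges}
    lines = []
    for r in rows:
        node = int(r[0])
        fields = list(r[:4])
        fields.append("+" if node in tops else "-")
        fields.append("+" if node in predicates else "-")
        fields.extend(role.get((p, node), "_") for p in predicates)
        lines.append("\t".join(fields))
    return lines
```

Joining the returned lines with newlines and terminating each sentence block with an empty line reproduces the file format described above.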
Companion Data
In the open track of the task (see the evaluation rules for details), we expect participants to draw on additional tools or resources beyond the training data provided, notably syntactic parsers. To aid participation in the open track, and for potentially increased comparability of results, we will make available a set of ‘companion’ data files providing syntactic analyses from state-of-the-art data-driven parsers. Once the task enters its evaluation phase, the same range and format of syntactic analyses will be provided as companion files for the test data.
We are still discussing exactly how many such syntactic views on our data to prepare, but we plan on providing at least one dependency and one phrase-structure view, i.e. (a) analyses from the parser of Bohnet & Nivre (2012), with bi-lexical syntactic dependencies in the so-called Stanford Basic scheme (de Marneffe et al., 2006), and (b) PTB-style constituent trees as produced, for example, by the parsers of Charniak & Johnson (2005) and Petrov & Klein (2007). Our companion data will be distributed in a token-oriented, tab-separated form (very similar to formats used at previous CoNLL Shared Tasks on data-driven dependency parsing and semantic role labeling), aligned at the sentence and token levels with our official training and test data files; it can thus be viewed as augmenting these files with additional columns of explicit syntactic information.
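The exact companion layout is still being finalized, but given the promised sentence- and token-level alignment, combining the two views could be as simple as zipping the corresponding blocks. The following is a sketch under that (speculative) assumption; all names are our own:

```python
def merge_blocks(task_block, companion_block):
    """Append companion columns to the task rows of one aligned sentence.

    Assumes (speculatively) that both blocks share the '#'-prefixed
    identifier line and list the same tokens in the same order, with the
    companion file repeating the id and form columns before its own.
    """
    merged = [task_block[0]]
    for task_line, companion_line in zip(task_block[1:], companion_block[1:]):
        t, c = task_line.split("\t"), companion_line.split("\t")
        assert t[0] == c[0] and t[1] == c[1]  # same token id and form
        merged.append("\t".join(t + c[2:]))   # keep only the extra columns
    return merged
```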
Licensing and Distribution
Large parts of the data prepared for this task are derivative of the PTB and other resources distributed by the Linguistic Data Consortium (LDC). We have established an agreement with the LDC that will make it possible for all task participants to obtain our training, development, and test data free of charge (for use in connection with SemEval 2014), whether they are LDC members or not. Participants will need to enter into a license agreement with the LDC (to be provided here in late November 2013) and will then be able to download the data. Please subscribe to the mailing list for this task for further information on obtaining the task data.
Trial Data
We have prepared the first 20 documents from Section 00 of the PTB WSJ Corpus, instantiating the file format described above and the three types of semantic dependencies used in this task. This trial data has been available for public download since Monday, November 4, 2013.
Software Support
We are currently putting together a supporting SDP toolkit: essentially a reference implementation (in Java) of reading and writing the task file format, some quantitative analysis of semantic dependency graphs, and the official task scorer.