Data and Tools
Overview
For this task, there will be three data sets: training, development, and test. On November 4, 2013, a sample (of some 190 dependency graphs) from the training data was published as trial data, demonstrating key characteristics of the task. Since December 13, some 750,000 tokens of annotated text have been available as training data; please subscribe to the task mailing list for access information. Participants are free to use the training and development data in system development as they see fit, i.e. splitting off part of the data released in mid-December as a development set is no more than a suggested best practice; in particular, it will be legitimate to train the final system, for submission to evaluation once the test data is released, on both the training and development parts.
Data Format
All data provided for this task will be in a format similar to the one used at the 2009 Shared Task of the Conference on Computational Natural Language Learning (CoNLL), though with some simplifications. In a nutshell, our files are pre-tokenized, with one token per line. All sentences are terminated by an empty line (i.e. two consecutive newlines, including after the last sentence in each file). Each line comprises at least six tab-separated fields, i.e. annotations on the current token.
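To make the low-level file structure concrete, the following minimal sketch (Python, purely illustrative and not part of any official task tooling; the function name is our own) iterates over the sentence blocks of such a file:

```python
def read_blocks(path):
    """Yield one list of (non-empty) lines per sentence, splitting on blank lines."""
    block = []
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.rstrip("\n")
            if line:
                block.append(line)
            elif block:
                yield block
                block = []
    if block:  # tolerate a file without a trailing blank line
        yield block
```

Each yielded line can then be split into its fields with `line.split("\t")`.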
For ease of reference, each sentence is prefixed by a line that is not in tab-separated form and starts with the character # (ASCII number sign; U+0023), followed by a unique eight-digit identifier. Our sentence identifiers use the scheme 2SSDDIII, with a constant leading 2, two-digit section code, two-digit document code (within each section), and three-digit item number (within each document). For example, identifier 20200002 denotes the second sentence in the first file of Section 02, the classic Ms. Haag plays Elianti.
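As a worked example of the identifier scheme, a throwaway helper (hypothetical, not part of the task tooling) might decode such identifiers as follows:

```python
def decode_sentence_id(identifier):
    """Split a 2SSDDIII identifier into (section, document, item)."""
    assert len(identifier) == 8 and identifier.startswith("2")
    return identifier[1:3], identifier[3:5], identifier[5:8]

print(decode_sentence_id("20200002"))
# ('02', '00', '002'): Section 02, first document (00), second sentence (002)
```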
With one exception, our fields (i.e. columns in the tab-separated matrix) are a subset of the CoNLL 2009 inventory: (1) id, (2) form, (3) lemma, and (4) pos characterize the current token, with token identifiers starting from 1 within each sentence. In the closed track of our task, there is no explicit syntactic analysis beyond this lemma and part-of-speech information. Across the three annotation formats in the task, fields (1) and (2) are aligned and uniform, i.e. all formats annotate exactly the same sentences. Fields (3) and (4), on the other hand, are format-specific, i.e. there are different conventions for lemmatization, and part-of-speech assignments can vary (though all formats use the same PTB inventory of PoS tags).
The bi-lexical semantic dependency graph over tokens is represented by two or more columns, starting with the obligatory fields (5) top and (6) pred. Both fields are binary-valued, i.e. possible values are ‘+’ (ASCII plus; U+002B) and ‘-’ (ASCII minus; U+002D). A positive value in the top column indicates that the node corresponding to this token is either a (semantic) head or a (structural) root of the graph; the exact linguistic interpretation of this property differs across our three formats, but note that top nodes can have incoming dependency edges. The pred column is a simplification of the corresponding field in earlier CoNLL tasks, indicating whether or not this token represents a predicate, i.e. a node with outgoing dependency edges. With these minor differences from the CoNLL tradition, our format can represent general, directed graphs with designated top nodes. For example, there can be singleton nodes not connected to other parts of the graph (representing semantically vacuous tokens). In principle, there can be multiple tops, or a non-predicate top node, although in our actual task data we anticipate that there will typically be one unique top.
To designate predicate–argument relations, there are as many additional columns as there are predicates in the graph (i.e. the number of tokens marked ‘+’ in the pred column); we will call these additional columns (7) arg1, (8) arg2, etc. These columns contain argument roles relative to the i-th predicate, i.e. a non-empty value in column arg1 indicates that the current token is an argument of the (linearly) first predicate in the sentence. In this format, graph reentrancies lead to one token receiving argument roles for multiple predicates (i.e. non-empty argi values in the same row). By convention, empty values are represented as ‘_’ (ASCII underscore; U+005F), indicating that there is no argument relation between the current token and the i-th predicate. Thus, all tokens of the same sentence must always have all argument columns filled in, even for non-predicate tokens; in other words, all lines making up one block of tokens will have the same number n of fields, but n can differ across sentences, depending on the number of predicates in each sentence.
Following is an example for the sentence Ms. Haag plays Elianti.
| id | form | lemma | pos | top | pred | arg1 | arg2 |
|----|------|-------|-----|-----|------|------|------|
| #20200002 | | | | | | | |
| 1 | Ms. | Ms. | NNP | - | + | _ | _ |
| 2 | Haag | Haag | NNP | - | - | compound | ARG1 |
| 3 | plays | play | VBZ | + | + | _ | _ |
| 4 | Elianti | Elianti | NNP | - | - | _ | ARG2 |
| 5 | . | . | . | - | - | _ | _ |
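To spell out the column semantics, the following sketch (illustrative Python over the example rows; variable names are our own, not part of the task tooling) recovers the top nodes, predicates, and labeled edges, where column argi holds the roles of the i-th predicate in linear order:

```python
rows = [
    ["1", "Ms.",     "Ms.",     "NNP", "-", "+", "_",        "_"],
    ["2", "Haag",    "Haag",    "NNP", "-", "-", "compound", "ARG1"],
    ["3", "plays",   "play",    "VBZ", "+", "+", "_",        "_"],
    ["4", "Elianti", "Elianti", "NNP", "-", "-", "_",        "ARG2"],
    ["5", ".",       ".",       ".",   "-", "-", "_",        "_"],
]

tops = [int(r[0]) for r in rows if r[4] == "+"]        # [3], i.e. 'plays'
predicates = [int(r[0]) for r in rows if r[5] == "+"]  # [1, 3], in linear order

# column index 6 + i holds the roles of the (i+1)-th predicate
edges = [(predicates[i], int(r[0]), r[6 + i])
         for r in rows
         for i in range(len(predicates))
         if r[6 + i] != "_"]
# [(1, 2, 'compound'), (3, 2, 'ARG1'), (3, 4, 'ARG2')]
```

Note how token 2 (Haag) illustrates a reentrancy: it is an argument of both predicates, hence the non-empty values in both argument columns of its row.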
In the training and development data, all columns are provided. In the test data, only columns (1) to (4) are pre-filled. Participating systems will be asked to add columns (5) and upwards, and to submit their results for scoring.
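Conversely, a participating system has to serialize its predictions back into this column layout. Here is a minimal sketch of that step (the helper and its parameter names are our own, not the official scorer's interface), assuming the system predicts a set of top nodes and a list of labeled edges over token identifiers:

```python
def to_columns(rows, tops, edges):
    """Extend pre-filled rows (columns 1-4) with top, pred, and argument columns.

    rows:  token rows (id, form, lemma, pos) as provided in the test data;
    tops:  token ids predicted as top nodes;
    edges: (predicate, argument, role) triples over token ids.
    """
    predicates = sorted({predicate for predicate, _, _ in edges})
    role = {(predicate, argument): label for predicate, argument, label in edges}
    lines = []
    for r in rows:
        node = int(r[0])
        fields = list(r[:4])
        fields.append("+" if node in tops else "-")
        fields.append("+" if node in predicates else "-")
        fields.extend(role.get((p, node), "_") for p in predicates)
        lines.append("\t".join(fields))
    return lines
```

Joining the returned lines with newlines and terminating each sentence block with an empty line reproduces the file format described above.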
Companion Data
In the open track of the task (see the evaluation rules for details), we expect participants to draw on additional tools or resources beyond the training data provided, notably syntactic parsers. To aid participation in the open track, and for potentially increased comparability of results, we will make available a set of ‘companion’ data files providing syntactic analyses from state-of-the-art data-driven parsers. Once the task enters its evaluation phase, the same range and format of syntactic analyses will be provided as companion files for the test data.
We are still discussing exactly how many such syntactic views on our data to prepare, but we plan on providing at least one dependency and one phrase-structure view, i.e. (a) analyses from the parser of Bohnet & Nivre (2012), with bi-lexical syntactic dependencies in the so-called Stanford Basic scheme (de Marneffe et al., 2006), and (b) PTB-style constituent trees as produced, for example, by the parsers of Charniak & Johnson (2005) and Petrov & Klein (2007). Our companion data will be distributed in a token-oriented, tab-separated form (very similar to formats used at previous CoNLL Shared Tasks on data-driven dependency parsing and semantic role labeling), aligned at the sentence and token levels with our official training and test data files; it can thus be viewed as augmenting these files with additional columns of explicit syntactic information.
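The exact companion layout is still being finalized, but given the promised sentence- and token-level alignment, combining the two views could be as simple as zipping the corresponding blocks. The following is a sketch under that (speculative) assumption; all names are our own:

```python
def merge_blocks(task_block, companion_block):
    """Append companion columns to the task rows of one aligned sentence.

    Assumes (speculatively) that both blocks share the '#'-prefixed
    identifier line and list the same tokens in the same order, with the
    companion file repeating the id and form columns before its own.
    """
    merged = [task_block[0]]
    for task_line, companion_line in zip(task_block[1:], companion_block[1:]):
        t, c = task_line.split("\t"), companion_line.split("\t")
        assert t[0] == c[0] and t[1] == c[1]  # same token id and form
        merged.append("\t".join(t + c[2:]))   # keep only the extra columns
    return merged
```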
Licensing and Distribution
Large parts of the data prepared for this task are derivative of the PTB and other resources distributed by the Linguistic Data Consortium (LDC). We have established an agreement with the LDC that will make it possible for all task participants to obtain our training, development, and test data free of charge (for use in connection with SemEval 2014), whether they are LDC members or not. Participants will need to enter into a license agreement with the LDC (to be provided here in late November 2013) and will then be able to download the data. Please subscribe to the mailing list for this task for further information on obtaining the task data.
Trial Data
We have prepared the first 20 documents from Section 00 of the PTB WSJ Corpus, instantiating the file format described above and the three types of semantic dependencies used in this task. This trial data has been available for public download since Monday, November 4, 2013.
Software Support
We are currently putting together a supporting SDP toolkit: essentially a reference implementation (in Java) of reading and writing the task file format, some quantitative analysis of semantic dependency graphs, and the official task scorer.