Evaluation
Evaluation Set-Up
Systems participating in the task will be evaluated based on the accuracy with which they can produce semantic dependency graphs for previously unseen text, measured relative to the gold-standard testing data. The key measures for this evaluation will be labeled and unlabeled precision and recall with respect to predicted dependencies (predicate–role–argument triples) and labeled and unlabeled exact match with respect to complete semantic dependency graphs. In both contexts, identification of the top node(s) of a graph will be treated as the identification of additional, ‘virtual’ triples. Below and in other task-related contexts, we will abbreviate these metrics as (a) labeled precision, recall, and F1: LP, LR, LF; (b) unlabeled precision, recall, and F1: UP, UR, UF; and (c) labeled and unlabeled exact match: LM, UM.
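As a rough illustration (not the official scorer), the following sketch computes labeled precision, recall, F1, and exact match over sets of dependency triples, with top nodes added as virtual triples; the graph representation assumed here (a set of (predicate, argument, label) edges plus a set of top-node ids) is a convenience of the sketch, not a prescribed data structure.

```python
# Illustrative sketch of the per-dependency metrics (not the official SDP scorer).
# A graph is assumed to be a dict with "edges": a set of (predicate, argument,
# label) triples, and "tops": a set of top-node ids.

def triples(graph, labeled=True):
    deps = {(p, a, l if labeled else None) for (p, a, l) in graph["edges"]}
    tops = {("TOP", t, None) for t in graph["tops"]}   # virtual top-node triples
    return deps | tops

def score(gold_graphs, pred_graphs, labeled=True):
    correct = predicted = gold = exact = 0
    for g, p in zip(gold_graphs, pred_graphs):
        gt, pt = triples(g, labeled), triples(p, labeled)
        correct += len(gt & pt)
        predicted += len(pt)
        gold += len(gt)
        exact += gt == pt                              # whole-graph exact match
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, exact / len(gold_graphs)
```

Calling score(…, labeled=False) yields the corresponding unlabeled figures (UP, UR, UF, UM).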
In addition to these metrics, which have already been implemented in SDP 2014 (the first incarnation of this task), we will define two additional metrics that aim to capture fragments of semantics that are ‘larger’ than individual dependencies but ‘smaller’ than the semantic dependency graph for the complete sentence, viz. what we call (a) complete predications and (b) semantic frames. In the SDP 2015 context, a complete predication comprises the set of all core arguments to one predicate, which for the DM and PAS target representations corresponds to all outgoing dependency edges, and for the PSD target representation to only those outgoing dependencies marked by an ‘-arg’ suffix on the edge label.
Pushing the units of evaluation one step further towards units of interpretation, a semantic frame comprises a complete predication combined with the sense (or frame) identifier of its predicate. Both complete-predication and semantic-frame evaluation will be restricted to predicates corresponding to major parts of speech (verbs, probably also nouns and adjectives, and possibly specialized phenomena like possessives), and semantic frames will be further restricted to those target representations and lexical categories for which sense information is available in our data (DM and PSD, with PSD senses limited to verbs). As with the per-dependency evaluation, we will score precision, recall, and F1, which we abbreviate as PP, PR, and PF for complete predications, and FP, FR, and FF for semantic frames.
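For concreteness, here is one way these larger evaluation units could be extracted from the same graph representation as above (again only illustrative: the ‘senses’ field name is an assumption of the sketch, and the part-of-speech restriction described above is omitted for brevity).

```python
# Illustrative extraction of evaluation units beyond single dependencies.
# For DM and PAS, all outgoing edges of a predicate count as core arguments;
# for PSD, only edges whose label carries the '-arg' suffix.

def complete_predications(graph, representation):
    args_by_pred = {}
    for (pred, arg, label) in graph["edges"]:
        if representation == "PSD" and not label.endswith("-arg"):
            continue
        args_by_pred.setdefault(pred, set()).add((arg, label))
    # one unit per predicate: the complete set of its core-argument dependencies
    return {(pred, frozenset(args)) for pred, args in args_by_pred.items()}

def semantic_frames(graph, representation):
    # a frame pairs a complete predication with its predicate's sense identifier
    senses = graph.get("senses", {})    # node id -> sense/frame id (assumed field)
    return {(pred, senses.get(pred), args)
            for pred, args in complete_predications(graph, representation)}
```

The resulting sets can then be scored with the same precision and recall machinery as the per-dependency metrics above.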
More practically speaking, as the task enters the evaluation period in mid-December, we will make available three copies of the test data, one for each target annotation, in the same token-oriented, tab-separated format as the training data, but with only columns (1) id, (2) form, (3) lemma, and (4) pos pre-filled. Participating teams are expected to fill in the remaining columns (i.e. the actual semantic dependency graphs) and submit the resulting files (one per target format) by December 20, 2014. Even though our three target representations annotate the exact same text (i.e. are sentence- and token-aligned), we provide three instances of the test data, as there may be variation in lemmatization and PoS assignment (unlike PAS and PSD, the DM annotations did not build on gold-standard PoS tags from the PTB). For the open and gold tracks (see below), we will further make available with the test data the same range of ‘companion’ syntactic analyses as are provided for the training data.
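By way of illustration, a minimal reader for the pre-filled test files might look as follows; it assumes one token per tab-separated line, blank lines between sentences, and sentence identifiers on lines starting with ‘#’ (please consult the data overview page for the authoritative format description).

```python
# Minimal reader for the pre-filled test files: one token per tab-separated line
# with columns id, form, lemma, and pos; sentences are separated by blank lines;
# lines starting with '#' are taken to carry sentence identifiers.

def read_sdp(path):
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.rstrip("\n")
            if not line:                       # blank line terminates a sentence
                if tokens:
                    sentences.append(tokens)
                    tokens = []
            elif line.startswith("#"):
                continue                       # sentence identifier line
            else:
                id_, form, lemma, pos = line.split("\t")[:4]
                tokens.append({"id": int(id_), "form": form,
                               "lemma": lemma, "pos": pos})
    if tokens:
        sentences.append(tokens)
    return sentences
```

Submissions would then append the remaining, predicted columns to each token line before writing the files back out.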
Sub-Tasks
For all three target formats, there will be three sub-tasks: a closed track, an open track, and a gold track. Systems participating in the closed track can only be trained on the gold-standard semantic dependencies distributed for the task. Systems participating in the open track may use additional resources, such as a syntactic parser. Test data for our task will draw on Section 21 of the WSJ Corpus, and therefore participants must make sure not to use any tools or resources that encompass knowledge of the gold-standard syntactic or semantic analyses of this section, i.e. are directly or indirectly trained on or otherwise derived from WSJ Section 21. Note that this restriction implies that off-the-shelf syntactic parsers may need to be re-trained, as many data-driven parsers for English include this section in their default training data. To simplify participation in the open track, in mid-August 2014 we will make available syntactic analyses from several state-of-the-art parsers (re-trained without use of WSJ Section 21) as optional ‘companion’ data files; please see the data overview page for details. Finally, the goal of the gold track is to more directly investigate the contribution of syntactic structure to the semantic dependency parsing problem. For submissions to this track, we will make available (by the end of August 2014) gold-standard syntactic analyses in a variety of formats, including those used natively by the annotation initiatives from which our semantic dependency graphs derive, viz. HPSG derivation trees reduced to bi-lexical dependencies (for DM and PAS) and Prague analytical trees (for PSD).
Multiple Runs
Each participating team will be allowed to submit up to two different runs of their system (for each target format and, where applicable, both the closed and open tracks). Separate runs could, for example, reflect different parameter settings or other relatively minor variation in the configuration of the system used to produce the submitted results. Where genuinely different approaches are pursued within one team, i.e. separate systems that build on different methods, it may be legitimate to split the team, i.e. have two separate ‘teams’ from one site. Please contact the organizers (at the email address indicated in the right column) if you feel your site might want to register as multiple teams.
Final Scoring
The ‘official’ ranking of participating systems, in both the closed and the open tracks, will be determined based on the arithmetic mean of the labeled dependency F1 scores (i.e. the harmonic mean of labeled precision and labeled recall) on the three target representations (DM, PAS, and PSD). Thus, to be considered for the final ranking, a system must submit semantic dependencies for all three target representations.
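In other words, with the labeled F1 on each target representation X computed as the harmonic mean of labeled precision and recall, the ranking score is simply the unweighted average over the three representations:

```latex
\mathrm{LF}_{X} = \frac{2\,\mathrm{LP}_{X}\,\mathrm{LR}_{X}}{\mathrm{LP}_{X} + \mathrm{LR}_{X}}
\qquad\qquad
\mathit{score} = \frac{\mathrm{LF}_{\mathrm{DM}} + \mathrm{LF}_{\mathrm{PAS}} + \mathrm{LF}_{\mathrm{PSD}}}{3}
```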
Software Support
Towards the end of August 2014, we will make available to participants the official scorer as part of the emerging SDP toolkit.
Baseline
As a common point of reference, the organizers have prepared a simple baseline system, building on techniques from data-driven syntactic dependency parsing. In a nutshell, we reduced the SDP graphs to trees. First, we eliminated re-entrancies in the graph by removing dependencies to nodes with multiple incoming edges, i.e. those that are the argument of more than one predicate. Of these edges, we kept only the dependency on the ‘closest’ predicate, as defined in terms of surface distance (with a preference for leftward predicates over rightward ones in case of ties by distance). Second, we trivially incorporated all singleton nodes into the tree by attaching nodes with neither incoming nor outgoing edges to the immediately following node, or to a virtual new root node (token ‘0’) in case a sentence-final node was a singleton; we labeled these synthesized dependencies ‘_null_’. Finally, we integrated all fragments into one tree by subordinating any remaining node without incoming edges to the root node, using a new dependency type called ‘_root_’.
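A short sketch of this reduction, assuming edges are given as (head, dependent, label) triples over token positions 1..n with 0 as the virtual root, might look as follows (illustrative only, not the organizers' implementation).

```python
# Illustrative reduction of an SDP graph to a dependency tree, following the
# three steps above.  Edges are (head, dependent, label) triples over token
# positions 1..n; 0 denotes the virtual root node.

def reduce_to_tree(edges, n):
    # 1. break re-entrancies: keep only the dependency on the closest predicate,
    #    preferring leftward predicates in case of ties by surface distance
    incoming = {}
    for head, dep, label in edges:
        incoming.setdefault(dep, []).append((head, label))
    tree = {}
    for dep, heads in incoming.items():
        head, label = min(heads, key=lambda h: (abs(h[0] - dep), h[0] > dep))
        tree[dep] = (head, label)
    # 2. attach singletons (no incoming or outgoing edges) to the following
    #    token, or to the virtual root if they are sentence-final
    has_outgoing = {head for head, _, _ in edges}
    for node in range(1, n + 1):
        if node not in tree and node not in has_outgoing:
            tree[node] = (node + 1 if node < n else 0, "_null_")
    # 3. subordinate any remaining node without incoming edges to the root
    for node in range(1, n + 1):
        if node not in tree:
            tree[node] = (0, "_root_")
    return tree   # dependent -> (head, label)
```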
Following our recommended split of the training data, we then trained the graph-based parser of Bohnet (2010) on Sections 00–19 of the (tree reduction of our) SDP data, and applied the resulting ‘syntactic’ parsing model to Section 20. The table below indicates parser performance for our three target formats, evaluated both (a) at the level of Labeled and Unlabeled Attachment Scores (LAS and UAS, respectively; as computed by MaltEval), and (b) in terms of our SDP graph metrics, where for the latter the synthesized dependencies and any dependencies on the virtual root node were suppressed. Note that this baseline makes no attempt at predicting top nodes, but in keeping with our ‘official’ metrics for this task, our figures for LP, LR, and LF include the virtual edges to top nodes.
| Format | LAS | UAS | LP | LR | LF | GF | TP |
|---|---|---|---|---|---|---|---|
| DM | 83.56 | 84.71 | 83.20 | 40.73 | 54.68 | 66.19 | 94.97 |
| PAS | 84.73 | 85.52 | 88.34 | 35.74 | 50.89 | 57.66 | 97.37 |
| PSD | 83.53 | 91.19 | 74.82 | 62.08 | 67.84 | 90.70 | 92.45 |
To put these results into perspective, the table above also includes two static measures of the ‘tree reductions’ of Section 20 of the SDP data, viz. the labeled F1 for the ‘gold’ trees scored as an SDP graph (GF), and their averaged per-token degree of tree projectivity (TP; again computed by MaltEval).
Acknowledgements
We are grateful to Zeljko Agic and Bernd Bohnet for advice in designing the baseline and ‘companion’ data, and for assistance in configuring MATE Tools and MaltEval, as well as to Milen Kouylekov for the development and support of the interactive search interface.