Date: March 12-13, 2009
Fostering Language Resources Network
Roberto Cencioni, Walther Lichem, Nicoletta Calzolari, Gerhard Budin
EC - DG Information Society & Media - Unit INFSO.E1 - LTs & MT, LUX / Head of Unit
Language Technology & MT unit: INFSO.E1
- new unit established in July 2008
- focus on multilingual technologies, services, apps
- two instruments in 2009:
- Research: FP7 ICT, call 4 (objective 2.2: language based interaction)
- Innovation: CIP ICT-PSP, call 3 (theme 5: multilingual web)
The Web is a collaborative framework, but there are significant language barriers. Europe has 23+ widely used languages -- not enough translators. The goal is a single European information space: bridge the language barriers in the information society.
Goals: support & enhance interpersonal & business communication, information access, and publishing.
- for everyone
- across languages
- emphasis on online environments
FLaReNet: an international forum to facilitate interaction, and...
- overcome fragmentation
- create a shared policy for the field of language resources & technologies
- foster a European strategy for consolidating the sector and enhancing competitiveness
Language resources & technologies:
- used by many different communities
- various dimensions: technical, organizational, economic, legal, political
Promote international cooperation
- E.g., w/ US SILT initiative
Session 1: Broadening the Coverage, Addressing the Gaps
Joseph Mariani (LIMSI/IMMI-CNRS, FR)
- Language coverage
- Topic coverage
Availability of language resources of sufficient quality & quantity is needed to develop language technologies.
LRs and LTs in one language:
- define the task
- Determine the needed LTs
- Determine the annotations of LRs and metrics & protocols for evaluation
- Find a way to support production of LRs and evaluation (incl financial, organizational)
- Produce LRs
- Develop and evaluate LTs
- Cycle back to improve
Try to fill in language resource matrices: one axis lists languages (incl. dialects); the other lists resources (lexicon, pronunciation lexicon, dictionary, treebank, etc.). Fill in the matrix cells with information about how much data is available.
Similarly, fill in a language technology matrix: tokenizer, named entity recognizer, chunker, POS tagger, parser, spell checker, summarizer, search engine, text-to-speech, ASR, etc. Fill in the matrix with the best available performance, etc.
A multilingual matrix shows how many parallel corpora, or machine translation systems, exist for various pairs of languages. Cells could hold the amount of data, system performance, etc.
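The coverage matrix idea above can be sketched as a simple nested mapping. The languages, resources, and cell contents here are illustrative placeholders, not real coverage data:

```python
# A minimal sketch of a language-resource coverage matrix.
# Cells hold whatever coverage information is known, or None for a gap.
LANGUAGES = ["English", "French", "Basque", "Swahili"]
RESOURCES = ["lexicon", "pronunciation lexicon", "treebank", "parallel corpus"]

matrix = {lang: {res: None for res in RESOURCES} for lang in LANGUAGES}

# Hypothetical example entries:
matrix["English"]["treebank"] = "Penn Treebank, ~1M words"
matrix["Swahili"]["lexicon"] = "small wordlist"

def gaps(matrix):
    """Return (language, resource) pairs with no known coverage."""
    return [(lang, res)
            for lang, row in matrix.items()
            for res, info in row.items()
            if info is None]
```

The empty cells then read off directly as the gap/roadmap agenda for each language.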
Coverage & BLARKs
Steven Krauwer (Universiteit Utrecht, NL) & Khalid Choukri (ELDA, FR)
BLARK = Basic Language Resource Kit
- Defines the minimum set of LRs necessary to do education or pre-competitive language and speech technology research for a language at all
- LR includes written, spoken, & multimodal materials, and modules and tools
- Can be used to measure the technological coverage of a language
- Can be used as an agenda for creating new resources
It's dynamic, because technology, requirements, and expectations evolve over time.
So the notion of "minimum set of resources" needs to be periodically re-evaluated
Used with a view to developing language and speech technology applications for a language.
- Entry level BLARKettes for languages with virtually no tech support, mainly aimed at training and education.
- Standard BLARK, serving education and pre-competitive research
- Extended BLARK, serving advanced research & commercial development
It's important that these collected tools be coherent, and interoperable.
Build on existing initiatives:
- Try to give an authoritative definition of what BLARK (currently) contains
- Analysis per language of where we stand
- Mechanisms for maintenance
BLARK as a tool for language resource coverage assessment, road-mapping, and language policy planning:
- Many LR pieces are missing.
- Some are available, but not at "fair conditions"
- What exists but is not available (traded vs non-traded)
- What exists but is not identified
- What is lost -- not archived
- What does not exist at all
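One way to operationalize the availability categories above is a small status inventory per language. The status labels and component names below are hypothetical illustrations, not an official BLARK taxonomy:

```python
# Hypothetical status labels mirroring the availability categories above.
STATUSES = [
    "available",     # exists, obtainable at fair conditions
    "restricted",    # exists, but not at fair conditions (traded vs non-traded)
    "unidentified",  # exists somewhere, but not catalogued
    "lost",          # once existed, but was not archived
    "missing",       # does not exist at all
]

def blark_report(components):
    """Count BLARK components per status; `components` maps name -> status."""
    report = {s: 0 for s in STATUSES}
    for status in components.values():
        report[status] += 1
    return report
```

Such a per-language report could feed the road-mapping and policy-planning uses listed above.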
What can we (language experts) do for BLARK?
- Define, specify, improve: See www.blark.org
- Help enhance the content of the universal catalogue
What can BLARK do for you?
- Help you w/ input for R&D
What can funding agencies do for BLARK?
- Make sure the info is accurate and reflects community plans/trends
- Used as a consistent planning/roadmapping inventory
- Used as an inventory of the state of the art
Universal catalogue of common metadata.
Interface w/ other communities: check who is producing, as a "side-product," essential LRs for HLT: publishers, broadcast companies, etc. Public data.
"Research Fair Act" a la European Union -- look this up.
Practical Considerations in Resource Creation Tied to Human Language Technology Development
Christopher Cieri (University of Pennsylvania - LDC, USA)
GALE: transcription, translation, and distillation
LCTL: language packs for IE and translation in 1-2 dozen languages (LCTL = less commonly taught languages)
LVDID: train & test material for SRE
(this was an interesting talk -- see slides, because I didn't take many notes.)
Tradeoffs, such as quality vs quantity, may change as the amount of data changes, as the quality of the tools changes, etc.
It's important to give away tools and specs -- allows good outsourcing
When a corpus is donated to LDC, distribution is never exclusively via LDC. Distribution via multiple sites is a good thing.
An African Perspective on Language Resources and Technologies
Justus Roux (University of Stellenbosch, S. Africa)
Interest in HLT development in Africa
- LR & LT
- Renewed linguistic & cultural awareness of indigenous languages
Which languages should resources be developed for?
- Colonial languages
- Indigenous languages
Role of national governments with respect to local languages
Role of the private sector?
- Companies appreciate the utility of supporting local languages -- competitive edge
Support development of LTs/LRs for "African" varieties of European languages
In Africa, speech-based systems are a priority: high illiteracy rates, limited internet penetration, and high penetration (~50%) of cell phone services -- as high as 95% in some countries.
Coverage of What? – Gaps in What? On De-globalising Human Language Resources
Dafydd Gibbon (Universität Bielefeld, DE)
We primarily represent...
- our own interests, and the interests of our groups
- but also the interests of larger political entities in which we live
But our competitive funding systems are:
- exclusive, in that they create temporary research islands
- rather than inclusive, creating sustainable research infrastructures
Cooperative instruments are also needed.
- The term "gap" implies future advancement of domains, methods, applications
- It also implies a failure in the past: a small, repairable omission, measured against some Platonic ideal of completeness
But our coverage is a drop in the ocean, so "gap" isn't really an appropriate term -- the gap is huge.
De-globalization: concentration on & respect for less mainstream languages.
- Knowledge of complex forms & functions of various languages
- The technology gap, and its special case the digital divide, is taken to be asymmetrical
Sponsorship of underprivileged colleagues/groups. Vicious circle: colleagues can't afford contact with competitive funding and publishing conventions, and therefore can't meet the standards.
What does a BLARKette cost? Meetings are the engine, but funding is necessary.
A Dynamic View of Comparable and Specialized Corpora
Pierre Zweigenbaum (LIMSI-CNRS, FR)
- A corpus is not a random heap of texts.
- A corpus is organized
- bitexts and parallel corpora. Useful for MT. Limited in number and variety.
- cf. "comparable corpora," which are similar along a set of dimensions (topics, genres, etc.). Much easier to get, but they're not direct translations of one another. They don't pre-exist: they need to be selected and paired.
- Range: parallel corpora, noisy parallel corpora, comparable corpora.
- Need a way to measure "comparability". Some measures have been proposed. Should be useful for evaluating, but also for generating comparable corpora.
- There's also comparable corpora within a single language
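As one illustration of what a comparability measure might look like, here is a crude sketch: cosine similarity between the term-frequency profiles of two document collections. This is a toy measure for illustration only, not one of the measures actually proposed in the talk:

```python
import math
from collections import Counter

def comparability(texts_a, texts_b):
    """Cosine similarity between term-frequency profiles of two
    document collections -- a crude, illustrative comparability score."""
    fa = Counter(w for t in texts_a for w in t.lower().split())
    fb = Counter(w for t in texts_b for w in t.lower().split())
    dot = sum(fa[w] * fb[w] for w in set(fa) & set(fb))
    norm = (math.sqrt(sum(v * v for v in fa.values()))
            * math.sqrt(sum(v * v for v in fb.values())))
    return dot / norm if norm else 0.0
```

A measure of this kind could score candidate document pairs when selecting and pairing texts to build a comparable corpus.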
Activities in specialized domains: sublanguage
- terms for terminology
- relations -> structured terminologies and ontologies
- specialized multilingual corpora for translation
Examples: MEDLINE, JRC Acquis (EU law)
Specifying a specialized corpus:
- intended audience
Multiple dimensions. Hard to foresee all needs. Can't design and build all the specialized corpora that we want in advance.
Evolution of knowledge: terminology evolves with time. Value of knowledge depends on currency -- specialized corpora generally need to evolve with time: need to be updated. Also need on-demand selection of subsets of the text, carefully selected according to needs.
Methods are needed to measure and characterize dimensions of specialization
Are any corpora not specialized?
Technology for Processing Non-verbal Information in Speech
Nick Campbell (Trinity College Dublin, IRL & NIST, JP)
Speech is action & interaction.
Current speech technology is founded on text.
There's a mismatch between the expectations of the systems and the performance of their users.
Talk in social interaction involves propositional context, but also other information channels
Systems that process human speech need to be able to interpret the underlying speech acts. It's not enough to know what the person is saying; we need to know what they're doing (in the context of the conversation). A lot of communication comes from nonverbal signals, incl. affective speech sounds such as laughs, feedback noises, grunts, etc. These constitute a small, finite set of highly variable sounds in which most of the information is carried by prosody and tone of voice.
- Adam Przepiórkowski (Polish Academy of Sciences - ICS, PL)
- Marko Tadić (University of Zagreb - FHSS - DL, HR)
- Kepa Sarasola Gabiola (University of the Basque Country - IXA Group, SP)
- Folkert de Vriend (Nederlandse Taalunie, NL-BE)
Session 2: Automatic and Innovative Means of Acquisition, Annotation, Indexing
Methods & models for building, validating, and maintaining raw & annotated LRs
Questions/parameters: required volume, coverage, manual vs automatic vs semiautomatic, standards/formats, language (in)dependence, performance, cost, time
Primary language data are abundant on the web -- for many languages, and for an increasing number of languages.
Web data contains "ill-formed" text communication. Also, images, videos, etc
Some of the data on the web are basically annotations already -- e.g., summaries, transcripts/subtitles, image captions, opinions, etc.
For this session:
- current methods & scale of use
- suitability, success stories, best practices
- missing, new resources; future targets; priorities
- manual vs automatic
- data volume vs quality tradeoff
- well formed vs "ill-formed" language
- LR needs of areas related to language (multimedia, cognitive resources, etc)
- social networking
Rich Annotations of Text and Community Annotation
Jun'ichi Tsujii (University of Manchester - NacTeM, UK)
MEDLINE: 2000 abstracts, ~500k words
Linguistic annotation: tokenization, pos tagging, parse tree, dependency, deep syntax, coref
Ontology-based annotation: NER, RR, Event Recognizer, Pathway Constructor
Community annotation: have biologists annotate pathways.
LT applications vs LRs
Dialogue corpora remain a problem
Yorick Wilks (University of Sheffield, UK)
Trends in Language Resources and New Work in ASR Data Labeling
Gary Strong (Johns Hopkins University - HLT Center of Excellence, USA)
- Hand annotation to Semi-supervised
- Corpus-based annotation to non-stationary stream processing, and adaptation to domain/genre changes.
- Moving from small datasets to effectively infinite streams
Going for a Hunt? Don’t Forget the Bullets!
Dan Ioan Tufis (RACAI, RO)
Misconceptions about language resources: ML can do everything with lots of raw data; human expertise is less and less needed.
- partially true, but with accurate annotations the data hunger is much lower and the quality of services is increased
- minimal set of pre-processing steps (BLARK-like resources)
The quality of LRs comes from the accuracy of linguistic annotations
- BLARK-like resources & tools
- Better scenario: start with very clean annotated data, and bootstrap. But make sure the expert is in the development chain, to validate and correct annotations.
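That "better scenario" can be sketched as a bootstrap loop. Here `train`, `annotate`, and `expert_validate` are hypothetical stand-ins for whatever model and expert-validation workflow is actually used:

```python
# A schematic bootstrap: start from a small, clean, manually annotated seed,
# auto-annotate raw batches, and keep only expert-validated annotations,
# so the expert stays in the development chain throughout.
def bootstrap(seed, raw_batches, train, annotate, expert_validate):
    corpus = list(seed)                           # clean annotated seed data
    for batch in raw_batches:
        model = train(corpus)                     # retrain on validated data
        proposed = [annotate(model, item) for item in batch]
        corpus.extend(expert_validate(proposed))  # expert corrects/filters
    return corpus
```

The point of the loop structure is that every automatic annotation passes through the expert before it can influence the next model.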
Automatic Lexical Acquisition - Bridging Research and Practice
Anna Korhonen (University of Cambridge, UK)
The Democratisation of Language Resources
Gregory Grefenstette (Exalead, FR)
Advocates building a simple tabular lexicon for each language. Incl. word form, root, pos tag, freq, simple translation.
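A lexicon in the simple tabular form Grefenstette advocates could be a TSV file loaded into a dictionary. The entries and field names below are made-up examples, not data from the talk:

```python
import csv
import io

# Hypothetical tab-separated lexicon: word form, root (lemma), POS tag,
# frequency, and a simple translation.
LEXICON_TSV = """\
chats\tchat\tNOUN\t1520\tcats
chanté\tchanter\tVERB\t310\tsung
"""

def load_lexicon(tsv):
    """Load a simple tabular lexicon, keyed by surface word form."""
    lexicon = {}
    for form, root, pos, freq, gloss in csv.reader(io.StringIO(tsv),
                                                   delimiter="\t"):
        lexicon[form] = {"root": root, "pos": pos,
                         "freq": int(freq), "translation": gloss}
    return lexicon
```

The appeal of the format is exactly this: one flat table per language, trivially parsed, with enough fields to bootstrap basic processing.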
Web3.0 and Language Resources
Marta Sabou (Open University, UK)
Exploiting Crowdsourced Language Resources for Natural Language Processing: 'Wikabularies' and the Like
Iryna Gurevych (Technische Universität Darmstadt - UKP Lab, DE)
- Kiril Simov (Bulgarian Academy of Sciences - IPP - LML, BG)
- Sophia Ananiadou (University of Manchester - NacTeM, UK)
- Guy De Pauw (Universiteit Antwerpen, BE)
(at this point, I was fairly tired from jet lag, and stopped taking notes -- see the webpage for the position papers & slides for session 3)
S4 - Interoperability and Standards
"SILT: Towards Sustainable Interoperability for Language Technology"
James Pustejovsky (Brandeis University - DCS, USA) & Nancy Ide (Vassar College - DCS, USA)
"Interoperability, Standards and Open Advancement"
Eric Nyberg (Carnegie Mellon University, USA)
"Is the LRT Field Mature Enough for Standards?"
Peter Wittenburg (MPG, NL)
"Interoperability via Transforms"
Edward Loper (Brandeis University, USA)
"Ontology of Language Resource and Tools for Goal-oriented Functional Interoperability"
Key-Sun Choi (KAIST, KR)
"Towards Interoperability of Language Resources and Technologies (LRT) with Other Resources and Technologies"
Thierry Declerck (DFKI, DE)
Discussants: Tomaž Erjavec (Jožef Stefan Institute, SI), Chu-Ren Huang (Hong Kong Polytechnic University, HK), Timo Honkela (Helsinki University of Technology - CIS, FI), Yohei Murakami (NICT, JP)
Session 5: Translation, Localisation, Multilingualism
Language Resources and Tools for Machine Translation
Hans Uszkoreit (DFKI, DE)
There's been progress in statistical MT, and in linguistic processing (parsing, morphology, generation, etc)
Less progress, but increased use, in rule-based MT.
Increasing use of hybrid systems, and system combination.
For SMT: no good solutions for non-local grammatical phenomena; and no good solutions for (lexical and syntactic) gaps in training data
For hybrid MT: lack of confidence estimation; lack of good solution for gaps in rules
- Insufficient classification of data with respect to its domain, etc
- Insufficient parallel texts, and insufficient coverage of different domains, etc.
Outlook for Spoken Language Translation
Marcello Federico (FBK, IT)
Three Challenges for Localisation
Josef Van Genabith (Dublin City University - NCLT, IRL)
Assessing User Satisfaction with Embedded MT
Tony Hartley (University of Leeds, UK)
Institutional Translators and LRT
Josep Bonet-Heras (EC - DG Translation, LUX)
Language Technology in the European Parliament's Directorate General for Translation: Facts, Problems and Visions
Alexandros Poulis (EP - DG Translation - IT Support Unit, LUX)
'Cloud Sourcing' for the Translation Industry
Andrew Joscelyne (TAUS, FR)
Discussants: Frank Van Eynde (Katholieke Universiteit Leuven - CCL, NL), Harold Somers (Dublin City University - SC, IRL)