TectoMT: Modular NLP Framework

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6233)

Abstract

In the present paper we describe TectoMT, a multi-purpose open-source NLP framework. It allows for fast and efficient development of NLP applications by exploiting a wide range of software modules already integrated in TectoMT, such as tools for sentence segmentation, tokenization, morphological analysis, POS tagging, shallow and deep syntax parsing, named entity recognition, anaphora resolution, tree-to-tree translation, natural language generation, word-level alignment of parallel corpora, and other tasks. One of the most complex applications of TectoMT is the English-Czech machine translation system with transfer on the deep syntactic (tectogrammatical) layer. Several modules are also available for other languages (German, Russian, Arabic). Where possible, modules are implemented in a language-independent way, so they can be reused in many applications.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Popel, M., Žabokrtský, Z. (2010). TectoMT: Modular NLP Framework. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds) Advances in Natural Language Processing. NLP 2010. Lecture Notes in Computer Science, vol 6233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14770-8_33

  • DOI: https://doi.org/10.1007/978-3-642-14770-8_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14769-2

  • Online ISBN: 978-3-642-14770-8

  • eBook Packages: Computer Science, Computer Science (R0)
