Wikidata:Lexicographical data/Best practices

This page collects best practices that lexeme contributors have established over time, often after discussing them in other fora. They may be discussed further on this page's talk page.

Should there be a lexeme for it?

  • There should be some evidence of the existence of a lexeme in a language at the time that lexeme is created on Wikidata.
    • The better documented a language is in general, the more the above should be treated as a requirement rather than merely a best practice.
      • As a result, for languages like English, French, Spanish, Mandarin, Russian, and Arabic, which are supported by nation-states and which, being used to communicate all sorts of information among very large groups of people, are expected to have diverse vocabularies, this should be taken as obligatory regardless of one's fluency in that language.
      • For less well-documented languages like Breton, Sindhi, Acehnese, and Guarani, this remains merely a strong recommendation: once a resource is found for that language, attempts should be made to use it as evidence for as many existing lexemes in that language as possible.
      • For even less well-documented languages like Skolt Sami, Igbo, Angika, and Cia-Cia, this is much less binding even as a recommendation—especially when you are a native speaker of that language and can thus vouch for the use of a particular lexeme in your language community.
    • The evidence for the existence of a lexeme may be indicated in a number of ways:
  • In general, while individual words that aren't merely inflections of other words might warrant lexemes, non-idiomatic phrases typically do not warrant them, since they may be treated as the sum of their parts.
    • This does not, however, rule out adding non-idiomatic senses to a lexeme that has idiomatic meanings and already records those idiomatic meanings as senses.

Lemmata

  • The lemma of a lexeme should ideally be the representation of that lexeme that is provided in a dictionary. What representation this is will generally depend on the lexeme's language and lexical category.
    • Take Indo-European languages: for nouns and adjectives, this may reflect some combination of nominative case, singular number, and masculine gender; for verbs, this may be the infinitival or verbal noun form.
    • Other languages may present lemmata differently; a non-exhaustive list of examples follows:
      • An Arabic verb generally uses the masculine third-person singular perfect active indicative as a lemma ('كَتَبَ' for 'to write').
      • A Korean verb generally uses the verb stem followed by the dedicated citation suffix '-다' ('가다' for 'to go').
      • An isiZulu verb generally uses the verb stem on its own, including the final vowel 'a' ('shaywa' for 'to be struck').
  • If there are multiple scripts in which a language is generally written, it is desirable for the lemma to contain a representation for each script.
    • Where a lemma is represented identically across several related scripts or orthographies, it may not be necessary to repeat it for each of them.
      • For those Mandarin lexemes which have not been affected by character simplification, a single lemma with code 'zh' suffices.
      • For those Esperanto lexemes which do not change under 'hsistemo' or 'xsistemo', a single lemma with code 'eo' suffices.
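
The collapsing rule above can be sketched in Python. This is only an illustration: the helper name `build_lemmas` is hypothetical, the variant codes ('zh-hans', 'zh-hant', 'eo-hsistemo', 'eo-xsistemo') are assumed for the example, and keeping one lemma per code when the spellings differ is an assumption rather than something this page prescribes. The returned mapping mimics the shape of the `lemmas` field in a lexeme's JSON.

```python
def build_lemmas(base_code, spellings):
    """Return a `lemmas`-style mapping for a lexeme.

    `spellings` maps a language code to the lemma as written under that
    code. If every spelling is identical, a single lemma under
    `base_code` is returned.
    """
    if not spellings:
        raise ValueError("at least one spelling is required")
    distinct = set(spellings.values())
    if len(distinct) == 1:
        value = distinct.pop()
        return {base_code: {"language": base_code, "value": value}}
    # Assumption: when the spellings differ, keep one lemma per code.
    return {
        code: {"language": code, "value": value}
        for code, value in spellings.items()
    }


# Identical in simplified and traditional characters: one 'zh' lemma.
unchanged = build_lemmas("zh", {"zh-hans": "人", "zh-hant": "人"})
# Affected by simplification: one lemma per script.
changed = build_lemmas("zh", {"zh-hans": "爱", "zh-hant": "愛"})
# No letters that change under 'hsistemo' or 'xsistemo': one 'eo' lemma.
esperanto = build_lemmas(
    "eo", {"eo": "domo", "eo-hsistemo": "domo", "eo-xsistemo": "domo"}
)
```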

Lexical categories

  • In general, an instance of (P31) value on a lexeme should be more specific than the lexeme's lexical category.

Lexeme statements

Derivations

A lexeme that is an acronym should qualify its derived from lexeme (P5191) statement with mode of derivation (P5886) set to acronym (Q101244) (see ffs (L406751)).
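
As a rough sketch, such a statement could be represented in Python as follows, in the general shape of Wikibase statement JSON. The helper names are hypothetical, and 'L0' is a placeholder for the source lexeme, not a real lexeme ID.

```python
DERIVED_FROM = "P5191"        # derived from lexeme
MODE_OF_DERIVATION = "P5886"  # mode of derivation
ACRONYM = "Q101244"           # acronym


def entity_snak(prop, entity_id, entity_type):
    """Build a value snak pointing at an item or lexeme."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": entity_type, "id": entity_id},
        },
    }


def acronym_derivation(source_lexeme_id):
    """Build a P5191 statement qualified with P5886 = Q101244."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": entity_snak(DERIVED_FROM, source_lexeme_id, "lexeme"),
        "qualifiers": {
            MODE_OF_DERIVATION: [
                entity_snak(MODE_OF_DERIVATION, ACRONYM, "item")
            ]
        },
    }


def is_marked_as_acronym(statement):
    """Check whether a statement carries the acronym qualifier."""
    qualifiers = statement.get("qualifiers", {}).get(MODE_OF_DERIVATION, [])
    return any(
        q.get("datavalue", {}).get("value", {}).get("id") == ACRONYM
        for q in qualifiers
    )


# 'L0' is a placeholder ID used only for illustration.
statement = acronym_derivation("L0")
```

A check like `is_marked_as_acronym` could be run over a lexeme's existing P5191 statements to spot derivations that still lack the qualifier.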

Forms

Much of the section 'Should there be a lexeme for it?' also applies here.

To help establish the existence and use of a lexeme, at least one form should be referenced, for instance on a usage example (P5831) statement qualified with subject form (P5830) pointing to the form in question, or on another statement (described by source (P1343), attested as (P7855), and attested in (P5323) are other possible properties). The goal is to have every form attested or referenced with at least one date, preferably with dates that are years apart.
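
One way to find forms that still lack such evidence is sketched below in Python. It assumes the lexeme is available as a dictionary in the general shape of Wikibase lexeme JSON, and it only looks at referenced usage example (P5831) statements qualified with subject form (P5830); the function name and the 'L0' IDs are hypothetical placeholders.

```python
USAGE_EXAMPLE = "P5831"
SUBJECT_FORM = "P5830"


def forms_without_referenced_example(lexeme):
    """List IDs of forms not covered by any referenced usage example."""
    covered = set()
    for statement in lexeme.get("claims", {}).get(USAGE_EXAMPLE, []):
        if not statement.get("references"):
            continue  # an unreferenced example does not count here
        qualifiers = statement.get("qualifiers", {}).get(SUBJECT_FORM, [])
        for qualifier in qualifiers:
            value = qualifier.get("datavalue", {}).get("value", {})
            if value.get("id"):
                covered.add(value["id"])
    return [
        form["id"]
        for form in lexeme.get("forms", [])
        if form["id"] not in covered
    ]


# Placeholder lexeme with two forms; only the first has a referenced example.
sample = {
    "forms": [{"id": "L0-F1"}, {"id": "L0-F2"}],
    "claims": {
        USAGE_EXAMPLE: [
            {
                "qualifiers": {
                    SUBJECT_FORM: [
                        {"datavalue": {"value": {"id": "L0-F1"}}}
                    ]
                },
                "references": [{"snaks": {}}],  # reference details elided
            }
        ]
    },
}

missing = forms_without_referenced_example(sample)
```

Here `missing` contains only 'L0-F2'. Forms attested through other properties, such as attested in (P5323), would need a separate check.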

Senses

Translations