🔔 News
- 02.12.23: Proceedings available at https://ceur-ws.org/Vol-3577/
- 25.08.23: Winning systems in each track announced.
- 09.08.23: Final deadline extension to August 14, 11:59 pm CEST time, for both system and paper submissions.
- 29.07.23: Deadline extended to August 10 for system and prediction submission, and August 11 for paper submission.
- 26.07.23: Test data subject entities have been released on the GitHub repo. Please run `git pull` to get the test dataset and the updated evaluate.py script. Submit your predictions on CodaLab to get your scores.
- 11.07.23: Submit your validation data predictions on CodaLab to get a score now (this is optional; the test data leaderboard will be separate and released later).
- 10.07.23: New baseline (GPT3 + Wikidata NED) added to repository.
- 22.05.23: v1.0 of dataset (train and validation splits) released.
- 17.05.23: Test output/system submission deadline extended to August 2, 2023. Take time to submit your strongest systems!
Task Description
Pretrained language models (LMs) like ChatGPT have advanced a range of semantic tasks and have also shown promise for knowledge extraction from the models themselves. Although several works have explored this ability in a setting called probing or prompting, the viability of knowledge base construction from LMs remains underexplored. In the 2nd edition of this challenge, we invite participants to build actual disambiguated knowledge bases from LMs, for given subjects and relations. In a crucial difference from existing probing benchmarks like LAMA (Petroni et al., 2019), we make no simplifying assumptions on relation cardinalities, i.e., a subject-entity can stand in relation with zero, one, or many object-entities. Furthermore, submissions need to go beyond just ranking predicted surface strings and must materialize disambiguated entities in the output, which will be evaluated using the established KB metrics of precision and recall.
Formally, given an input subject-entity (s) and a relation (r), the task is to predict all the correct object-entities ({o1, o2, ..., ok}) using LM probing.
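To make the input and output concrete, here is one purely illustrative instance; the variable names below are only for exposition, and the example is taken from the dataset section further down.

```python
# Illustrative view of one task instance (names are only for exposition):
# given a subject-entity and a relation, a system must return the complete set
# of disambiguated object-entities -- possibly empty -- e.g., as Wikidata QIDs.
task_input = {"SubjectEntity": "Malawi", "Relation": "CountryBordersCountry"}
expected_objects = {"Q924", "Q953", "Q1029"}  # Tanzania, Zambia, Mozambique
```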
The challenge comes with two tracks:
- Track 1: a small-model track with low computational requirements (<1 billion parameters)
- Track 2: an open track, where participants can use any LM of their choice
Track 1: Small-model track (<1 billion parameters)
Participants are free to use any pretrained LM containing at most 1 billion parameters. This includes, for instance, BERT, BART, GPT-2, and variants of OPT. The input tuples can be paraphrased through prompt engineering techniques (e.g., AutoPrompt, LPAQA), and participants can also use prompt ensembles for better output generation (a minimal probing sketch is given after the track descriptions below). However, using context (e.g., verbalizing tuples using supporting sentences) is not allowed in this track.
Track 2: Open track
In the open track, the task is the same as in the small-model track. However:
- LMs of any size, e.g., GPT-3, can be probed.
- Use of context is allowed for LM generation, e.g., context retrieval as in REALM and factual predictions with context.
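For illustration, a minimal Track 1-style probing sketch with a HuggingFace masked LM is shown below. The prompt wording, the model choice (bert-large-cased, well under 1 billion parameters), and the score threshold are assumptions to be tuned on the validation split; this is a sketch, not one of the provided baselines.

```python
# Minimal masked-LM probing sketch (illustrative only, not a provided baseline).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-large-cased")  # <1B parameters

def probe(subject: str, prompt_template: str, threshold: float = 0.05) -> list[str]:
    """Return candidate object surface forms whose LM score exceeds a threshold."""
    prompt = prompt_template.format(subject=subject, mask=fill_mask.tokenizer.mask_token)
    return [
        pred["token_str"].strip()
        for pred in fill_mask(prompt, top_k=20)
        if pred["score"] >= threshold
    ]

# Example prompt for CountryBordersCountry; paraphrases and prompt ensembles are allowed.
candidates = probe("Malawi", "{subject} shares a land border with {mask}.")
```

Note that a single mask token only yields single-token surface forms; handling multi-token objects and mapping surface forms to disambiguated Wikidata IDs remain part of the task and are what the provided baselines address.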
🏆 Winners
Track | System | Avg. Precision | Avg. Recall | Avg. F1-score |
---|---|---|---|---|
1 | Expanding the Vocabulary of BERT for Knowledge Base Construction (Dong Yang, Xu Wang, Remzi Celebi) | 0.395 | 0.393 | 0.323 |
2 | Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata (Bohui Zhang, Ioannis Reklos, Nitisha Jain, Albert Meroño-Peñuela, Elena Simperl) | 0.715 | 0.726 | 0.701 |
Dataset
We release a dataset (train and validation splits) for a diverse set of 21 relations, each covering a different set of subject-entities, along with the complete list of ground-truth object-entities per subject-relation pair. The total number of object-entities varies for a given subject-relation pair. The train subject-relation-object triples can be used for training or probing the language models in any form, while the validation split can be used for hyperparameter tuning. Further details on the relations are given below:
Relation | Description | Example |
---|---|---|
BandHasMember | band (s) has a member (o) | (Q941293, N.E.R.D., BandHasMember, [Q14313, Q706641, Q2584176], [Pharrell Williams, Chad Hugo, Shay Haley]) |
CityLocatedAtRiver | city (s) is located at the river (o) | (Q365, Cologne, CityLocatedAtRiver, [Q584], [Rhine]) |
CompanyHasParentOrganisation | company (s) has another company (o) as its parent organization | (Q39898, NSU, CompanyHasParentOrganisation, [Q246], [Volkswagen]) |
CompoundHasParts | chemical compound (s) consists of an element (o) | (Q150843, Hexadecane, CompoundHasParts, [Q623, Q556], [carbon, hydrogen]) |
CountryBordersCountry | country (s) shares a land border with another country (o) | (Q1020, Malawi, CountryBordersCountry, [Q924, Q953, Q1029], [Tanzania, Zambia, Mozambique]) |
CountryHasOfficialLanguage | country (s) has an official language (o) | (Q334, Singapore, CountryHasOfficialLanguage, [Q1860, Q5885, Q9237, Q727694], [English, Tamil, Malay, Standard Mandarin]) |
CountryHasStates | country (s) has the state (o) | (Q702, Federated States of Micronesia, CountryHasStates, [Q221684, Q1785093, Q7771127, Q11342951], [Chuuk, Kosrae State, Pohnpei State, Yap State]) |
FootballerPlaysPosition | footballer (s) plays in the position (o) | (Q455462, Antoine Griezmann, FootballerPlaysPosition, [Q280658], [forward]) |
PersonCauseOfDeath | person (s) died due to a cause (o) | (Q5238609, David Plotz, PersonCauseOfDeath, [ ], [ ]) |
PersonHasAutobiography | person (s) has the autobiography (o) | (Q6279, Joe Biden, PersonHasAutobiography, [Q100221747], [Promise Me Dad]) |
PersonHasEmployer | person (s) is or was employed by a company (o) | (Q11476943, Yōichi Shimada, PersonHasEmployer, [Q4845464], [Fukui Prefectural University]) |
PersonHasNobelPrize | person (s) has the Nobel Prize (o) | (Q65989, Wolfgang Pauli, PersonHasNobelPrize, [Q38104], [Nobel Prize in Physics]) |
PersonHasNumberOfChildren | person (s) has (o) number of children | (Q7599711, Stanley Johnson, PersonHasNumberOfChildren, [6], [6]) |
PersonHasPlaceOfDeath | person (s) died at a location (o) | (Q4369225, Alina Pokrovskaya, PersonHasPlaceOfDeath, [ ], [ ]) |
PersonHasProfession | person (s) held a profession (o) | (Q468043, Jon Elster, PersonHasProfession, [Q121594, Q188094, Q1238570, Q2306091, Q4964182], [professor, economist, political scientist, sociologist, philosopher]) |
PersonHasSpouse | person (s) has spouse (o) | (Q5111202, Chrissy Teigen, PersonHasSpouse, [Q44857], [John Legend]) |
PersonPlaysInstrument | person (s) plays an instrument (o) | (Q15994935, Emma Blackery, PersonPlaysInstrument, [Q6607, Q61285, Q17172850], [guitar, ukulele, voice]) |
PersonSpeaksLanguage | person (s) speaks the language (o) | (Q18958964, Witold Andrzejewski, PersonSpeaksLanguage, [Q809], [Polish]) |
RiverBasinsCountry | river (s) has its basin in a country (o) | (Q45403, Brahmaputra River, RiverBasinsCountry, [Q148, Q668], [People's Republic of China, India]) |
SeriesHasNumberOfEpisodes | series (s) has (o) number of episodes | (Q12403564, Euphoria, SeriesHasNumberOfEpisodes, [10], [10]) |
StateBordersState | state (s) shares a border with another state (o) | (Q1204, Illinois, StateBordersState, [Q1166, Q1415, Q1537, Q1546, Q1581, Q1603], [Michigan, Indiana, Wisconsin, Iowa, Missouri, Kentucky]) |
Each row in the dataset files consists of (1) the subject-entity ID, (2) the subject-entity, (3) the list of all possible object-entity IDs, (4) the list of all possible object-entities, and (5) the relation. Please read the data format section for more details. When a subject has zero valid objects, the ground truth is an empty list, e.g., (Q2283, Microsoft, [ ], [ ], CompanyHasParentOrganisation).
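For illustration, one row might look as follows when read from a JSON Lines file; the file path and field names in this sketch are assumptions, so please consult the data format section of the repository for the authoritative schema.

```python
import json

# Read one illustrative row from the train split (path and field names are
# assumptions; see the repository's data format section for the actual schema).
with open("data/train.jsonl") as f:
    row = json.loads(next(f))

# A row along the lines of:
# {"SubjectEntityID": "Q1020", "SubjectEntity": "Malawi",
#  "ObjectEntitiesID": ["Q924", "Q953", "Q1029"],
#  "ObjectEntities": ["Tanzania", "Zambia", "Mozambique"],
#  "Relation": "CountryBordersCountry"}
```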
Dataset Characteristics
For each of the 21 relations, the number of unique subject-entities in the train, dev, and test splits is given in the GitHub repo. The minimum and maximum numbers of object-entities for each relation are given below (a sketch for reproducing these counts appears after the table). If the minimum value is 0, the subject-entity can have zero valid object-entities for that relation.
Relation | Train [min, max] | Val [min, max] | Test [min, max] |
---|---|---|---|
BandHasMember | [2, 15] | [2, 16] | [2, 16] |
CityLocatedAtRiver | [1, 9] | [1, 5] | [1, 9] |
CompanyHasParentOrganisation | [0, 5] | [0, 3] | [0, 5] |
CompoundHasParts | [2, 6] | [2, 5] | [2, 6] |
CountryBordersCountry | [1, 17] | [1, 10] | [1, 17] |
CountryHasOfficialLanguage | [1, 16] | [1, 11] | [1, 16] |
CountryHasStates | [1, 20] | [1, 20] | [1, 20] |
FootballerPlaysPosition | [1, 2] | [1, 3] | [1, 2] |
PersonCauseOfDeath | [0, 1] | [0, 3] | [0, 1] |
PersonHasAutobiography | [1, 4] | [1, 4] | [1, 4] |
PersonHasEmployer | [1, 6] | [1, 13] | [1, 6] |
PersonHasNobelPrize | [0, 1] | [0, 2] | [0, 1] |
PersonHasNumberOfChildren | [1, 1] | [1, 2] | [1, 1] |
PersonHasPlaceOfDeath | [0, 1] | [0, 1] | [0, 1] |
PersonHasProfession | [1, 11] | [1, 12] | [1, 11] |
PersonHasSpouse | [1, 3] | [1, 3] | [1, 3] |
PersonPlaysInstrument | [1, 8] | [1, 8] | [1, 8] |
PersonSpeaksLanguage | [1, 10] | [1, 4] | [1, 10] |
RiverBasinsCountry | [1, 9] | [1, 5] | [1, 9] |
SeriesHasNumberOfEpisodes | [1, 2] | [1, 1] | [1, 2] |
StateBordersState | [1, 16] | [1, 12] | [1, 16] |
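The [min, max] counts above can be reproduced with a short script along the following lines, again assuming the illustrative JSON Lines layout and field names from the data format sketch above.

```python
import json
from collections import defaultdict

# Collect the number of ground-truth objects per relation
# (field names are assumptions; adapt to the actual schema).
object_counts = defaultdict(list)
with open("data/train.jsonl") as f:
    for line in f:
        row = json.loads(line)
        object_counts[row["Relation"]].append(len(row["ObjectEntitiesID"]))

for relation, counts in sorted(object_counts.items()):
    print(relation, [min(counts), max(counts)])
```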
Task Evaluation
For each test instance, predictions are evaluated by calculating precision and recall against the ground-truth values. The final macro-averaged F1-score is used to rank the participating systems.
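As a rough sketch, these metrics can be computed per instance from the sets of predicted and ground-truth object IDs and then averaged; the empty-set conventions below are assumptions, and the evaluate.py script in the GitHub repo is the authoritative implementation.

```python
def precision_recall(pred: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision/recall for one instance (empty-set handling is an assumption)."""
    if not pred and not gold:           # correctly predicting "no objects"
        return 1.0, 1.0
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def macro_f1(instances: list[tuple[set[str], set[str]]]) -> float:
    """Macro-average the per-instance F1-scores."""
    f1_scores = []
    for pred, gold in instances:
        p, r = precision_recall(pred, gold)
        f1_scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1_scores) / len(f1_scores)
```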
Baselines
We provide several baselines:
- Standard prompt for HuggingFace models with Wikidata default disambiguation: These baselines can be instantiated with various HuggingFace models (e.g., BERT, OPT), generate entity surface forms, and use the Wikidata entity disambiguation API to generate IDs (a sketch of this disambiguation step is given after this list).
- Few-shot GPT-3 directly predicting IDs: This baseline uses a few samples to instruct GPT-3 to directly predict Wikidata IDs.
- Few-shot GPT-3 w/ NED: Like above, but predicting surface forms that are disambiguated via Wikidata's default disambiguation.
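The disambiguation step used by the first and third baselines can be sketched roughly as follows, assuming the public Wikidata wbsearchentities API; the actual baseline code in the repository may differ in its details.

```python
import requests

def disambiguate(surface_form: str) -> str | None:
    """Map a predicted surface form to the top-ranked Wikidata entity ID, or None."""
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": surface_form,
            "language": "en",
            "format": "json",
        },
        timeout=10,
    )
    results = response.json().get("search", [])
    return results[0]["id"] if results else None

# e.g., disambiguate("Rhine") is expected to return "Q584"
```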
Method | Avg. Precision | Avg. Recall | Avg. F1-score |
---|---|---|---|
GPT-3 NED (Curie model) | 0.308 | 0.210 | 0.218 |
GPT-3 IDs directly (Curie model) | 0.126 | 0.060 | 0.061 |
BERT | 0.368 | 0.161 | 0.142 |
Submission Details
Participants are required to submit:
- A system implementing the LM probing approach, uploaded to a public GitHub repo
- The output for the test dataset subject entities, in the same GitHub repo
- A system description in PDF format (5-12 pages, CEUR workshop style), mentioning the GitHub repo.
The PDF must be uploaded on OpenReview. Additionally, there is an optional CodaLab live leaderboard that participants can submit to. The test dataset is initially hidden to preserve the integrity of results, and will be released 1 week before the final deadline. The output files for the test subject-entities must be formatted as described here, and submitted along with the system and its description. The top-performing systems will get an opportunity to present their ideas and results during the conference, and the challenge proceedings will be submitted to the CEUR publication system.
Organizers
Huawei Technology R&D UK