The Web Intelligence - Deduplication Challenge
We are pleased to announce the outcome of the deduplication challenge – the very first round of the European Statistics Awards on web intelligence.
Since 2022, Eurostat has been running competitions in the fields of nowcasting and web intelligence, with the primary goal of uncovering innovative methodologies and valuable data resources that could improve the production of European statistics.
Within this programme, which will run until the end of 2025, the initial rounds of the nowcasting and web intelligence competitions are now complete.
We thank all participants for having contributed to the success of this first round through their submissions. Moving forward, we aim to learn from this experience to improve future rounds. We encourage the community of data-savvy enthusiasts to consider joining upcoming competitions that will be announced in the coming months. The next challenge will be to classify online job advertisements by occupation. Stay tuned for more details and help us produce timely and detailed European statistics.
The evaluation of the Web Intelligence Deduplication Challenge is now complete, and we are happy to announce the winners of the Accuracy Award, AccuracyPlus Award and Reproducibility Award.
Congratulations to all the winners!
Web Intelligence – 1st round – The online job advertisement Deduplication Challenge
The Web Intelligence competitions of the European Statistics Awards Programme aim to stimulate innovation in retrieving data from the World Wide Web for producing European statistics. The first challenge, on deduplication, focused on identifying potential duplicates of job postings published on the web. Deduplication is a prerequisite for producing high-quality statistics from online job advertisements, because companies often publish the same advertisement on several web portals. To avoid double counting, postings that advertise the same job must be identified and removed using automatic, robust solutions that can process large volumes of data efficiently.
The competition dataset contained 112 000 online job advertisements, retrieved from around 400 websites active in the European Union. The competition organisers have taken unique authentic job advertisements and created full, semantic, temporal and partial duplicates across different languages, thus creating a synthetic multilingual dataset for the competition.
The source of the original dataset is the European Web Intelligence Hub, where around 200 million online job advertisements have been collected and classified since July 2018.
The participants had to provide documented scripts in either R or Python that would identify duplicate job advertisements (full duplicate, semantic duplicate, temporal duplicate or partial duplicate). They had to address a number of challenges, including identifying duplicates within a multilingual dataset by applying cross-lingual techniques (identifying semantic duplicates among online job advertisements in different languages) and field mismatch (i.e. job advertisements with different field values that represent the same thing). Handling cross-linguality is especially important when employers advertise jobs internationally.
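To make the task concrete, the following is a minimal sketch of how full duplicates and near-duplicates might be flagged within a single language. It is not any participant's solution: the field names `id`, `title` and `description`, the 3-word shingle size and the 0.5 threshold are all assumptions chosen for illustration. Full duplicates are caught by hashing normalised text, and near-duplicates by Jaccard similarity over word shingles.

```python
import hashlib
import re
from itertools import combinations


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(ad):
    """SHA-256 of the normalised title and description (hypothetical fields)."""
    key = normalize(ad["title"]) + "\n" + normalize(ad["description"])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of n-word shingles taken from the normalised text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(ads, threshold=0.5):
    """Return (id_a, id_b, kind) for every pair flagged as 'full' or 'near'."""
    prints = {ad["id"]: fingerprint(ad) for ad in ads}
    shingle_sets = {ad["id"]: shingles(ad["description"]) for ad in ads}
    pairs = []
    for a, b in combinations(ads, 2):
        ia, ib = a["id"], b["id"]
        if prints[ia] == prints[ib]:
            pairs.append((ia, ib, "full"))
        elif jaccard(shingle_sets[ia], shingle_sets[ib]) >= threshold:
            pairs.append((ia, ib, "near"))
    return pairs


# Invented sample advertisements, for illustration only.
ads = [
    {"id": 1, "title": "Data Engineer",
     "description": "Build and maintain data pipelines for our analytics team in Berlin."},
    {"id": 2, "title": "data engineer ",
     "description": "Build and maintain data pipelines for our analytics team in Berlin!"},
    {"id": 3, "title": "Data Engineer (m/f/d)",
     "description": "Build and maintain data pipelines for our analytics team in Berlin. "
                    "Apply now via our portal."},
    {"id": 4, "title": "Nurse",
     "description": "Provide patient care in a busy hospital ward during night shifts."},
]
pairs = find_duplicates(ads)
```

Comparing every pair is quadratic in the number of advertisements, so a dataset of the size used in the competition would likely need a blocking or locality-sensitive hashing step first. Word overlap also cannot detect semantic duplicates across languages; that would require a cross-lingual representation such as multilingual text embeddings.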
The deduplication challenge was launched in December 2022, with a final deadline for submissions of duplicates in March and documentation in April 2023.
Participation was broad: a total of 69 teams, comprising 137 individuals from 17 countries, signed up for this challenge. The results of the evaluation are announced below.
The participants were competing for three types of awards:
- Accuracy – for identifying as many duplicates as possible within the synthetic dataset created by the organisation team. The Accuracy Award addressed the cross-linguality aspects of the competition.
- AccuracyPlus – this is a "discovery" prize, rewarding teams that manage to find potential duplicates not identified by the organisation team. To resolve the issue that no gold standard exists, the AccuracyPlus scoring was based on inter-team agreement.
- Reproducibility – for the most reproducible and scalable solutions, suitable for regular production.
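The exact AccuracyPlus scoring rule is not described here. As a rough illustration of what scoring by inter-team agreement could look like, the sketch below counts, for each team, the extra pairs it reported that enough other teams also reported. The function name `agreement_scores`, the `min_votes` threshold and the sample data are hypothetical and do not represent the organisers' actual method.

```python
from collections import Counter


def canonical(pair):
    """Order a pair of ad ids so (a, b) and (b, a) count as the same pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def agreement_scores(submissions, known_pairs, min_votes=2):
    """Score each team by how many of its extra pairs reach consensus.

    submissions: dict mapping team name -> iterable of (id_a, id_b) pairs
    known_pairs: pairs already labelled by the organisers (excluded here)
    min_votes:   hypothetical number of teams that must report a pair
    """
    known = {canonical(p) for p in known_pairs}
    # Keep only pairs that are not already part of the labelled set.
    extra = {
        team: {canonical(p) for p in pairs} - known
        for team, pairs in submissions.items()
    }
    # Count how many teams reported each extra pair.
    votes = Counter(p for pairs in extra.values() for p in pairs)
    consensus = {p for p, n in votes.items() if n >= min_votes}
    return {team: len(pairs & consensus) for team, pairs in extra.items()}


# Invented submissions, for illustration only.
submissions = {
    "A": [(1, 2), (3, 4), (5, 6)],
    "B": [(4, 3), (5, 6), (7, 8)],
    "C": [(2, 1), (5, 6)],
}
scores = agreement_scores(submissions, known_pairs=[(1, 2)])
```

A plain vote count like this rewards agreement rather than correctness, so a real scheme would probably need safeguards against teams that report very large numbers of pairs.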
Accuracy Award winners
Place | Prize | Team name | Team members | Country |
---|---|---|---|---|
1st place | 10 000 EUR | TwoTired | Leonard Mandtler, Axel Forsch, Thomas Lüke | Germany |
2nd place | 4 000 EUR | TheDeDuplicators | Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis | Germany and Greece |
3rd place | 3 000 EUR | IDA | Jakub Żerebecki, Mikołaj Tym | Poland |
AccuracyPlus Award winners
Place | Prize | Team name | Team members | Country |
---|---|---|---|---|
1st place | 3 000 EUR | Smrek | Samo Kosík, Marek Cedula, Radoslav Čársky | Slovakia |
2nd place | 2 000 EUR | StudentiUnibo | Roberto Cornali, Sofia Camilla Todeschini | Italy |
3rd place | 1 000 EUR | name not disclosed (winner not yet reached) | | |
Reproducibility Award winners
Place | Prize | Team name | Team members | Country |
---|---|---|---|---|
1st place | 10 000 EUR | TheDeDuplicators | Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis | Germany and Greece |
2nd place | 4 000 EUR | Nins | Antoine Palazzolo | France |
3rd place | 3 000 EUR | IDA | Jakub Żerebecki, Mikołaj Tym | Poland |
Solutions of the Reproducibility Award winners
Since the Reproducibility Award was intended to reward the most thoroughly described and documented solutions, with the most innovative and open approach, we are happy to share the solutions that won 1st, 2nd and 3rd prize:
1st place: the TheDeDuplicators solution is available at the following links:
2nd place: the Nins solution is available at the following links:
3rd place: the IDA solution is available at the following link: