European Statistics Awards - Web Intelligence - Deduplication Challenge

We are pleased to announce the outcome of the deduplication challenge – the very first round of the European Statistics Awards for Web Intelligence.

Since 2022, Eurostat has been running competitions in the fields of nowcasting and web intelligence, with the primary goal of uncovering innovative methodologies and valuable data sources that could improve the production of European statistics.

Within this programme, which will run until the end of 2025, the initial rounds of the nowcasting and web intelligence competitions are now complete.

We thank all participants for having contributed to the success of this first round through their submissions. Moving forward, we aim to learn from this experience to improve future rounds. We encourage the community of data-savvy enthusiasts to consider joining upcoming competitions that will be announced in the coming months. The next challenge will be to classify online job advertisements by occupation. Stay tuned for more details and help us produce timely and detailed European statistics.

The evaluation of the Web Intelligence Deduplication Challenge is now complete, and we are happy to announce the winners of the Accuracy Award, AccuracyPlus Award and Reproducibility Award.

Congratulations to all the winners!

The European Statistics Awards Programme Web Intelligence competitions aim to stimulate innovation in retrieving data from the World Wide Web for producing European statistics. The first challenge, on deduplication, focused on identifying potential duplicates of job postings published on the web. Deduplication is a prerequisite for producing high-quality statistics from online job advertisements, as companies often publish the same advertisement on several web portals. To avoid double counting, postings advertising the same job must be identified and removed using automatic and robust solutions that can process large amounts of data efficiently.

The competition dataset contained 112 000 online job advertisements, retrieved from around 400 websites active in the European Union. The competition organisers took unique, authentic job advertisements and created full, semantic, temporal and partial duplicates of them across different languages, producing a synthetic multilingual dataset for the competition.

The source of the original dataset is the European Web Intelligence Hub, where around 200 million online job advertisements have been collected and classified since July 2018.

The participants had to provide documented scripts in either R or Python that would identify duplicate job advertisements (full, semantic, temporal or partial duplicates). They had to address a number of challenges, including identifying duplicates within a multilingual dataset by applying cross-lingual techniques (identifying semantic duplicates among online job advertisements in different languages) and handling field mismatch (i.e. job advertisements having different field values that represent the same thing). Handling cross-linguality is particularly important when employers advertise jobs internationally.
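To make the task more concrete, the following is a minimal Python sketch of how full duplicates and textual near-duplicates might be flagged. It is not the organisers' reference method nor any winning team's solution. The field names (`id`, `title`, `description`), the 3-word shingles and the 0.6 threshold are assumptions chosen for illustration, and the sketch does not attempt semantic, temporal or cross-lingual matching.

```python
import hashlib
import re
from itertools import combinations


def normalise(text: str) -> str:
    """Lower-case the text and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"\w+", text.lower()))


def fingerprint(ad: dict) -> str:
    """Hash of the normalised title and description (hypothetical fields)."""
    payload = normalise(ad["title"]) + "\n" + normalise(ad["description"])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles taken from the normalised text."""
    words = normalise(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(ads: list, threshold: float = 0.6) -> list:
    """Return (id_a, id_b, kind) for every pair flagged as a duplicate.

    kind is "full" when the fingerprints match, and "near" when the
    descriptions' shingle sets reach the (assumed) Jaccard threshold.
    """
    results = []
    for a, b in combinations(ads, 2):
        if fingerprint(a) == fingerprint(b):
            kind = "full"
        elif jaccard(shingles(a["description"]),
                     shingles(b["description"])) >= threshold:
            kind = "near"
        else:
            continue
        results.append((a["id"], b["id"], kind))
    return results
```

Comparing every pair is quadratic in the number of advertisements, which for 112 000 advertisements means roughly 6.3 billion comparisons. A production solution would therefore likely need a blocking or indexing step before pairwise comparison, and a multilingual text representation to catch cross-lingual semantic duplicates.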

The deduplication challenge was launched in December 2022, with final deadlines in March 2023 for the submission of duplicates and in April 2023 for documentation.

Participation was broad: a total of 69 teams, comprising 137 individuals from 17 countries, signed up for this challenge. The results of the evaluation are announced below.

The participants were competing for three types of awards:

Accuracy – for identifying as many duplicates as possible within the synthetic dataset created by the organising team. The Accuracy Award addressed the cross-lingual aspects of the competition.
AccuracyPlus – a "discovery" prize, rewarding teams that managed to find potential duplicates not identified by the organising team. Because no gold standard exists for such duplicates, the AccuracyPlus scoring was based on inter-team agreement.
Reproducibility – for the most reproducible and scalable solutions, suitable for regular production.
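The idea of scoring by inter-team agreement can be illustrated with a short Python sketch. The exact AccuracyPlus scoring rule is not described here, so everything below is an assumption for illustration: a pair outside the organisers' known duplicates counts as "agreed" once at least `min_teams` teams report it, and a team's score is the number of agreed pairs it submitted. The function names and the threshold of two teams are hypothetical.

```python
from collections import Counter


def canonical(pair):
    """Order a pair of ad ids so that (a, b) and (b, a) are the same pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def agreed_pairs(submissions, known_pairs, min_teams=2):
    """Pairs outside the known set reported by at least `min_teams` teams.

    submissions maps a team name to the pairs it reported;
    known_pairs are the duplicates already known to the organisers.
    Returns a dict mapping each agreed pair to its number of votes.
    """
    known = {canonical(p) for p in known_pairs}
    votes = Counter()
    for pairs in submissions.values():
        # Using a set ensures each team votes at most once per pair.
        for pair in {canonical(p) for p in pairs}:
            if pair not in known:
                votes[pair] += 1
    return {pair: n for pair, n in votes.items() if n >= min_teams}


def agreement_score(team_pairs, agreed):
    """Number of agreed pairs that a single team reported (assumed score)."""
    return sum(1 for p in {canonical(p) for p in team_pairs} if p in agreed)
```

Under these assumptions, a pair reported by only one team earns nothing, which limits the reward for speculative submissions that no other team confirms.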

Accuracy Award winners

Place	Prize	Team name	Team members	Country
1st place	10 000 EUR	TwoTired	Leonard Mandtler, Axel Forsch, Thomas Lüke	Germany
2nd place	4 000 EUR	TheDeDuplicators	Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis	Germany and Greece
3rd place	3 000 EUR	IDA	Jakub Żerebecki, Mikołaj Tym	Poland

AccuracyPlus Award winners

Place	Prize	Team name	Team members	Country
1st place	3 000 EUR	Smrek	Samo Kosík, Marek Cedula, Radoslav Čársky	Slovakia
2nd place	2 000 EUR	StudentiUnibo	Roberto Cornali, Sofia Camilla Todeschini	Italy
3rd place	1 000 EUR		name not disclosed (winner not yet reached)

Reproducibility Award winners

Place	Prize	Team name	Team members	Country
1st place	10 000 EUR	TheDeDuplicators	Jannic Cutura, Dimitris Petridis, Stefan Pasch, Charis Lagonidis	Germany and Greece
2nd place	4 000 EUR	Nins	Antoine Palazzolo	France
3rd place	3 000 EUR	IDA	Jakub Żerebecki, Mikołaj Tym	Poland

Solutions of the Reproducibility Award winners

Since the Reproducibility Award was intended to support the most thoroughly described and documented solutions, with the most innovative and open approach, we are happy to share the solutions that won 1st, 2nd and 3rd prize:

1st place: the TheDeDuplicators solution is available at the following links:

2nd place: the Nins solution is available at the following links:

3rd place: the IDA solution is available at the following link: