Deduplication Challenge FAQ

What general instructions and advice are there?

The competition is focused on innovative approaches to identifying job duplicates, with the aim of more accurately estimating the number of currently open jobs on the market. Certain aspects of the competition have therefore been left open, so that teams can explore them and find more effective approaches.

We advise teams to focus on the actual content. For example, two job postings can have different titles and descriptions (that is, different wording) but advertise the same job position, whereas positions with similar descriptions in different countries are considered different job positions, since they would be taken by different people.

What types of duplicates do you distinguish between?

In order to aid competitors in creating solutions, we are providing the following content descriptions:

  • FULL duplicates: two job advertisements are identical
  • SEMANTIC duplicates: two job advertisements are describing the same position but are written differently
  • TEMPORAL duplicates: the same job advertisement was posted at a different point in time
  • PARTIAL duplicates: two job advertisements may be referring to the same job position, but one has missing information as compared to the other, and we cannot state with certainty whether it is in fact the same position
  • NON-DUPLICATES: two job advertisements are not describing the same job position. Teams are not required to submit non-duplicates.

How are full duplicates defined?

Full duplicates represent the posting of the same job position as two separate entries. Two job advertisements can be considered as full duplicates when all of the values are the same.

Below are some examples:

  • If the company (or country) is different, then the jobs should not be counted as "full duplicates".
  • Entries which are full duplicates can have descriptions written in either upper- or lower-case text.
  • Two identical partial duplicates, 1 and 2, of the same record (job posting) 3 are considered full duplicates of each other.
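
As a minimal sketch of the rules above, the Python snippet below treats two advertisements as full duplicates when every field has the same value, ignoring letter case in the description only. The field names (`title`, `company`, `country`, `description`) and the dictionary representation are illustrative assumptions, not the actual schema of the dataset.

```python
# Assumption: only the description is compared case-insensitively,
# as that is the only field the FAQ mentions in this respect.
CASE_INSENSITIVE_FIELDS = {"description"}


def comparable(field, value):
    """Return the value in the form used for comparison."""
    if field in CASE_INSENSITIVE_FIELDS and isinstance(value, str):
        return value.casefold()
    return value


def is_full_duplicate(ad_a, ad_b):
    """True when both ads have the same fields and all values match."""
    if ad_a.keys() != ad_b.keys():
        return False
    return all(
        comparable(field, ad_a[field]) == comparable(field, ad_b[field])
        for field in ad_a
    )


# Hypothetical records for illustration only.
ad_1 = {"title": "Data Analyst", "company": "Acme", "country": "DE",
        "description": "ANALYSE SALES DATA"}
ad_2 = {"title": "Data Analyst", "company": "Acme", "country": "DE",
        "description": "analyse sales data"}
ad_3 = dict(ad_2, country="FR")  # same text, different country

print(is_full_duplicate(ad_1, ad_2))  # True: only the letter case differs
print(is_full_duplicate(ad_1, ad_3))  # False: the country differs
```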

How are partial duplicates defined?

Partial duplicates are those where one of the fields is "missing". For instance, if one job advertisement has a “missing” company name, but all other information is the same as in the second job advertisement, then the two job advertisements are partial duplicates. If more than one characteristic is missing, then they are non-duplicates. In other words, two job advertisements are considered partial duplicates if they describe the SAME job position AND one job advertisement is missing a characteristic of the other one.

As partial duplication is not a transitive relationship, it might be the case that 1 is deemed to be a partial duplicate of 2 and 2 is a partial duplicate of 3, whereas 1 does not meet the criterion to be a partial duplicate of 3. In that case, only the pairs (1, 2) and (2, 3) would be submitted.
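
The definition and its lack of transitivity can be illustrated with a short Python sketch. It assumes that a "missing" value is represented as `None` or an empty string and that advertisements carry the hypothetical fields listed in `FIELDS`; neither assumption comes from the competition data.

```python
from itertools import combinations

# Hypothetical field names, for illustration only.
FIELDS = ("title", "company", "country", "description")


def is_missing(value):
    """Assumption: None or an empty string marks a missing value."""
    return value is None or value == ""


def is_partial_duplicate(ad_a, ad_b):
    """True when exactly one field is missing in one of the two ads
    and all remaining fields are identical."""
    missing_count = 0
    for field in FIELDS:
        a, b = ad_a.get(field), ad_b.get(field)
        a_missing, b_missing = is_missing(a), is_missing(b)
        if a_missing and b_missing:
            continue  # neither ad provides this field
        if a_missing or b_missing:
            missing_count += 1
        elif a != b:
            return False  # both present but different: not the same job
    return missing_count == 1


complete = {"title": "Nurse", "company": "Clinic", "country": "DE",
            "description": "Night shifts"}
ads = {
    1: dict(complete, company=""),   # company missing
    2: complete,                     # nothing missing
    3: dict(complete, country=""),   # country missing
}

partial_pairs = [
    (i, j)
    for i, j in combinations(sorted(ads), 2)
    if is_partial_duplicate(ads[i], ads[j])
]
print(partial_pairs)  # [(1, 2), (2, 3)]
```

Records 1 and 3 differ from each other in two missing characteristics, so that pair is not reported even though each is a partial duplicate of record 2.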

How are temporal duplicates defined?

Temporal duplicates are two postings of the same job advertisement made at two different points in time.

How are semantic duplicates defined?

Semantic duplicates describe the same job position but are worded differently.

For example, if the title, description and other fields of Job#1 carry the same semantic information as the title, description and other fields of Job#2, the two jobs are semantic duplicates.

What should the "type" column contain if multiple types of duplicates apply for a relationship?

The full duplicate is the least specific type, the semantic duplicate is more specific, and the temporal and partial duplicates are the most specific types.

If multiple labels are possible, please follow the format below:

What is the proper format for the submission?

Transitivity is not inferred, so the teams need to provide all individual pairs they deem to be duplicates.

For instance, assuming that the team considers the following IDs to be full duplicates: [1, 2, 3, 4] then every pair has to be explicitly indicated, as follows:
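
As a minimal sketch of this expansion, the Python snippet below enumerates every unordered pair within a group of IDs. The function name `expand_group` and the `(id1, id2, type)` row layout are illustrative assumptions and do not represent the official submission format.

```python
from itertools import combinations


def expand_group(ids, duplicate_type):
    """Return one (id1, id2, type) row for every unordered pair in ids."""
    return [
        (first, second, duplicate_type)
        for first, second in combinations(sorted(ids), 2)
    ]


rows = expand_group([1, 2, 3, 4], "FULL")
for row in rows:
    print(row)
# (1, 2, 'FULL')
# (1, 3, 'FULL')
# (1, 4, 'FULL')
# (2, 3, 'FULL')
# (2, 4, 'FULL')
# (3, 4, 'FULL')
```

A group of n mutually duplicate records therefore yields n(n-1)/2 rows, six in this case.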


How many records (1 record = 1 job posting) are there in the data set?

In the dataset, we provide the job advertisements in the form they are found on the web. We do not format the title or body of the advertisement, as a result of which the body can, for example, contain newline characters.

The number of records in the dataset is 112 006. Each team is required to parse the data set correctly.
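
Because the body of an advertisement can contain newline characters, counting lines in the file will overstate the number of records. The sketch below assumes the dataset is a CSV file with quoted fields (an assumption, since the file format is not described here) and uses Python's standard `csv` module, which keeps quoted newlines inside a single field. The column names in the sample are hypothetical.

```python
import csv
import io


def read_records(handle):
    """Parse CSV text whose quoted fields may span several lines."""
    return list(csv.DictReader(handle))


# In-memory stand-in for the real file; newline="" leaves line endings
# untouched so that the csv module can handle them itself.
sample = io.StringIO(
    'id,title,description\n'
    '1,Data Analyst,"Analyse sales data.\nReport weekly."\n'
    '2,Nurse,"Night shifts."\n',
    newline="",
)

records = read_records(sample)
print(len(records))                      # 2, although the text has 4 lines
print(repr(records[0]["description"]))   # 'Analyse sales data.\nReport weekly.'
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where the path and encoding depend on the file you received. A correct parse of the full dataset should yield 112 006 records.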

What information is shown in the leaderboard?

The teams can make up to 10 successful submissions, with the intention of enabling them to improve their results by adjusting their approach. The purpose of the leaderboard is to enable the teams to (1) compare their results with those of other teams and (2) compare their own results between submissions, thereby improving their method to reach better results. For each team, the submission which yields the best result will be in the running for receiving one or more awards.

Nowcasting Competition FAQ

I am not from Europe - can I still participate?

Yes, the European Statistics Awards for Nowcasting are open for participants from all continents. Please refer to the eligibility overview sections of the three currently ongoing competitions:

Where are the data that I should use for nowcasting?

Unlike a traditional hackathon, teams do not get any particular data to use for the European Statistics Awards for Nowcasting. To edge out your competitors, you are free to use whichever auxiliary data you think are the most predictive ones – but finding them will be a challenge!

Can I submit nowcasts for European aggregates as well as for countries?

You can only submit nowcasts for countries. To maximise your chances of winning, we encourage you to submit for as many countries as possible.

I missed the 29 September registration deadline – what should I do?

If you did not register by 29 September, you will not be able to submit a nowcast for September 2022. However, all teams that submit six consecutive nowcasts are still in the running (please refer to the Country Score as defined in the Glossary), so you can still take part in the EU nowcasting awards competition even if you sign up in October or November.

Still, for models of equal predictive value, teams that already begin submitting for September have a somewhat better chance of achieving a higher accuracy score, since it is the best 6-month streak of nowcasts that counts!

How do I join a team to participate in this competition?

To join an existing team:
Only team leaders can designate the members of a team. If you are a group of colleagues organising a team together, please make sure that the team leader adds you to the team – that is the only way for you to join it.

To add members to an existing team (if you are the team leader):
If you wish to have more than 1 member, you may participate with up to 4 other individuals. After logging in, go to Settings (top right-hand corner). In the text box “Team members”, add the first name and last name (surname) of each member, together with their email, nationality and country of residence. Please make sure that each member is added on a new line in the text box.

To create a new team:
You may form your own team, with you being the only member, by registering at:

Can our first point estimate be for e.g. October 2022 or November 2022 or are we obliged to submit point estimates for September 2022?

You are not obliged to make a submission for September 2022. Your first submission can be as late as November 2022. However, you are strongly encouraged to make your submissions as soon as possible, as this will increase your chance of success and lower the probability that a submission will be invalid due to the late publication of the April 2022 European statistics for a time series.

Can I change my model?

To get a higher accuracy score, you might modify the model that you use for an entry. Please note, however, that if you are in the running for the reproducibility awards, then the best integrity scores are awarded if you consistently apply the same model.

How do I make a submission?

In order to make your submission, create a zip file containing

  1. the point_estimates.json file, and
  2. the accuracy_approach_description.docx file.

Zip the files directly, not in a folder. Then:

  • navigate to “Participate > Submit / View Results”,
  • select the corresponding “Reference Period”, and
  • click the “Upload a Submission” button. This will open your file explorer, from where you can select the zip file you wish to upload.

The system returns an error if you upload the JSON file on its own instead of a zip file.
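
As a minimal sketch, the archive can be built with Python's standard `zipfile` module so that both files sit in the root of the zip rather than in a subfolder. The helper names `build_submission` and `files_in_root` and the archive name `submission.zip` are illustrative assumptions; only the two file names come from the instructions above.

```python
import zipfile
from pathlib import Path


def build_submission(zip_path, files):
    """Write the given files into the root of a new zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            # arcname drops any directory part, so the file is stored
            # directly in the root of the archive.
            archive.write(file, arcname=Path(file).name)


def files_in_root(zip_path):
    """True when no entry of the archive lies inside a folder."""
    with zipfile.ZipFile(zip_path) as archive:
        return all("/" not in name for name in archive.namelist())


# Example usage, assuming both files exist in the current directory:
# build_submission(
#     "submission.zip",
#     ["point_estimates.json", "accuracy_approach_description.docx"],
# )
# print(files_in_root("submission.zip"))
```

Checking the archive with `files_in_root` before uploading helps rule out the subfolder layout that the system rejects.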

I am trying to upload my json files and receiving an error. Why?

The system returns an error if you upload the JSON file on its own instead of a zip file. In the zip file, the files have to be directly in the root, not in a subfolder.

For further details, see how to submit.

Is it a must to submit the result in JSON?

Yes, the submission must be made in JSON format – but the JSON files must be embedded in a zip file.

For further details, see how to submit.

Does the Accuracy_Approach_description file need to be uploaded every time while submitting a result?

Yes – each monthly submission must be accompanied by an Accuracy_Approach_description file – but you do not need to change it between submissions if you are using the same approach. If you adjust your method between submissions, then the Accuracy_Approach_description file should be updated to reflect this.

Do you expect participants to deliver a description of the approach together with the submission of the first point estimate or is it possible to deliver the description of the approach with the submission of a later point estimate?

The description must accompany every submission, including the first one. In order to be eligible to compete for the Accuracy award, each monthly submission must be accompanied by an Accuracy_Approach_description file that outlines the approach. You do not need to change it between submissions if you are using the same approach. If you adjust your method between submissions, then the Accuracy_Approach_description file should be updated to reflect this.

I made multiple submissions for September. Which submissions will enter the competition? The latest, the best, or all of them?

Only the last submission counts – it supersedes all previous submissions.

What does the leaderboard indicate?

Before the first release of European statistics for a given reference month, the leaderboard

  • indicates, for each team, the number of entries made during that month
  • is ordered by the time at which the last submission was made (and is thus completely unrelated to the forthcoming accuracy score of the team).

Once the European statistics have been released (and the entries have been evaluated) for the time series in question, the leaderboard

  • is updated for each team to reflect the performance of the best entry (out of the up to five different entries) of the team for that month
  • is ordered by the performance of the best entry.

When can I expect an answer to my question which I sent to

Questions can be sent to the competition info email at any time during the competition. All questions will be answered in a timely manner. Those relevant for other competitors will be added to the FAQ. However, in order to avoid bias, questions that arrive within the 3 days prior to a submission deadline will be answered after that deadline. Following each monthly deadline, questions will again be answered as usual.

When can I expect to receive confirmation of my registration?

A new team is confirmed one day after its registration. If a registration is made on the day of a registration deadline, it still counts as made within the requirements of the competition rules, and all eligibility requirements are considered met, even though the confirmation is issued on the following day.