Multinational Enterprise (MNE) Groups Data Discovery challenge:
Identify web sources with MNE Group data
Multinational enterprise groups play a major role in the European economy. In all EU and EFTA countries, they contribute substantially to the production of goods and services, employment and investments. Due to their importance, they are closely monitored by the National Statistical Institutes and Eurostat. According to the data of the Euro Groups Register (the European statistical business register on MNE groups created by the European Statistical System and managed by Eurostat), for the reference year 2022, MNE groups employed over 47 million people in EU-EFTA countries. This means that around 28 % of people employed in Europe worked for a multinational enterprise group. The majority (82 %) of them worked in a small number of large multinational enterprise groups.
The goal of the Multinational Enterprise Group Data Discovery Challenge is to develop approaches that automatically identify sources of annual financial data on the World Wide Web for MNE Groups.
Participants will receive a list of 200 MNE Groups. The discovered sources of financial data and reports should be as recent as possible, credible and trustworthy, and contain as much financial data as possible.
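For illustration only, the sketch below shows one possible shape of an automated discovery pipeline in Python: build a search query for each MNE Group, collect candidate URLs through whichever search API or crawler a team chooses (the search_web placeholder is not provided by the challenge), and rank the candidates with a simple heuristic. The keywords, the scoring rule and all function names are assumptions made for this example, not part of the challenge specification.

# Illustrative sketch of automated discovery of financial-report sources.
# search_web() is a placeholder for whatever search API or crawler a team
# chooses; the scoring heuristic and keywords are assumptions for this example.
import re

def search_web(query: str, max_results: int = 10) -> list[str]:
    """Placeholder: return candidate URLs for a query via your chosen search API."""
    raise NotImplementedError("plug in a search API or crawler here")

def score_url(url: str) -> int:
    """Crude heuristic: prefer PDFs and investor-relations style paths."""
    score = 0
    if url.lower().endswith(".pdf"):
        score += 2
    if re.search(r"investor|annual[-_ ]?report|financial", url, re.IGNORECASE):
        score += 1
    return score

def discover_sources(group_name: str, year: int = 2023) -> list[str]:
    """Return candidate report URLs for one MNE Group, best candidates first."""
    query = f"{group_name} annual report {year}"
    candidates = search_web(query)
    return sorted(set(candidates), key=score_url, reverse=True)

A real submission would of course replace the placeholder with an actual search or crawling component and a more robust ranking of the candidate sources.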
Timeline
Teams of up to 5 members are invited to register by 22 April 2025. All registered teams will receive the dataset on 23 April 2025 and begin competing simultaneously.
The competition will run for one month until 23 May 2025. Teams will have additional time until 30 June 2025 to submit a description of their developed approach.
Important Dates
Competition opening for registrations:
- 18 March 2025
Registration deadline:
- 22 April 2025
Data provision:
- 23 April 2025 – Data provided to all registered teams simultaneously
Submission deadlines:
- 23 May 2025 – Final Accuracy award submission deadline
- 30 June 2025 – Final Reusability and Innovativeness documentation submission deadline
Awards
Accuracy Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Reusability Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Innovativeness Award
First Prize EUR 5 000
Second Prize EUR 3 000
Third Prize EUR 1 000
Multinational Enterprise (MNE) Groups Data Extraction challenge:
Extract the MNE Group data
Multinational enterprise groups play a major role in the European economy. In all EU and EFTA countries, they contribute substantially to the production of goods and services, employment and investments. Due to their importance, they are closely monitored by the National Statistical Institutes and Eurostat. According to the data of the Euro Groups Register (the European statistical business register on MNE groups created by the European Statistical System and managed by Eurostat), for the reference year 2022, MNE groups employed over 47 million people in EU-EFTA countries. This means that around 28 % of people employed in Europe worked for a multinational enterprise group. The majority (82 %) of them worked in a small number of large multinational enterprise groups.
The goal of the Multinational Enterprise Group Data Extraction Challenge is to develop approaches that automatically extract important annual financial data of MNE Groups.
Participants will receive a list of 200 MNE Groups. The extraction of MNE Group financial data should be done in an automated way; the data should be as recent as possible, credible and trustworthy, and extracted in the correct format.
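Purely as an illustration of what an automated extraction step could look like, the sketch below downloads a single page and pulls out crude label/value/unit matches with a regular expression. The target labels, the pattern and the output fields are assumptions for this example; the actual required output format is defined by the challenge documentation, and the example URL is hypothetical.

# Illustrative sketch of automated extraction from a single web page.
# The target labels and the regular expression are assumptions for this
# example; the challenge defines the real required output format.
import re
import requests
from bs4 import BeautifulSoup

FIGURE_PATTERN = re.compile(
    r"(net turnover|revenue|total assets)\D{0,40}([\d.,]+)\s*(million|billion)?",
    re.IGNORECASE,
)

def extract_figures(url: str) -> list[dict]:
    """Download a page and return crude (label, value, unit) matches."""
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    return [
        {"label": label.lower(), "value": value, "unit": unit or ""}
        for label, value, unit in FIGURE_PATTERN.findall(text)
    ]

# Example call (hypothetical URL):
# print(extract_figures("https://example.com/annual-report-2023"))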
Timeline
Teams of up to 5 members are invited to register by 22 April 2025. All registered teams will receive the dataset on 23 April 2025 and begin competing simultaneously.
The competition will run until 15 June 2025. Teams will have additional time until 30 June 2025 to submit a description of their developed approach.
Important Dates
Competition opening for registrations:
- 18 March 2025
Registration deadline:
- 22 April 2025
Data provision:
- 23 April 2025 – Data provided to all registered teams simultaneously
Submission deadlines:
- 15 June 2025 – Final Accuracy award submission deadline
- 30 June 2025 – Final Reusability and Innovativeness documentation submission deadline
Awards
Accuracy Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Reusability Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Innovativeness Award
First Prize EUR 5 000
Second Prize EUR 3 000
Third Prize EUR 1 000
THE CLASSIFICATION OF OCCUPATIONS FOR ONLINE JOB ADVERTISEMENTS CHALLENGE - The second round of the European Statistics Awards for Web Intelligence
Online job advertisements contain various types of information, including a job description, information about the company looking to hire, job benefits, requirements for job seekers, etc. In order to calculate meaningful statistics, given the data collection method and the size of the online job advertisement datasets, occupational class labels must be assigned to these advertisements. Within the WI CLASSIFICATION CHALLENGE, teams will compete using advanced modelling techniques to develop an efficient and robust automated solution for correctly assigning class labels.
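Purely as an illustration of the kind of baseline a team might start from, the sketch below trains a TF-IDF plus linear classifier with scikit-learn. The example texts and the occupation codes are invented placeholders; the real challenge dataset and occupational taxonomy are provided only to registered teams.

# Illustrative baseline for occupation classification of job ad texts.
# The texts and labels below are placeholders, not challenge data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Senior software developer, Python and cloud experience required",
    "Registered nurse for hospital night shifts",
]
labels = ["2512", "2221"]  # placeholder ISCO-style occupation codes

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["Backend engineer needed, experience with APIs"]))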
The second round of European Statistics Awards for Web Intelligence will begin in June 2024 with registrations open until 15 July 2024.
Timeline
The competition will begin on 1 June 2024 and will run for four months until 30 September 2024. The deadline for registration is 15 July 2024.
Awards
Accuracy Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Reusability Award
First Prize EUR 10 000
Second Prize EUR 5 000
Third Prize EUR 3 000
Innovativeness Award
First Prize EUR 5 000
Second Prize EUR 3 000
Third Prize EUR 1 000
Teams
Teams comprising a maximum of five individuals with diverse backgrounds and expertise in programming and web intelligence are eligible to participate in the competition. This contest presents an exceptional chance to apply your understanding of classification modelling in a real-world setting and potentially receive up to EUR 10 000 for developing the most accurate model. If your team secures the top spot for all three awards, you could earn up to EUR 25 000 in this round.
The Web Intelligence - Deduplication Challenge
The winners of the Web Intelligence - Deduplication Challenge have been announced.
As part of the European Statistics Awards Programme, this challenge aims to stimulate innovation in the area of Web Intelligence for European statistics, focusing on the identification of potential duplicate job postings on websites as a basic condition for producing high-quality statistics from online job advertisements.
Frequently asked questions - Discovery and Extraction Challenges
Are we allowed to share the MNE data that we received on 23 April?
The MNE data are not confidential, so yes, they may be shared and submitted to third-party systems.
Are we allowed to use LLMs?
The use of LLMs is allowed. However, a full description of the developed approach is needed in order to be eligible to compete for a prize.
Simply prompting an LLM is not considered development of an algorithm-based approach that automatically identifies public web sources of annual financial data of MNE Groups, nor of an algorithm-based approach that automatically extracts financial data of MNE Groups.
We want to just compete for the Accuracy Award. Can we skip sending in full documentation?
No, the description of your team’s developed approach is required in order to be eligible to receive any of the competition prizes.
There are multiple deadlines for Reusability and Innovativeness submissions. Which is the one that we need to keep?
The Reusability and Innovativeness documentation can be submitted at any phase of the competition, as long as the final deadline for this submission (30 June 2025) is met.
How many submissions can my team make before the deadline?
Your team is encouraged to make early submissions with dummy data in order to ensure that the required files are in the correct format and that no technical issues arise once you make your final submission.
Each submission made ‘overwrites’ the previous submission. However, please take care to thoroughly check your final submission, as the technical limitation is a maximum of 1 submission per UTC calendar day until the submission deadline.
Is a fully automated URL discovery (within the extraction code) mandatory for the EXTRACTION challenge?
For the EXTRACTION challenge, teams can use any URL irrespective of how the URL was identified (manually or automatically).
Automated URL identification is the goal of the DISCOVERY challenge. Therefore, if a team wants to make a submission for the Discovery challenge, the URLs must be identified in an automated way.
If the same team is participating in the EXTRACTION challenge, automated identification of the URLs is not a prerequisite. The URLs used for the EXTRACTION challenge can:
- be identified using the algorithms of the DISCOVERY challenge
- consist of URLs obtained by other means (URLs already on file with the team members, manually extracted URLs, etc.)
- or be a combination of both.
My team has made a submission and the submission has received only 1 point. Is there something wrong with my submission?
The submissions for the Discovery and Extraction Challenges will not be scored automatically during the submission phase. The scores will be calculated outside of the platform by the evaluation committee after the submission deadline.
The 1 point displayed on the right-hand side is a technical idiosyncrasy of the leaderboard and can be disregarded; it has no relation whatsoever to the score ultimately assigned to the team's submissions.
Frequently asked questions - Deduplication and Classification Challenges
Under which legal system is the NDA signed?
As a rule, EU law applies. The implementation of the terms of use shall be governed by Luxembourg law; the courts in Luxembourg shall have sole jurisdiction to hear any disputes.
In the event of a dispute (e.g. a breach of the NDA), the Commission can take action by filing a complaint or by reporting the breach to the police on the basis of national legislation.
Is it permissible to use the eTranslation tool from the European Commission?
The current NDA does not foresee such a possibility, and using eTranslation would mean losing control over the data, thus breaking the provisions of the NDA.
Moreover, the eTranslation service is not available to everybody, so it would disrupt the level playing field for the other competitors.
Does the question regarding data security issues and the use of the eTranslation tool extend to using other third-party APIs, for instance Google Translate or OpenAI?
Yes, it does. Sending the job advertisement text to third-party API servers makes it accessible to those third parties, which violates the terms of the NDA.
Considering the terms of the NDA, are teams expected to develop their solutions locally on their own machines, or does the possible restriction on third-party APIs extend to spinning up remote GPU machines for model training?
No, you don't have to develop your solution locally. You can use cloud infrastructure as long as access to the data is restricted to those who have signed the NDA. This means you are responsible for ensuring the security of the cloud resources you use. You must control who can access the data and ensure that data transmission between your local machines and the cloud is secure, such as through encryption.
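For example, one possible (not prescribed) way to protect the data in transit is to encrypt the dataset locally before uploading it to a cloud machine and to decrypt it only on machines covered by the NDA. The library choice and file names in the sketch below are illustrative assumptions, not organiser requirements.

# One possible approach (not prescribed by the organisers): symmetric
# encryption of the dataset before it is uploaded to cloud storage.
# File names are placeholders for this example.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # keep this key off shared or public cloud storage
fernet = Fernet(key)

with open("job_ads.csv", "rb") as f:
    encrypted = fernet.encrypt(f.read())

with open("job_ads.csv.enc", "wb") as f:
    f.write(encrypted)

# On the NDA-covered cloud machine, decrypt with the same key:
# original = Fernet(key).decrypt(open("job_ads.csv.enc", "rb").read())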
Our team has made 2 failed and 1 successful submission. The performance ranking states that we've made 3 submissions. Are we still able to make 9 more successful submissions?
Failed submissions DO NOT count towards the submission limits. Only valid, successful submissions are counted.
We will periodically make corrections on the performance ranking page to adjust for failed attempts.
Even if the performance ranking currently shows a total number of submissions that includes failed attempts, we will check the total number of VALID submissions during the evaluation phase and disregard all FAILED attempts, ensuring that each team is allowed 10 VALID submissions.