Data Journalism in the AI era: Analysis, ethics, and human oversight

The images illustrating this article come from the project Better Images of AI, which aims to highlight that the images commonly used today to depict AI often misrepresent the technology, reinforce harmful stereotypes, and spread misleading cultural ideas.

The translation of this text was assisted by generative AI.

By Hassel Fallas, Journalist and data analyst

Since the second decade of this century, data analysis journalism has become an essential tool for investigating and revealing information of high public interest. In this process, artificial intelligence (AI) algorithms play an increasingly prominent role, facilitating the analysis of large volumes of data to correlate information and find relevant patterns that improve the quality of inputs and key evidence to support investigations.

What steps should be taken to achieve these objectives? This article compiles the ideas I presented during the forum "AI and Confusion: Journalism in the Age of Algorithms," beginning with access to data sources and the cleaning and structuring protocols used to train AI models for journalism, without forgetting the most relevant aspect: the need for human oversight and ethical considerations.

The Genesis of the Process

The information used for data analysis journalism and training AI models can come from the same sources. Some examples include:

  • Access to Public Information: Formal requests to government institutions.
  • Data Scraping: Extracting data from websites.
  • Open Data: Government and international organization portals.
  • APIs: Interfaces that allow the mass and systematic extraction of updateable data.
  • Social Media: Data collected from the APIs of platforms like Twitter, Instagram, and Facebook.
  • Surveys and Sensors: Data collected from field studies and monitoring devices.
  • Images and Geospatial Data: Visual and location information.
  • Manual Database Construction: Extracting data from news.
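
As an illustration of the API route listed above, here is a minimal sketch of extracting records from an API response. The JSON payload and field names are invented for illustration; a real open-data endpoint would define its own schema.

```python
import json

# Hypothetical JSON payload, of the kind an open-data API might return
# (institution names and figures are invented for illustration).
sample_response = """
{
  "results": [
    {"institution": "Ministry of Health", "budget": 1200000, "year": 2023},
    {"institution": "Ministry of Education", "budget": 980000, "year": 2023}
  ]
}
"""

def extract_records(raw_json: str) -> list:
    """Parse an API response and keep only the fields needed for analysis."""
    payload = json.loads(raw_json)
    return [
        {"institution": r["institution"], "budget": r["budget"]}
        for r in payload["results"]
    ]

records = extract_records(sample_response)
print(records[0]["institution"])  # Ministry of Health
```

In practice this parsing step sits behind a scheduled request to the API, which is what makes the extraction "mass and systematic," as described above.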

When using this information as a source for journalistic data analysis, it is crucial to understand the data thoroughly, including its metadata (field definitions, units, collection methods), to avoid misinterpretations.

These data sources must also undergo strict cleaning before being used for analysis. Data must be cleaned, standardized, and appropriately structured before using any software to perform mathematical and statistical analyses.
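
A toy sketch of the kind of cleaning just described, using invented records with typical defects: inconsistent capitalization, stray whitespace, dates stored as text, and amounts stored as strings.

```python
from datetime import datetime

# Invented raw records with the kinds of defects described above.
raw = [
    {"institution": "  ministry of health ", "date": "2023-05-01", "amount": "1200"},
    {"institution": "MINISTRY OF HEALTH", "date": "2023-06-15", "amount": "980"},
]

def clean(record: dict) -> dict:
    """Standardize names, parse text dates, and convert amounts to numbers."""
    return {
        "institution": record["institution"].strip().title(),
        "date": datetime.strptime(record["date"], "%Y-%m-%d").date(),
        "amount": float(record["amount"]),
    }

cleaned = [clean(r) for r in raw]
print(cleaned[0])
```

After this step, both records refer to the same institution in the same format, so grouping and counting them produces correct totals.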

This process is also a core part of training AI-based solutions, but other steps need to be added.

Before explaining those steps, it is important to note that artificial intelligence (AI) is a technology that tries to make computers act intelligently, as a human would. To act with human-like intelligence, computers need algorithms. Algorithms are sets of rules and procedures that guide computers to perform tasks such as pattern recognition, learning from data, and understanding and interpreting natural language.

For a machine to think and solve problems, human intervention is always indispensable, as people provide computers with the raw materials to generate alternatives for solving a problem and supervise the quality of the solution they offer.

Training an AI model for journalistic use

Training an artificial intelligence (AI) model for data journalism is a process that requires time, meticulousness, and precision.

Data science journalists must follow a series of structured steps to ensure that the results are as reliable as possible. Here is a summary of them:

  1. Clearly define the problem to be addressed with the help of AI.
  2. Gather a sufficiently large, representative, and reliable amount of data for the AI to learn from and reduce the risk of biases.
  3. Clean the data: decide how to handle missing values (imputing or discarding them) and fix data formats (e.g., dates stored as text in the database). In data science there is a maxim, «garbage in, garbage out»: entering poor-quality data produces unreliable results. The collected information must be as precise as possible; otherwise, the analysis will not be reliable.
  4. Divide the data into training and testing sets. Determine what percentage of the data will be used for training and what percentage for testing the model. Separating the data into training and testing sets helps determine if the model’s outcome is replicable with other new data sets.
  5. Choose an appropriate model and algorithm for the problem. For example, if the problem is determining the value of a property based on characteristics such as size, location, and the number of rooms, linear regression can be applied. If the problem is determining the probability of a disease based on clinical history data, logistic regression can be used.
  6. Adjust the parameters the algorithm must follow for learning.
  7. Train the model.
  8. Review the model’s results.
  9. Readjust the parameters for validation and adjustment.
  10. Implementation and monitoring.
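
Steps 4 through 8 can be sketched in a few lines, using the property-price example from step 5. The data here is synthetic (generated with a known relationship plus noise), standing in for the real records a newsroom would collect; the fit uses ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic example: property size (m²) vs. price, with noise,
# standing in for real data (invented for illustration).
size = rng.uniform(50, 200, 100)
price = 1000 * size + rng.normal(0, 5000, 100)

# Step 4: divide the data into training (80%) and testing (20%) sets.
split = 80
X_train, X_test = size[:split], size[split:]
y_train, y_test = price[:split], price[split:]

# Steps 5-7: choose linear regression and train it (least-squares fit).
A = np.column_stack([X_train, np.ones_like(X_train)])
(coef, intercept), *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Step 8: review the model's results on the held-out test set.
pred = coef * X_test + intercept
mae = np.mean(np.abs(pred - y_test))
print(f"price ~ {coef:.0f} * m2 + {intercept:.0f}, test MAE = {mae:.0f}")
```

Because the test set was never seen during training, the test error indicates whether the model's result is replicable on new data, which is the point of the split in step 4.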

To clarify this process further, a demonstrative video is provided below, illustrating each of these steps in practice.

Human oversight and ethical considerations

The use of artificial intelligence (AI) in data journalism offers numerous advantages, but it also poses significant ethical challenges and the need for constant human oversight. As AI algorithms and generative AI become increasingly integrated into journalistic practices, it is essential to address key issues such as accuracy, transparency, human judgment, and the principle of doing no harm, as detailed in the following precautions.

Accuracy and bias mitigation

Data accuracy is vital to avoid biases and ensure representative and inclusive results. AI algorithms can inherit biases from the data they are trained on, which can result in unbalanced and potentially unfair reports, especially in sensitive areas such as social justice, politics, and health. Therefore, it is crucial for interdisciplinary teams to evaluate AI model results and implement design and maintenance practices that identify and mitigate biases, such as careful selection of data sets and regular review of algorithm performance.
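
One concrete, if simplified, form of the regular review mentioned above is to disaggregate a model's accuracy by group rather than reporting only the overall figure. The predictions below are invented for illustration; in a real audit they would come from the model being evaluated.

```python
from collections import defaultdict

# Invented classifier outcomes, each tagged with a demographic group.
results = [
    {"group": "A", "correct": True},
    {"group": "A", "correct": True},
    {"group": "A", "correct": False},
    {"group": "B", "correct": True},
    {"group": "B", "correct": False},
    {"group": "B", "correct": False},
]

def accuracy_by_group(rows):
    """Compute accuracy separately for each group to surface disparities."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r["group"]] += 1
        hits[r["group"]] += r["correct"]
    return {g: hits[g] / totals[g] for g in totals}

print(accuracy_by_group(results))  # a large gap between groups flags possible bias
```

An overall accuracy of 50% would hide the fact that the model is twice as accurate for group A as for group B; the disaggregated view makes the imbalance visible.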

Principle of “Do no harm”

Adhering to the “do no harm” principle is crucial when using data to train AI solutions. This principle seeks to ensure that no one suffers greater harm due to the implementation of AI. For example, publishing the exact coordinates of endangered animals could facilitate their illegal hunting. It is also essential to protect data privacy and security in both the public and private sectors. Measures must be implemented to ensure that sensitive data is protected and used ethically and responsibly.
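
A small illustration of one such precaution, using arbitrary coordinates: reducing the precision of location data before publication, so a story can map a sighting at the regional level without pinpointing the exact site.

```python
def blur_coordinates(lat: float, lon: float, decimals: int = 1):
    """Round coordinates before publication: one decimal of latitude/longitude
    corresponds to roughly 11 km, enough to obscure an exact location."""
    return round(lat, decimals), round(lon, decimals)

# Hypothetical sighting of an endangered species (coordinates invented).
print(blur_coordinates(9.93862, -84.10714))  # (9.9, -84.1)
```

The same idea applies to people: addresses, license plates, and other identifying details in a dataset should be generalized or removed before the data is published or fed to a model.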


Transparency in the use of AI

Maintaining transparency in the use of AI is crucial for preserving public trust. It is important for news organizations to clearly disclose when and how AI systems are used in the creation of data analyses and content. This includes explaining the algorithms used and the data sources so that the public can assess the credibility of the reports.

In the case of generative AI, it is relevant to inform if the analyses and texts are being created – partly or entirely – with this technology.

Currently, some news organizations prohibit the use of generative AI for these purposes, while others endorse it under strict editorial supervision and fact-checking, a key aspect since it is well known that this technology is as prone to errors as any human.

Loss of human journalistic judgment

Finally, a consideration is that when generative AI assumes most of the data analysis and reporting tasks, there is a great risk of losing human journalistic judgment, which incorporates empathy with people and a deep understanding of social and cultural contexts that algorithms cannot fully replicate.

It is crucial for journalists to continue playing an active role in interpreting and presenting the data analyzed by AI. Journalism must remain journalism. As I once wrote in another article: «Journalism will continue to be a matter of wearing out one’s saliva, neurons, and shoe soles. It’s about going to the places and sources to confront numbers with reality.»
