Methodology

Our investigation into global media coverage patterns combines multiple data sources to determine whether global media attention aligns with the severity of conflicts. We analyzed the large event database provided by the GDELT Project, cross-referenced with fatality data from ACLED's conflict database.

The methodology involves two key steps: data extraction, performed by downloading data from the ACLED website and querying the GDELT database through Google BigQuery; and data cleaning, carried out in Python notebooks with Pandas. Below, we detail the technical implementation of our data pipeline and analytical framework.

GDELT Data Extraction

To extract data from the GDELT database, we queried the dataset using SQL on Google BigQuery. The initial step was to generate a lookup table, as described in an article on the GDELT website.

lookup.sql
-- Extract lookup table for domain to country mapping
CREATE OR REPLACE TABLE gdelt-dataviz-project.gdelt_agg.domain_lookup AS
SELECT domain, topcountry.countrycode AS MediaCountryCode
FROM (
    SELECT 
        domain, 
        -- We take the strict Top 1 result from the array
        ARRAY_AGG(STRUCT(countrycode, cnt) ORDER BY cnt DESC LIMIT 1)[OFFSET(0)]
            AS topcountry 
    FROM (
        SELECT 
            -- OFFICIAL GDELT REGEX: Extracts the clean country code
            REGEXP_EXTRACT(location, r'^[1-5]#.*?#(.*?)#') AS countrycode, 
            -- Clean the domain
            IFNULL(NET.REG_DOMAIN(DocumentIdentifier), 'unknown') AS domain, 
            -- Exact Count
            COUNT(1) AS cnt 
        FROM `gdelt-bq.gdeltv2.gkg_partitioned`, 
            -- OFFICIAL GDELT UNNEST: Explodes the list of locations
            UNNEST(SPLIT(V2Locations, ';')) AS location 
        WHERE 
            LOWER(DocumentIdentifier) LIKE '%http%'                 
        GROUP BY countrycode, domain 
        HAVING cnt > 5 AND countrycode IS NOT NULL
    ) 
    GROUP BY domain
)

The query above produces a lookup table mapping media domains to their respective country codes, inferred from the country most frequently mentioned in each domain's articles. A few selected rows of the generated lookup table are shown below.

domain MediaCountryCode
aruba.com AA
dubaitravelblog.com AE
de24live.de GM
genovatoday.it IT
novamallorca.com SP
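
The country code in the lookup query is parsed from GDELT's V2Locations field, whose entries are `#`-delimited blocks (location type, full name, country code, and further fields). As a quick illustration of the same regex in Python, applied to a sample block we made up for this purpose:

```python
import re

# Same pattern as REGEXP_EXTRACT in the query: for a block whose location
# type is 1-5, capture the third #-delimited field (the country code)
PATTERN = re.compile(r'^[1-5]#.*?#(.*?)#')

# Illustrative V2Locations block: type#fullname#countrycode#adm1#lat#lon#featureid
sample = "1#United States#US#US#39.828175#-98.5795#US"

match = PATTERN.search(sample)
print(match.group(1))  # -> US
```

The lazy quantifiers behave the same way in BigQuery's RE2 engine and Python's `re`, so the sketch mirrors what the SQL extracts.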

After creating the domain-to-country lookup table, we queried the GDELT database to aggregate media mentions weekly. The primary key of our aggregation consists of the following dimensions: mention_week, conflict_country, actor1_country, actor2_country, media_country.

weekly_media_conflict_reports.sql
CREATE OR REPLACE TABLE gdelt-dataviz-project.gdelt_agg.weekly_media_conflict_reports AS
SELECT
  -- 1. TIME: Weekly Granularity
  DATE_TRUNC(
    DATE(PARSE_TIMESTAMP('%Y%m%d%H%M%S', CAST(m.MentionTimeDate AS STRING))), 
    WEEK(MONDAY)
  ) AS mention_week,

  -- 2. LOCATION: WHERE the conflict is happening
  e.ActionGeo_CountryCode AS conflict_country,

  -- 3. ACTORS: WHO is involved in the conflict
  IFNULL(e.Actor1CountryCode, 'UNK') AS actor1_country,
  IFNULL(e.Actor2CountryCode, 'UNK') AS actor2_country,

  -- 4. MEDIA SOURCE: WHERE the media is reporting from
  -- Uses the GDELT Lookup we created above
  IFNULL(dlc_lookup.MediaCountryCode, 'UNK') AS media_country,

  -- 5. METRICS (Aggregated)
  COUNT(*) AS mentions_count,
  COUNT(DISTINCT m.GLOBALEVENTID) AS distinct_events,
  COUNT(DISTINCT m.MentionSourceName) AS distinct_media_sources,

  -- 6. METRICS: CLASS PROFILE (Volume vs Diversity)
  -- Volume (How many articles?)
  COUNT(CASE WHEN e.QuadClass = 3 THEN 1 END)
  AS verbal_conflict_mentions,
  COUNT(CASE WHEN e.QuadClass = 4 THEN 1 END)
  AS material_conflict_mentions,
  -- Diversity (How many unique events?)
  COUNT(DISTINCT CASE WHEN e.QuadClass = 3 THEN m.GLOBALEVENTID END)
  AS verbal_conflict_unique_events,
  COUNT(DISTINCT CASE WHEN e.QuadClass = 4 THEN m.GLOBALEVENTID END)
  AS material_conflict_unique_events,

  -- 7. METRICS: ACTOR PROFILE
  -- STATE (Gov, Mil, Police, etc.)
  COUNT(CASE WHEN 
      e.Actor1Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') OR 
      e.Actor2Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') 
  THEN 1 END) AS state_mentions,
  
  COUNT(DISTINCT CASE WHEN 
      e.Actor1Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') OR 
      e.Actor2Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') 
  THEN m.GLOBALEVENTID END) AS state_unique_events,

  -- INSURGENTS (Rebels, Separatists, etc)
  COUNT(CASE WHEN 
      e.Actor1Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') OR 
      e.Actor2Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') 
  THEN 1 END) AS insurgents_mentions,

  COUNT(DISTINCT CASE WHEN 
      e.Actor1Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') OR 
      e.Actor2Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') 
  THEN m.GLOBALEVENTID END) AS insurgents_unique_events,

  -- CIVILIANS (usually the victims)
  COUNT(CASE WHEN 
      e.Actor1Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') OR 
      e.Actor2Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') 
  THEN 1 END) AS civilians_mentions,

  COUNT(DISTINCT CASE WHEN 
      e.Actor1Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') OR 
      e.Actor2Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') 
  THEN m.GLOBALEVENTID END) AS civilians_unique_events,

  -- INTERNATIONALS (NGO, IGO)
  COUNT(CASE WHEN 
      e.Actor1Type1Code IN ('IGO', 'NGO', 'MNC') OR 
      e.Actor2Type1Code IN ('IGO', 'NGO', 'MNC') 
  THEN 1 END) AS international_mentions,

  COUNT(DISTINCT CASE WHEN 
      e.Actor1Type1Code IN ('IGO', 'NGO', 'MNC') OR 
      e.Actor2Type1Code IN ('IGO', 'NGO', 'MNC') 
  THEN m.GLOBALEVENTID END) AS international_unique_events,

  -- Sentiment Analysis
  AVG(m.MentionDocTone) AS avg_tone,
  STDDEV(m.MentionDocTone) AS stddev_tone,
  APPROX_QUANTILES(m.MentionDocTone, 2)[OFFSET(1)] AS median_tone,

  -- Impact (Weighted by mentions)
  AVG(e.GoldsteinScale) AS avg_impact,
  STDDEV(e.GoldsteinScale) AS stddev_impact,
  APPROX_QUANTILES(e.GoldsteinScale, 2)[OFFSET(1)] AS median_impact,

  -- Most mentioned article
  ARRAY_AGG(
    STRUCT(m.MentionIdentifier, e.NumMentions) 
    ORDER BY e.NumMentions DESC LIMIT 1
  )[OFFSET(0)].MentionIdentifier AS top_event_article_url,

  -- Most impactful article
  ARRAY_AGG(
    STRUCT(m.MentionIdentifier, e.GoldsteinScale) 
    ORDER BY e.GoldsteinScale ASC LIMIT 1
  )[OFFSET(0)].MentionIdentifier AS most_impactful_event_article_url  


FROM `gdelt-bq.gdeltv2.eventmentions_partitioned` m
JOIN `gdelt-bq.gdeltv2.events_partitioned` e
  ON m.GLOBALEVENTID = e.GLOBALEVENTID

-- JOIN LOOKUP
LEFT JOIN `gdelt-dataviz-project.gdelt_agg.domain_lookup` dlc_lookup
  ON NET.REG_DOMAIN(m.MentionIdentifier) = dlc_lookup.domain

WHERE
  m._PARTITIONTIME >= TIMESTAMP('2015-01-01')
  AND e._PARTITIONTIME >= TIMESTAMP('2015-01-01')
  
-- We only consider material and verbal conflict events
  AND e.QuadClass in (3,4)

  -- Geographic Filter (Exclude events without country or generalist)
  AND e.ActionGeo_CountryCode IS NOT NULL

GROUP BY 1, 2, 3, 4, 5
HAVING mentions_count > 5

To focus our attention on conflicts only, we used the GDELT attribute QuadClass, which categorizes events into four broad classes: Verbal Cooperation (1), Material Cooperation (2), Verbal Conflict (3), and Material Conflict (4). We filtered the dataset to include only events classified as Verbal Conflict or Material Conflict, so that our insights specifically reflect media coverage of conflicts.
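
In pandas terms, the QuadClass filter applied in the query corresponds to keeping only rows whose class is 3 or 4. A small sketch with toy data (the real filter runs in BigQuery as `e.QuadClass IN (3, 4)`):

```python
import pandas as pd

# GDELT QuadClass codes
QUAD_CLASS = {
    1: "Verbal Cooperation",
    2: "Material Cooperation",
    3: "Verbal Conflict",
    4: "Material Conflict",
}

# Toy events table for illustration
events = pd.DataFrame({"GLOBALEVENTID": [10, 11, 12, 13],
                       "QuadClass": [1, 3, 4, 2]})

# Keep only conflict events and attach the human-readable label
conflicts = events[events["QuadClass"].isin([3, 4])].copy()
conflicts["label"] = conflicts["QuadClass"].map(QUAD_CLASS)
print(conflicts["label"].tolist())  # -> ['Verbal Conflict', 'Material Conflict']
```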

Below you can see a few selected rows of the resulting weekly aggregation of conflict-related media mentions.

mention_week conflict_country actor1_country actor2_country media_country mentions_count distinct_events distinct_media_sources verbal_conflict_mentions material_conflict_mentions verbal_conflict_unique_events material_conflict_unique_events state_mentions state_unique_events insurgents_mentions insurgents_unique_events civilians_mentions civilians_unique_events international_mentions international_unique_events avg_tone stddev_tone median_tone avg_impact stddev_impact median_impact top_event_article_url most_impactful_event_article_url
2026-01-12 UK UNK GBR UP 16 13 11 3 13 3 10 9 7 0 0 5 4 0 0 -2.48 2.58 -2.96 -6.34 3.29 -7.2 https://www.unian.ua/world/britaniya-hoche-finansuvati-ukrajinu-za-rahunok-rosiyskoji-nafti-the-times-13254819.html https://www.eurointegration.com.ua/news/2026/01/12/7228928/
2026-01-12 FR FRA USA US 101 28 71 41 60 11 17 17 5 0 0 44 2 0 0 -2.31 3.04 -0.34 -5.36 2.00 -5.0 https://www.stltoday.com/news/local/crime-courts/article_ee54c885-b7ea-4670-8669-bf76252d1fd1.html https://www.thenation.com/article/activism/puerto-rico-mothers-against-war-madres-contra-guerra/
2026-01-12 IS GBR UNK UK 8 8 5 6 2 6 2 1 1 1 1 3 3 0 0 -5.19 3.18 -4.31 -3.69 2.46 -2.0 https://www.cityam.com/londons-record-low-confidence-blamed-on-labour-workers-rights-bill/ https://www.thetimes.com/uk/politics/article/badenoch-condemns-police-protesters-israeli-restaurant-tp5fzbq6c
2026-01-12 UP EUR RUS UK 208 5 105 207 1 4 1 1 1 0 0 0 0 0 0 -4.748191 0.274187 -4.770318 -2.052885 0.591478 -2.0 https://www.andoveradvertiser.co.uk/news/national/25760295.us-accuses-russia-dangerous-inexplicable-escalation-war-ukraine/ https://www.thetimes.com/world/russia-ukraine-war/article/russia-ukraine-invasion-longer-soviet-union-ww2-involvement-g8gb2ljw5
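
Since the five dimensions listed earlier form the primary key of the aggregation, each combination should appear exactly once in the table. A toy pandas sanity check of that property (illustrative data, not the real extract):

```python
import pandas as pd

key_cols = ["mention_week", "conflict_country", "actor1_country",
            "actor2_country", "media_country"]

# Toy rows mimicking the aggregated table
df = pd.DataFrame([
    ("2026-01-12", "UK", "UNK", "GBR", "UP"),
    ("2026-01-12", "FR", "FRA", "USA", "US"),
    ("2026-01-12", "IS", "GBR", "UNK", "UK"),
], columns=key_cols)

# The key is valid when no combination of the five columns repeats
is_unique = not df.duplicated(subset=key_cols).any()
print(is_unique)  # -> True
```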

GDELT Data Management

The dataset we collected from GDELT consists of 7 802 681 rows and 28 columns, for a total size of 2.53 GB. To store this large volume of data efficiently on a local machine, we used the Parquet file format, known for its efficient columnar storage and retrieval. We also split the dataset into smaller, more manageable normalized tables, as shown in the code snippet below.

dataset_creation.py
import pandas as pd

# Load the raw CSV exported from BigQuery and save a Parquet copy
path_mentions = "../data/GDELT/conflict_data_final.csv"
df_mentions = pd.read_csv(path_mentions)
df_mentions.to_parquet("../data/GDELT/conflict_data_final.parquet", index=False)

# Build a surrogate key (conflict_id) from the five primary-key dimensions
key_cols = ['mention_week', 'conflict_country', 'actor1_country', 'actor2_country', 'media_country']
df_keys = df_mentions[key_cols].drop_duplicates().reset_index(drop=True)
df_keys['conflict_id'] = df_keys.index
df_mentions = df_mentions.merge(df_keys, on=key_cols, how='left')

# Now create normalized tables:
save_path = "../data/GDELT/"

# 1. Dimension table (the keys + ID)
df_dimensions = df_mentions[['conflict_id'] + key_cols].drop_duplicates()
df_dimensions.to_parquet(save_path + "gdelt_keys.parquet", index=False)

# 2. Volume metrics (counts and mentions)
df_volume = df_mentions[['conflict_id', 'mentions_count', 'distinct_events', 'distinct_media_sources']]
df_volume.to_parquet(save_path + "gdelt_volume_metrics.parquet", index=False)

# 3. Conflict type metrics (verbal vs material)
df_conflict_types = df_mentions[['conflict_id', 
                         'verbal_conflict_mentions', 'material_conflict_mentions',
                         'verbal_conflict_unique_events', 'material_conflict_unique_events']]
df_conflict_types.to_parquet(save_path + "gdelt_conflict_types.parquet", index=False)

# 4. Actor type metrics (state, insurgents, civilians, international)
df_actors = df_mentions[['conflict_id',
                 'state_mentions', 'state_unique_events',
                 'insurgents_mentions', 'insurgents_unique_events',
                 'civilians_mentions', 'civilians_unique_events',
                 'international_mentions', 'international_unique_events']]
df_actors.to_parquet(save_path + "gdelt_actor_metrics.parquet", index=False)

# 5. Sentiment/tone metrics
df_sentiment = df_mentions[['conflict_id', 
                    'avg_tone', 'stddev_tone', 'median_tone']]
df_sentiment.to_parquet(save_path + "gdelt_sentiment.parquet", index=False)

# 6. Impact metrics
df_impact = df_mentions[['conflict_id',
                 'avg_impact', 'stddev_impact', 'median_impact']]
df_impact.to_parquet(save_path + "gdelt_impact.parquet", index=False)

# 7. Article references (URLs)
df_articles = df_mentions[['conflict_id', 
                   'top_event_article_url', 'most_impactful_event_article_url']]
df_articles.to_parquet(save_path + "gdelt_articles.parquet", index=False)

print("Files created:")
print(f"- gdelt_keys.parquet ({df_dimensions.shape})")
print(f"- gdelt_volume_metrics.parquet ({df_volume.shape})")
print(f"- gdelt_conflict_types.parquet ({df_conflict_types.shape})")
print(f"- gdelt_actor_metrics.parquet ({df_actors.shape})")
print(f"- gdelt_sentiment.parquet ({df_sentiment.shape})")
print(f"- gdelt_impact.parquet ({df_impact.shape})")
print(f"- gdelt_articles.parquet ({df_articles.shape})")
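
The normalized tables can later be recombined by merging on conflict_id. A minimal in-memory sketch with toy frames (the real code would read the Parquet files instead):

```python
import pandas as pd

# Toy stand-ins for gdelt_keys.parquet and gdelt_volume_metrics.parquet
df_dimensions = pd.DataFrame({
    "conflict_id": [0, 1],
    "conflict_country": ["UK", "FR"],
    "media_country": ["UP", "US"],
})
df_volume = pd.DataFrame({
    "conflict_id": [0, 1],
    "mentions_count": [16, 101],
})

# One-to-one merge on the surrogate key restores the wide layout
df_full = df_dimensions.merge(df_volume, on="conflict_id", how="left")
print(df_full.shape)  # -> (2, 4)
```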

WHERE TO FIND THE DATA

We processed the data and stored it in Parquet format, but the resulting files were still too large to upload to our GitHub repository. You can instead access the normalized GDELT dataset files directly from our Kaggle folder.

ACLED Data

The ACLED data we used can be downloaded from the ACLED website after registering.

Data Cleaning and Imputation

For both datasets we chose to leave missing values untouched, as they were few in number and did not affect our analysis. Most of the data were already in a clean, usable format, so no significant cleaning or imputation was needed.
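
This decision was backed by a quick per-column count of missing values; a sketch of the kind of check we ran, on toy data for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame with a single missing tone value
df = pd.DataFrame({
    "mentions_count": [16, 101, 8],
    "avg_tone": [-2.48, np.nan, -5.19],
})

# Count and share of missing values per column
missing = df.isna().sum()
share = df.isna().mean()
print(missing["avg_tone"], round(share["avg_tone"], 2))  # -> 1 0.33
```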

We also followed the conventions already adopted by our source datasets: the list of countries includes some entities that are territories rather than sovereign states; for example, Greenland is treated separately from Denmark.

Data Processing and Analysis

For the data processing and analysis we mainly used Python with libraries such as Pandas, NumPy and Matplotlib to manipulate the data and create preliminary visualizations.

Limitations

Starting with the GDELT dataset, we had to handle a large volume of data, which we reduced by selecting only the most relevant columns and aggregating at a weekly level. Moreover, GDELT is based on media reports, a large share of which come from the United States; this may bias the data, as it may not fully represent global media coverage of armed conflicts.

As for the ACLED dataset, it relies on reports from local sources, which may not be as comprehensive or accurate as those from international media outlets. Moreover, ACLED focuses on political violence and protests, which may not capture all types of armed conflict.

Plots

After the data processing pipeline, the final plots we produced are:

  1. Waffle Chart
  2. Bar Chart (Normal and Stacked)
  3. Scatter Plot (With Connected Scatter Plot)
  4. Choropleth Map
  5. Chord Diagram
  6. Cumulative Bar Chart
  7. Line Chart x2
  8. Violin Plot/Box Plot x2
  9. Word Cloud