Methodology
Our investigation into global media coverage patterns combines multiple data sources to examine whether global media attention aligns with the severity of conflicts. We analyzed the large event database provided by the GDELT Project, cross-referenced with fatality data from ACLED's conflict database.
The methodology involves two key steps: data extraction, by downloading data from the ACLED website and querying the GDELT database through Google BigQuery; and data cleaning, using Python notebooks and Pandas. Below, we detail the technical implementation of our data pipeline and analytical framework.
GDELT Data Extraction
To extract and analyze data from the GDELT database, we queried the dataset using SQL on Google BigQuery. The initial step was to generate a lookup table, as described in an article on the GDELT website.
```sql
-- Extract lookup table for domain-to-country mapping
CREATE OR REPLACE TABLE `gdelt-dataviz-project.gdelt_agg.domain_lookup` AS
SELECT domain, topcountry.countrycode AS MediaCountryCode
FROM (
  SELECT
    domain,
    -- Take the strict top-1 country from the array
    ARRAY_AGG(STRUCT(countrycode, cnt) ORDER BY cnt DESC LIMIT 1)[OFFSET(0)]
      AS topcountry
  FROM (
    SELECT
      -- Official GDELT regex: extracts the clean country code
      REGEXP_EXTRACT(location, r'^[1-5]#.*?#(.*?)#') AS countrycode,
      -- Clean the domain
      IFNULL(NET.REG_DOMAIN(DocumentIdentifier), 'unknown') AS domain,
      -- Exact count
      COUNT(1) AS cnt
    FROM `gdelt-bq.gdeltv2.gkg_partitioned`,
      -- Official GDELT UNNEST: explodes the list of locations
      UNNEST(SPLIT(V2Locations, ';')) AS location
    WHERE
      LOWER(DocumentIdentifier) LIKE '%http%'
    GROUP BY countrycode, domain
    HAVING cnt > 5 AND countrycode IS NOT NULL
  )
  GROUP BY domain
)
```
The query above returned a lookup table mapping media domains to country codes based on the geographic distribution of their articles. Below are a few selected rows of the generated lookup table.
| domain | MediaCountryCode |
|---|---|
| aruba.com | AA |
| dubaitravelblog.com | AE |
| de24live.de | GM |
| genovatoday.it | IT |
| novamallorca.com | SP |
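The lookup table can then be applied to any article URL by extracting its registered domain and joining against the table. The sketch below illustrates the idea locally with Pandas; the `reg_domain` helper is a hypothetical, simplified stand-in for BigQuery's `NET.REG_DOMAIN` (it keeps only the last two hostname labels and does not handle multi-part TLDs such as `.co.uk`), and the lookup rows are taken from the sample above.

```python
from urllib.parse import urlparse

import pandas as pd

# Toy lookup frame mirroring the BigQuery table (rows from the sample above)
lookup = pd.DataFrame({
    "domain": ["genovatoday.it", "dubaitravelblog.com"],
    "MediaCountryCode": ["IT", "AE"],
})

def reg_domain(url: str) -> str:
    """Rough stand-in for BigQuery's NET.REG_DOMAIN: keeps the last two
    labels of the hostname (does NOT handle multi-part TLDs like .co.uk)."""
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:]) if host else "unknown"

# Hypothetical article URLs to classify
urls = pd.Series([
    "https://www.genovatoday.it/cronaca/some-article.html",
    "https://dubaitravelblog.com/post/123",
])

# Map URL -> registered domain -> media country, defaulting to 'UNK'
mapping = lookup.set_index("domain")["MediaCountryCode"]
media_country = urls.map(reg_domain).map(mapping).fillna("UNK")
print(media_country.tolist())  # ['IT', 'AE']
```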
After creating the domain-to-country lookup table, we queried the GDELT database to aggregate media mentions weekly. The primary key of our aggregation consists of the following dimensions: mention_week, conflict_country, actor1_country, actor2_country, media_country.
```sql
CREATE OR REPLACE TABLE `gdelt-dataviz-project.gdelt_agg.weekly_media_conflict_reports` AS
SELECT
  -- 1. TIME: weekly granularity
  DATE_TRUNC(
    DATE(PARSE_TIMESTAMP('%Y%m%d%H%M%S', CAST(m.MentionTimeDate AS STRING))),
    WEEK(MONDAY)
  ) AS mention_week,
  -- 2. LOCATION: where the conflict is happening
  e.ActionGeo_CountryCode AS conflict_country,
  -- 3. ACTORS: who is involved in the conflict
  IFNULL(e.Actor1CountryCode, 'UNK') AS actor1_country,
  IFNULL(e.Actor2CountryCode, 'UNK') AS actor2_country,
  -- 4. MEDIA SOURCE: where the media is reporting from
  --    (uses the domain lookup table created above)
  IFNULL(dlc_lookup.MediaCountryCode, 'UNK') AS media_country,
  -- 5. METRICS (aggregated)
  COUNT(*) AS mentions_count,
  COUNT(DISTINCT m.GLOBALEVENTID) AS distinct_events,
  COUNT(DISTINCT m.MentionSourceName) AS distinct_media_sources,
  -- 6. METRICS: class profile (volume vs diversity)
  -- Volume (how many articles?)
  COUNT(CASE WHEN e.QuadClass = 3 THEN 1 END) AS verbal_conflict_mentions,
  COUNT(CASE WHEN e.QuadClass = 4 THEN 1 END) AS material_conflict_mentions,
  -- Diversity (how many unique events?)
  COUNT(DISTINCT CASE WHEN e.QuadClass = 3 THEN m.GLOBALEVENTID END)
    AS verbal_conflict_unique_events,
  COUNT(DISTINCT CASE WHEN e.QuadClass = 4 THEN m.GLOBALEVENTID END)
    AS material_conflict_unique_events,
  -- 7. METRICS: actor profile
  -- STATE (government, military, police, etc.)
  COUNT(CASE WHEN
    e.Actor1Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') OR
    e.Actor2Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY')
  THEN 1 END) AS state_mentions,
  COUNT(DISTINCT CASE WHEN
    e.Actor1Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY') OR
    e.Actor2Type1Code IN ('GOV', 'MIL', 'COP', 'JUD', 'LEG', 'SPY')
  THEN m.GLOBALEVENTID END) AS state_unique_events,
  -- INSURGENTS (rebels, separatists, etc.)
  COUNT(CASE WHEN
    e.Actor1Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') OR
    e.Actor2Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM')
  THEN 1 END) AS insurgents_mentions,
  COUNT(DISTINCT CASE WHEN
    e.Actor1Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM') OR
    e.Actor2Type1Code IN ('REB', 'INS', 'SEP', 'OPP', 'CRM')
  THEN m.GLOBALEVENTID END) AS insurgents_unique_events,
  -- CIVILIANS (usually the victims)
  COUNT(CASE WHEN
    e.Actor1Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') OR
    e.Actor2Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR')
  THEN 1 END) AS civilians_mentions,
  COUNT(DISTINCT CASE WHEN
    e.Actor1Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR') OR
    e.Actor2Type1Code IN ('CVL', 'REF', 'ELI', 'BUS', 'EDU', 'LAB', 'MED', 'REL', 'AGR')
  THEN m.GLOBALEVENTID END) AS civilians_unique_events,
  -- INTERNATIONALS (NGOs, IGOs, multinationals)
  COUNT(CASE WHEN
    e.Actor1Type1Code IN ('IGO', 'NGO', 'MNC') OR
    e.Actor2Type1Code IN ('IGO', 'NGO', 'MNC')
  THEN 1 END) AS international_mentions,
  COUNT(DISTINCT CASE WHEN
    e.Actor1Type1Code IN ('IGO', 'NGO', 'MNC') OR
    e.Actor2Type1Code IN ('IGO', 'NGO', 'MNC')
  THEN m.GLOBALEVENTID END) AS international_unique_events,
  -- Sentiment analysis
  AVG(m.MentionDocTone) AS avg_tone,
  STDDEV(m.MentionDocTone) AS stddev_tone,
  APPROX_QUANTILES(m.MentionDocTone, 2)[OFFSET(1)] AS median_tone,
  -- Impact (Goldstein scale)
  AVG(e.GoldsteinScale) AS avg_impact,
  STDDEV(e.GoldsteinScale) AS stddev_impact,
  APPROX_QUANTILES(e.GoldsteinScale, 2)[OFFSET(1)] AS median_impact,
  -- Most mentioned article
  ARRAY_AGG(
    STRUCT(m.MentionIdentifier, e.NumMentions)
    ORDER BY e.NumMentions DESC LIMIT 1
  )[OFFSET(0)].MentionIdentifier AS top_event_article_url,
  -- Most impactful article (lowest Goldstein score = most destabilizing)
  ARRAY_AGG(
    STRUCT(m.MentionIdentifier, e.GoldsteinScale)
    ORDER BY e.GoldsteinScale ASC LIMIT 1
  )[OFFSET(0)].MentionIdentifier AS most_impactful_event_article_url
FROM `gdelt-bq.gdeltv2.eventmentions_partitioned` m
JOIN `gdelt-bq.gdeltv2.events_partitioned` e
  ON m.GLOBALEVENTID = e.GLOBALEVENTID
-- Join the domain-to-country lookup created above
LEFT JOIN `gdelt-dataviz-project.gdelt_agg.domain_lookup` dlc_lookup
  ON NET.REG_DOMAIN(m.MentionIdentifier) = dlc_lookup.domain
WHERE
  m._PARTITIONTIME >= TIMESTAMP('2015-01-01')
  AND e._PARTITIONTIME >= TIMESTAMP('2015-01-01')
  -- We only consider verbal and material conflict events
  AND e.QuadClass IN (3, 4)
  -- Geographic filter (exclude events without a country)
  AND e.ActionGeo_CountryCode IS NOT NULL
GROUP BY 1, 2, 3, 4, 5
HAVING mentions_count > 5
```
To focus our analysis on conflict alone, we used GDELT's QuadClass attribute, which categorizes events into four broad classes: Verbal Cooperation (1), Material Cooperation (2), Verbal Conflict (3), and Material Conflict (4). We filtered the dataset to include only events classified as Verbal Conflict or Material Conflict, ensuring that our insights specifically reflect media coverage of conflicts.
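The same QuadClass filter can be expressed in a few lines of Pandas. The frame below is a toy stand-in for the GDELT events table, kept only to illustrate the filtering logic; the class codes are those from the GDELT codebook.

```python
import pandas as pd

# GDELT QuadClass codes, as documented in the GDELT event codebook
QUAD_LABELS = {1: "Verbal Cooperation", 2: "Material Cooperation",
               3: "Verbal Conflict", 4: "Material Conflict"}

# Toy stand-in for gdelt-bq.gdeltv2.events_partitioned
events = pd.DataFrame({"GLOBALEVENTID": [10, 11, 12, 13],
                       "QuadClass": [1, 3, 4, 2]})

# Keep only verbal (3) and material (4) conflict events
conflicts = events[events["QuadClass"].isin([3, 4])].copy()
conflicts["label"] = conflicts["QuadClass"].map(QUAD_LABELS)
print(conflicts["label"].tolist())  # ['Verbal Conflict', 'Material Conflict']
```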
Below are a few selected rows of the resulting weekly aggregated media mentions related to conflicts.
| mention_week | conflict_country | actor1_country | actor2_country | media_country | mentions_count | distinct_events | distinct_media_sources | verbal_conflict_mentions | material_conflict_mentions | verbal_conflict_unique_events | material_conflict_unique_events | state_mentions | state_unique_events | insurgents_mentions | insurgents_unique_events | civilians_mentions | civilians_unique_events | international_mentions | international_unique_events | avg_tone | stddev_tone | median_tone | avg_impact | stddev_impact | median_impact | top_event_article_url | most_impactful_event_article_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026-01-12 | UK | UNK | GBR | UP | 16 | 13 | 11 | 3 | 13 | 3 | 10 | 9 | 7 | 0 | 0 | 5 | 4 | 0 | 0 | -2.48 | 2.58 | -2.96 | -6.34 | 3.29 | -7.2 | https://www.unian.ua/world/britaniya-hoche-finansuvati-ukrajinu-za-rahunok-rosiyskoji-nafti-the-times-13254819.html | https://www.eurointegration.com.ua/news/2026/01/12/7228928/ |
| 2026-01-12 | FR | FRA | USA | US | 101 | 28 | 71 | 41 | 60 | 11 | 17 | 17 | 5 | 0 | 0 | 44 | 2 | 0 | 0 | -2.31 | 3.04 | -0.34 | -5.36 | 2.00 | -5.0 | https://www.stltoday.com/news/local/crime-courts/article_ee54c885-b7ea-4670-8669-bf76252d1fd1.html | https://www.thenation.com/article/activism/puerto-rico-mothers-against-war-madres-contra-guerra/ |
| 2026-01-12 | IS | GBR | UNK | UK | 8 | 8 | 5 | 6 | 2 | 6 | 2 | 1 | 1 | 1 | 1 | 3 | 3 | 0 | 0 | -5.19 | 3.18 | -4.31 | -3.69 | 2.46 | -2.0 | https://www.cityam.com/londons-record-low-confidence-blamed-on-labour-workers-rights-bill/ | https://www.thetimes.com/uk/politics/article/badenoch-condemns-police-protesters-israeli-restaurant-tp5fzbq6c |
| 2026-01-12 | UP | EUR | RUS | UK | 208 | 5 | 105 | 207 | 1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -4.748191 | 0.274187 | -4.770318 | -2.052885 | 0.591478 | -2.0 | https://www.andoveradvertiser.co.uk/news/national/25760295.us-accuses-russia-dangerous-inexplicable-escalation-war-ukraine/ | https://www.thetimes.com/world/russia-ukraine-war/article/russia-ukraine-invasion-longer-soviet-union-ww2-involvement-g8gb2ljw5 |
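Once in this shape, the aggregated columns support simple derived metrics. As a sketch, the snippet below computes the share of mentions describing material (physical) conflict for each row, using values taken from the sample rows above; the `material_share` column name is our own, not part of the aggregated table.

```python
import pandas as pd

# Toy slice of the weekly aggregation (values from the sample rows above)
weekly = pd.DataFrame({
    "conflict_country": ["UK", "FR"],
    "mentions_count": [16, 101],
    "verbal_conflict_mentions": [3, 41],
    "material_conflict_mentions": [13, 60],
})

# Fraction of all mentions that describe material (physical) conflict
weekly["material_share"] = (weekly["material_conflict_mentions"]
                            / weekly["mentions_count"])
print(weekly["material_share"].round(4).tolist())  # [0.8125, 0.5941]
```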
GDELT Data Management
The dataset we collected from GDELT consists of 7 802 681 rows and 28 columns, for a total size of 2.53 GB. To store this volume of data efficiently on our machines, we used the Parquet file format, known for its efficient columnar storage and retrieval. We also split the dataset into smaller, more manageable normalized tables, as described in the code snippet below.
```python
import pandas as pd

path_mentions = "../data/GDELT/conflict_data_final.csv"
df_mentions = pd.read_csv(path_mentions)
df_mentions.to_parquet("../data/GDELT/conflict_data_final.parquet", index=False)

# Build a surrogate key (conflict_id) from the aggregation dimensions
key_cols = ['mention_week', 'conflict_country', 'actor1_country',
            'actor2_country', 'media_country']
df_keys = df_mentions[key_cols].drop_duplicates().reset_index(drop=True)
df_keys['conflict_id'] = df_keys.index
df_mentions = df_mentions.merge(df_keys, on=key_cols, how='left')

# Now create the normalized tables:
save_path = "../data/GDELT/"

# 1. Dimension table (the keys + ID)
df_dimensions = df_mentions[['conflict_id'] + key_cols].drop_duplicates()
df_dimensions.to_parquet(save_path + "gdelt_keys.parquet", index=False)

# 2. Volume metrics (counts and mentions)
df_volume = df_mentions[['conflict_id', 'mentions_count',
                         'distinct_events', 'distinct_media_sources']]
df_volume.to_parquet(save_path + "gdelt_volume_metrics.parquet", index=False)

# 3. Conflict type metrics (verbal vs material)
df_conflict_types = df_mentions[['conflict_id',
    'verbal_conflict_mentions', 'material_conflict_mentions',
    'verbal_conflict_unique_events', 'material_conflict_unique_events']]
df_conflict_types.to_parquet(save_path + "gdelt_conflict_types.parquet", index=False)

# 4. Actor type metrics (state, insurgents, civilians, international)
df_actors = df_mentions[['conflict_id',
    'state_mentions', 'state_unique_events',
    'insurgents_mentions', 'insurgents_unique_events',
    'civilians_mentions', 'civilians_unique_events',
    'international_mentions', 'international_unique_events']]
df_actors.to_parquet(save_path + "gdelt_actor_metrics.parquet", index=False)

# 5. Sentiment/tone metrics
df_sentiment = df_mentions[['conflict_id',
    'avg_tone', 'stddev_tone', 'median_tone']]
df_sentiment.to_parquet(save_path + "gdelt_sentiment.parquet", index=False)

# 6. Impact metrics
df_impact = df_mentions[['conflict_id',
    'avg_impact', 'stddev_impact', 'median_impact']]
df_impact.to_parquet(save_path + "gdelt_impact.parquet", index=False)

# 7. Article references (URLs)
df_articles = df_mentions[['conflict_id',
    'top_event_article_url', 'most_impactful_event_article_url']]
df_articles.to_parquet(save_path + "gdelt_articles.parquet", index=False)

print("Files created:")
print(f"- gdelt_keys.parquet ({df_dimensions.shape})")
print(f"- gdelt_volume_metrics.parquet ({df_volume.shape})")
print(f"- gdelt_conflict_types.parquet ({df_conflict_types.shape})")
print(f"- gdelt_actor_metrics.parquet ({df_actors.shape})")
print(f"- gdelt_sentiment.parquet ({df_sentiment.shape})")
print(f"- gdelt_impact.parquet ({df_impact.shape})")
print(f"- gdelt_articles.parquet ({df_articles.shape})")
```
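The wide table can be reconstructed at analysis time by joining the normalized tables back on the shared `conflict_id` surrogate key. The sketch below demonstrates the join with two tiny in-memory stand-ins for the dimension and volume tables; in practice the frames would come from `pd.read_parquet("gdelt_keys.parquet")` and so on.

```python
import pandas as pd

# Tiny stand-ins for two of the normalized tables (hypothetical rows)
df_dimensions = pd.DataFrame({
    "conflict_id": [0, 1],
    "mention_week": ["2026-01-12", "2026-01-12"],
    "conflict_country": ["UK", "FR"],
})
df_volume = pd.DataFrame({
    "conflict_id": [0, 1],
    "mentions_count": [16, 101],
})

# Joining on the shared surrogate key reconstructs the wide table
wide = df_dimensions.merge(df_volume, on="conflict_id", how="left")
print(wide.shape)  # (2, 4)
```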
Where to Find the Data
We have already processed the data and stored it in Parquet format, but the resulting files were still too large to upload to our GitHub repository; instead, you can access the normalized GDELT dataset files directly from our Kaggle folder.
ACLED Data
The ACLED data we used can be downloaded from the ACLED website after registering.
Data Cleaning and Imputation
For both datasets we chose to leave missing values as they were, since they were few in number and did not affect our analysis. Most of the data was already in a clean and usable format, so no significant data cleaning or imputation was needed.
We also followed the conventions already adopted by the datasets: the list of countries we consider includes some entities that are territories rather than sovereign states; for example, Greenland is treated separately from Denmark.
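The missing-value assessment amounts to a quick per-column check; a minimal sketch of the idea, using a toy frame with hypothetical values in place of the loaded GDELT table:

```python
import pandas as pd

# Toy stand-in for the loaded GDELT table (hypothetical values)
df = pd.DataFrame({
    "mentions_count": [16, 101, 8],
    "avg_tone": [-2.48, None, -5.19],
})

# Share of missing values per column: a low share across the board
# justifies skipping imputation
missing_share = df.isna().mean()
print(missing_share.to_dict())
```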
Data Processing and Analysis
For the data processing and analysis we mainly used Python with libraries such as Pandas, NumPy and Matplotlib to manipulate the data and create preliminary visualizations.
Limitations
Starting with the GDELT dataset, we had to deal with a large volume of data, which we reduced by selecting only the most relevant columns and aggregating at a weekly level. Moreover, GDELT is built from media reports, a large share of which come from the United States; this may introduce bias, as the data may not fully represent global media coverage of armed conflicts.
As for the ACLED dataset, it is based on reports from local sources, which may not be as comprehensive or accurate as those from international media outlets. Moreover, ACLED focuses on political violence and protests, which may not capture all types of armed conflict.
Plots
After the data processing pipeline, the final plots we produced are:
- Waffle Chart
- Bar Chart (Normal and Stacked)
- Scatter Plot (With Connected Scatter Plot)
- Choropleth Map
- Chord Diagram
- Cumulative Bar Chart
- Line Chart x2
- Violin Plot/Box Plot x2
- Word Cloud