# Setup
## Preconditions
1. Ubuntu 22.04 LTS
2. Python installed (recommended version >= 3.10) (`apt install python3`)
3. Pip installed (recommended version >= 24.1) (`apt install python3-pip`)
4. Optional: pipenv (recommended version >= 2024.0.1) (`pip3 install pipenv`)
### Install MaxMind Database
```bash
sudo add-apt-repository ppa:maxmind/ppa
sudo apt update
sudo apt install geoipupdate

# Register a new MaxMind account: https://www.maxmind.com/en/geolite2/signup
# Add your AccountID and LicenseKey to /etc/GeoIP.conf (root permissions required)
sudo vi /etc/GeoIP.conf

# We used the GeoLite2-Country_20240517.tar.gz database. Unfortunately, MaxMind
# does not archive files older than 30 days, and providing our database file
# would violate the license agreement. We therefore recommend updating to the
# latest GeoIP database.
sudo geoipupdate -v

# Install the geoip2 library via pip
pip install "geoip2>=4.8.0"
```
### Install go 1.23.0
```
wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz
sudo tar -xvf go1.23.0.linux-amd64.tar.gz go/
sudo mv go /usr/local
export GOROOT=/usr/local/go
mkdir -p "$HOME/go"
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH
# To keep these settings across sessions, add the export lines above to ~/.profile
source ~/.profile
go version
```
### Install cdncheck
Source: https://github.com/projectdiscovery/cdncheck

`go install -v github.com/projectdiscovery/cdncheck/cmd/cdncheck@latest`

### Install Python packages for fast DNS resolution
```
pip install "aiodns>=3.2.0"
```
# Create Tranco 1M Geolocation

## Step 1: Create ccTLD list
- Source: `tranco_V99PN.csv` (Tranco 1M list)
- Output: `tranco_V99PN_ccTLDs.csv` (359,137 sites)
- Script: `./01_extract_ccTLDs.sh`

The script uses `ccTLD.csv`, which lists each ccTLD with its ISO country code and country name. The list was created using https://www.worldstandards.eu/other/tlds/ (23 January 2024) and https://en.wikipedia.org/wiki/ISO_3166-1.
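
The classification described above can be sketched in Python as follows. This is only an illustration, not the logic of `01_extract_ccTLDs.sh` itself: it assumes the Tranco list has the columns `rank,domain` and that `ccTLD.csv` holds the ccTLD in its first column and the ISO country code in its second. The function names `load_cctlds` and `split_by_cctld` are hypothetical.

```python
import csv


def load_cctlds(cctld_lines):
    """Map each ccTLD (lower case, no leading dot) to its ISO country code."""
    mapping = {}
    for row in csv.reader(cctld_lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0].strip().lstrip(".").lower()] = row[1].strip()
    return mapping


def split_by_cctld(tranco_lines, cctlds):
    """Split Tranco rows into ccTLD rows (with country code) and all other rows."""
    cc_rows, other_rows = [], []
    for row in csv.reader(tranco_lines):
        if len(row) < 2:
            continue
        rank, domain = row[0], row[1]
        # The TLD is the label after the last dot of the domain name.
        tld = domain.rsplit(".", 1)[-1].lower()
        if tld in cctlds:
            cc_rows.append((rank, domain, cctlds[tld]))
        else:
            other_rows.append((rank, domain))
    return cc_rows, other_rows
```

Both functions accept any iterable of text lines, so an open file handle can be passed directly.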
## Step 2: Create gTLD list
- Source: `tranco_V99PN.csv` and `tranco_V99PN_ccTLDs.csv`
- Output: `tranco_V99PN_gTLDs.csv` (640,863 sites)
- Script: `./02_create_gTLDs.py`

The script extracts all domains with a gTLD from the Tranco 1M list (i.e., all entries that are not in the ccTLD list) and stores them in `tranco_V99PN_gTLDs.csv`.
## Step 3: Resolve gTLDs to IP addresses
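
Conceptually, the gTLD list is the complement of the ccTLD list within the Tranco 1M list. A minimal sketch of that idea, assuming both files carry the domain in their second column (the helper `complement_rows` is hypothetical and not taken from `02_create_gTLDs.py`):

```python
def complement_rows(all_rows, cctld_rows, domain_column=1):
    """Return the rows of all_rows whose domain does not occur in cctld_rows."""
    cc_domains = {row[domain_column] for row in cctld_rows}
    return [row for row in all_rows if row[domain_column] not in cc_domains]


# Sanity check on the counts reported in Steps 1 and 2:
# ccTLD sites plus gTLD sites should give the full Tranco 1M list.
assert 359_137 + 640_863 == 1_000_000
```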
- Source: `tranco_V99PN_gTLDs.csv`
- Output: `tranco_V99PN_gTLDs_resolved.csv`
- Script: `python3 03_gTLD2ip.py` (input and output files are defined in the code)

The script resolves domains to IP addresses using four free public DNS servers from Google (https://developers.google.com/speed/public-dns) and Quad9 (https://www.quad9.net/de/service/service-addresses-and-features/). Due to the high number of requests, errors (e.g., timeouts) can occur during DNS resolution. To minimize these errors, failed requests are retried in two additional rounds.
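
The retry scheme can be sketched as follows. This is a simplified illustration under assumptions, not the code of `03_gTLD2ip.py`: the `resolve` coroutine is supplied by the caller, and failures are expected to be exceptions whose first argument is the numeric error code, as in the `Error: (12, '...')` entries of the result files. With `aiodns`, `resolve` could wrap `DNSResolver.query(domain, "A")`, but that wiring is not shown here.

```python
import asyncio

# Error codes that are worth retrying: no data (1), could not contact
# DNS servers (11), timeout (12). Code 4 (domain not found) is final.
RETRYABLE = {1, 11, 12}


async def resolve_round(domains, resolve, concurrency=100):
    """Resolve all domains once; return (domain, result, error_code) tuples."""
    sem = asyncio.Semaphore(concurrency)  # limit parallel requests

    async def one(domain):
        async with sem:
            try:
                return domain, await resolve(domain), None
            except Exception as exc:  # assumption: args[0] is the error code
                return domain, None, exc.args[0] if exc.args else None

    return await asyncio.gather(*(one(d) for d in domains))


async def resolve_with_retries(domains, resolve, rounds=3):
    """Run up to `rounds` rounds, retrying only domains with retryable errors."""
    resolved, failed = {}, {}
    pending = list(domains)
    for _ in range(rounds):
        if not pending:
            break
        retry = []
        for domain, result, code in await resolve_round(pending, resolve):
            if code is None:
                resolved[domain] = result
                failed.pop(domain, None)  # a retry succeeded
            else:
                failed[domain] = code
                if code in RETRYABLE:
                    retry.append(domain)
        pending = retry
    return resolved, failed
```

The default `rounds=3` mirrors the first round plus the two additional rounds described above; the concurrency limit of 100 is an arbitrary placeholder.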

- Runtime 1st round: real    150m37.440s, user    2m43.530s, sys     0m19.682s
- Resolved: 520,185 `cat tranco_V99PN_gTLDs_resolved.csv | grep -v "Error:" | wc -l`
- Error sum: 46,440+29,970+5,911+38,357=120,678
	- Error: (1, 'DNS server returned answer with no data'): 46,440 `cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (1," | wc -l`
	- Error: (4, 'Domain name not found'): 29,970 `cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (4," | wc -l` 
	- Error: (11, 'Could not contact DNS servers'): 5,911 `cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (11," | wc -l`
	- Error: (12, 'Timeout while contacting DNS servers'): 38,357 `cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (12," | wc -l`
### 2nd round: all domains with error codes 1, 11, and 12
Create error file for 2nd round:
```
cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (1," > tranco_V99PN_gTLDs_resolved_errors.csv
cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (11," >> tranco_V99PN_gTLDs_resolved_errors.csv
cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (12," >> tranco_V99PN_gTLDs_resolved_errors.csv
```
(Check: 46,440+5,911+38,357=90,708)
- Script: `./03_gTLD2ip_2nd.py` (input and output files are defined in the code)
- Input: `tranco_V99PN_gTLDs_resolved_errors.csv`
- Output: `tranco_V99PN_gTLDs_resolved_2nd.csv`

- Runtime 2nd round: real    19m37.187s, user    0m7.066s, sys     0m1.505s
- Resolved: 27,690 `cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep -v "Error:" | wc -l`
- Error sum: 46,908+1,524+4,854+9,732=63,018
	- Error: (1, 'DNS server returned answer with no data'): 46,908 `cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (1," | wc -l`
	- Error: (4, 'Domain name not found'): 1,524 `cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (4," | wc -l` 
	- Error: (11, 'Could not contact DNS servers'): 4,854 `cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (11," | wc -l`
	- Error: (12, 'Timeout while contacting DNS servers'): 9,732 `cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (12," | wc -l`
### 3rd round: all domains with error codes 1, 11, and 12
Create error file for 3rd round:
```
cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (1," > tranco_V99PN_gTLDs_resolved_errors_2nd.csv
cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (11," >> tranco_V99PN_gTLDs_resolved_errors_2nd.csv
cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (12," >> tranco_V99PN_gTLDs_resolved_errors_2nd.csv
```
(Check: 46,908+4,854+9,732=61,494)
- Script: `03_gTLD2ip_3rd.py`
- Input: `tranco_V99PN_gTLDs_resolved_errors_2nd.csv`
- Output: `tranco_V99PN_gTLDs_resolved_3rd.csv`

- Runtime 3rd round: real    11m53.154s, user    0m3.683s, sys     0m1.162s
- Resolved: 1,787 `cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep -v "Error:" | wc -l`
- Error sum: 47,271+139+4,330+7,967=59,707
	- Error: (1, 'DNS server returned answer with no data'): 47,271 `cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (1," | wc -l`
	- Error: (4, 'Domain name not found'): 139 `cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (4," | wc -l`
	- Error: (11, 'Could not contact DNS servers'): 4,330 `cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (11," | wc -l`
	- Error: (12, 'Timeout while contacting DNS servers'): 7,967 `cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (12," | wc -l`
### Concatenate results from all 3 rounds
- Concatenate all successfully resolved gTLD domains into `tranco_V99PN_gTLDs_resolved_final.csv` (549,662 lines: `cat tranco_V99PN_gTLDs_resolved_final.csv | wc -l`):
```
cat tranco_V99PN_gTLDs_resolved.csv | grep -v "Error:" > tranco_V99PN_gTLDs_resolved_final.csv
cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep -v "Error:" >> tranco_V99PN_gTLDs_resolved_final.csv
cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep -v "Error:" >> tranco_V99PN_gTLDs_resolved_final.csv
```
- All domains with error code 4, 'Domain name not found': 29,970+1,524+139=31,633
```
cat tranco_V99PN_gTLDs_resolved.csv | grep "Error: (4," | wc -l
29970
cat tranco_V99PN_gTLDs_resolved_2nd.csv | grep "Error: (4," | wc -l
1524
cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (4," | wc -l
139
```
- All domains that still have other errors (codes 1, 11, 12) after the 3rd round: 47,271+4,330+7,967=59,568
```
cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (1," | wc -l
47271
cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (11," | wc -l
4330
cat tranco_V99PN_gTLDs_resolved_3rd.csv | grep "Error: (12," | wc -l
7967
```
Check: 549,662+31,633+59,568=640,863
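
The checks in this step can also be verified programmatically. The sketch below only re-computes the arithmetic from the counts reported above; the helper name `check_rounds` is hypothetical. Each round's input must equal its resolved domains plus its errors, and the next round's input consists of the domains with error codes 1, 11, and 12.

```python
TOTAL_GTLD = 640_863
RETRYABLE = (1, 11, 12)

# Resolved domains and error counts per error code, as reported for each round.
ROUNDS = [
    {"resolved": 520_185, "errors": {1: 46_440, 4: 29_970, 11: 5_911, 12: 38_357}},
    {"resolved": 27_690, "errors": {1: 46_908, 4: 1_524, 11: 4_854, 12: 9_732}},
    {"resolved": 1_787, "errors": {1: 47_271, 4: 139, 11: 4_330, 12: 7_967}},
]


def check_rounds(total, rounds):
    """Verify that every round accounts for all of its input domains."""
    expected_input = total
    for r in rounds:
        assert r["resolved"] + sum(r["errors"].values()) == expected_input
        # Only domains with retryable errors enter the next round.
        expected_input = sum(r["errors"][code] for code in RETRYABLE)
    resolved = sum(r["resolved"] for r in rounds)
    not_found = sum(r["errors"][4] for r in rounds)
    remaining = expected_input  # retryable errors left after the last round
    assert resolved + not_found + remaining == total
    return resolved, not_found, remaining


print(check_rounds(TOTAL_GTLD, ROUNDS))  # (549662, 31633, 59568)
```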
## Step 4: Remove CDN Sites
- Input: `tranco_V99PN_gTLDs_resolved_final.csv` 
1. Create a file (`cdn.txt`) with all CDN entries: `cat tranco_V99PN_gTLDs_resolved_final.csv | awk -F',' '{print $3}' | cdncheck -cdn -resp -silent -nc -o cdn.txt`
2. Remove duplicates: `cat cdn.txt | sort | uniq > cdn_uniq.txt` (14,621 unique CDN entries: `cat cdn_uniq.txt | wc -l`)
3. Extract websites served by CDNs: `./04_extract_gTLDs_cdn.sh`
	- Output: `tranco_V99PN_gTLDs_resolved_final_cdn.csv` (35,221 `cat tranco_V99PN_gTLDs_resolved_final_cdn.csv | wc -l`)
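
The extraction in item 3 amounts to a membership test of each site's IP address against the CDN list. A minimal sketch under two assumptions: the IP address is in the third column of the resolved file (as in the `awk` call above), and each line of `cdn_uniq.txt` starts with the IP address, possibly followed by the CDN details printed by `cdncheck -resp`. The function names are hypothetical and do not come from `04_extract_gTLDs_cdn.sh`.

```python
def load_cdn_ips(lines):
    """Collect the first whitespace-separated token of each non-empty line."""
    ips = set()
    for line in lines:
        parts = line.split()
        if parts:
            ips.add(parts[0])
    return ips


def partition_by_cdn(rows, cdn_ips, ip_column=2):
    """Split rows into (served by a CDN, not served by a CDN)."""
    cdn_rows, nocdn_rows = [], []
    for row in rows:
        target = cdn_rows if row[ip_column] in cdn_ips else nocdn_rows
        target.append(row)
    return cdn_rows, nocdn_rows
```

Because every row lands in exactly one of the two lists, their sizes always add up to the size of the input.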
## Step 5: Resolve Geolocation from remaining gTLDs
- Create a file with the websites to geolocate (`tranco_V99PN_gTLDs_resolved_final.csv` without the websites in `tranco_V99PN_gTLDs_resolved_final_cdn.csv`): `./05_create_gTLDs_without_cdn.py`
- Output: `tranco_V99PN_gTLDs_resolved_final_nocdn.csv`

Check (514,441+35,221=549,662):
```
cat tranco_V99PN_gTLDs_resolved_final_nocdn.csv | wc -l
514441
cat tranco_V99PN_gTLDs_resolved_final_cdn.csv | wc -l
35221
cat tranco_V99PN_gTLDs_resolved_final.csv | wc -l
549662
```

- Geolocate the remaining websites (input: `tranco_V99PN_gTLDs_resolved_final_nocdn.csv`): `./05_make_list_of_geolocation.sh tranco_V99PN_gTLDs_resolved_final_nocdn.csv`
- Output: `tranco_V99PN_gTLDs_resolved_final_nocdn_geolocation.csv`
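
A country lookup with the `geoip2` library might look like the sketch below. It is not the content of `05_make_list_of_geolocation.sh`: the function names are hypothetical, the database path must be supplied by the caller, and returning `"unknown"` for addresses without a database entry is an assumption modeled on the `unknown` entries counted in the analysis.

```python
try:
    from geoip2.errors import AddressNotFoundError
except ImportError:
    class AddressNotFoundError(Exception):
        """Stand-in so the sketch can be loaded without geoip2 installed."""


def open_reader(mmdb_path):
    """Open a GeoLite2 country database (requires the geoip2 package)."""
    import geoip2.database
    return geoip2.database.Reader(mmdb_path)


def country_code(reader, ip):
    """Return the ISO country code for ip, or 'unknown' if none is available."""
    try:
        iso_code = reader.country(ip).country.iso_code
    except (AddressNotFoundError, ValueError):
        # Address not in the database, or not a valid IP address string.
        return "unknown"
    return iso_code or "unknown"
```

A reader opened with `open_reader` should be closed with `reader.close()` once all lookups are done.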
## Step 6: Create final Tranco 1M Geolocation file
To create the final Tranco geolocation file, run `06_create_final_geolocation.sh`. This generates the files `tranco_V99PN_geolocation.csv` and `tranco_V99PN_gTLDs_resolved_final_nocdn_geolocation_sorted.csv`.
# Analysis

## SSO Privacy Leaks Analysis
All analyses are in the subfolder `ssoleaks`.

1. Get list of unique domains that exposed SSO privacy leaks
	- PL (10,931): `cat exported_lreq.csv | awk -F',' '{print $1}' | sed 's/"//' | sed 's/https:\/\///' | sort | uniq > 00_domains_pl.csv`
	- FL (2,947): `cat exported_fleaks.csv | awk -F',' '{print $1}' | sed 's/"//' | sed 's/https:\/\///' | sort | uniq > 00_domains_fl.csv`
	- EL (6): `cat exported_eleaks.csv | awk -F',' '{print $1}' | sed 's/"//' | sed 's/https:\/\///' | sort | uniq > 00_domains_el.csv`
2. Create geolocation files for each SSO privacy leak class
	- `./create_ssoprivacy_geolocation.sh 00_domains_pl.csv 00_geolocation_pl.csv`
	- `./create_ssoprivacy_geolocation.sh 00_domains_fl.csv 00_geolocation_fl.csv`
	- `./create_ssoprivacy_geolocation.sh 00_domains_el.csv 00_geolocation_el.csv`
3. Create geolocation files ordered by countries for Tranco 1M and each SSO privacy leak class
	- Tranco 1M:  `cat ../tranco_V99PN_geolocation.csv | awk -F',' '{print $4}' | sort | uniq -c | sort -rn > 01_geolocation_tranco1m_countries.csv`
	- PL: `cat 00_geolocation_pl.csv | awk -F',' '{print $4}' | sort | uniq -c | sort -rn > 01_geolocation_pl_countries.csv`
	- FL: `cat 00_geolocation_fl.csv | awk -F',' '{print $4}' | sort | uniq -c | sort -rn > 01_geolocation_fl_countries.csv`
	- EL: `cat 00_geolocation_el.csv | awk -F',' '{print $4}' | sort | uniq -c | sort -rn > 01_geolocation_el_countries.csv`
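
The `awk | sort | uniq -c | sort -rn` pipelines in item 3 count how often each country occurs in the fourth column. A rough Python equivalent, for readers who prefer to post-process the files in a script (the function name `count_countries` is hypothetical):

```python
from collections import Counter


def count_countries(rows, country_column=3):
    """Count the values in the country column, most frequent first."""
    counts = Counter(
        row[country_column] for row in rows if len(row) > country_column
    )
    return counts.most_common()
```

Unlike the shell pipeline, this sketch skips rows with fewer than four columns instead of counting them as an empty country, and countries with equal counts may appear in a different order than with `sort -rn`.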

Source of geolocation "(ccTLD)":
- `cat 00_geolocation_pl.csv | grep "(ccTLD)" | wc -l` (4,260)
- `cat 00_geolocation_fl.csv | grep "(ccTLD)" | wc -l` (1,362)

CDN websites:
- `cat 00_geolocation_pl.csv | grep -F "[cdn]" | wc -l` (991)
- `cat 00_geolocation_fl.csv | grep -F "[cdn]" | wc -l` (148)

Source of geolocation "(maxmind)":
- `cat 00_geolocation_pl.csv | grep "(maxmind)" | wc -l` (5,644)
- `cat 00_geolocation_fl.csv | grep "(maxmind)" | wc -l` (1,432)

Unknown countries due to DNS resolution errors:
- `cat 00_geolocation_pl.csv | grep -v -F "[cdn]" | grep "unknown" | wc -l` (36)
- `cat 00_geolocation_fl.csv | grep -v -F "[cdn]" | grep "unknown" | wc -l` (5)

Number of countries:
- `cat 01_geolocation_tranco1m_countries.csv | grep -v unknown | wc -l` (249)
- `cat 01_geolocation_pl_countries.csv | grep -v unknown | wc -l` (148)
- `cat 01_geolocation_fl_countries.csv | grep -v unknown | wc -l` (102)
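
As a plausibility check, the four geolocation categories reported above (ccTLD, CDN, MaxMind, unknown) should add up to the number of unique domains per leak class. The sketch below only re-computes this from the reported counts; the helper `unexplained` is hypothetical.

```python
# Counts as reported in this section for the PL and FL leak classes.
BREAKDOWN = {
    "PL": {"total": 10_931, "ccTLD": 4_260, "cdn": 991, "maxmind": 5_644, "unknown": 36},
    "FL": {"total": 2_947, "ccTLD": 1_362, "cdn": 148, "maxmind": 1_432, "unknown": 5},
}


def unexplained(entry):
    """Domains not covered by any category (0 if the counts are consistent)."""
    categories = sum(value for key, value in entry.items() if key != "total")
    return entry["total"] - categories


for name, entry in BREAKDOWN.items():
    print(name, unexplained(entry))  # prints "PL 0" and "FL 0"
```

That both differences are zero suggests the four categories are disjoint and cover every domain.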

Table 3:
- Number of sites for Tranco 1M: `cat 01_geolocation_tranco1m_countries.csv | grep "US"`
- Number of sites for PLs: `cat 01_geolocation_pl_countries.csv | grep "US"`
- Number of sites for FLs: `cat 01_geolocation_fl_countries.csv | grep "US"`
- Number of sites for ELs: `cat 01_geolocation_el_countries.csv | grep "US"`
## EU
- `eu_countries.csv` lists all EU member states (source: https://eur-lex.europa.eu/EN/legal-content/glossary/member-states.html)
- Script: `sum_eu_countries.sh <geolocation.csv>` sums the websites of all EU member states

- Tranco: `./sum_eu_countries.sh 01_geolocation_tranco1m_countries.csv` (188,553)
- PL: `./sum_eu_countries.sh 01_geolocation_pl_countries.csv` (3,023)
- FL: `./sum_eu_countries.sh 01_geolocation_fl_countries.csv` (905)
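
The summation can be sketched in Python as follows. This is an illustration, not the content of `sum_eu_countries.sh`: it assumes the input lines have the `uniq -c` layout (`<count> <country>`) and that the set of EU countries uses the same country notation as the geolocation files. The function name `sum_countries` and the example values are hypothetical.

```python
def sum_countries(count_lines, countries):
    """Sum the counts of all lines whose country is in the given set."""
    total = 0
    for line in count_lines:
        parts = line.split(None, 1)  # "<count> <country>"
        if len(parts) != 2:
            continue  # skip empty lines or lines without a country
        count, country = parts
        if country.strip() in countries:
            total += int(count)
    return total


# Usage with made-up numbers (not results from this study):
example = ["   120 US", "    45 DE", "    30 FR", "     7 unknown"]
print(sum_countries(example, {"DE", "FR"}))  # 75
```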
