# Categorization and Reputation
Website categorization was performed using Trustedsource (https://trustedsource.org/en/feedback/url) on September 13th, 2024.
## Preconditions
Ubuntu 22.04 LTS
## Step 1: Get list of unique websites/domains that exposed SSO privacy leaks
We begin by processing all Partial Leaks (PLs) identified using SSO-Monitor, which are stored in the file `exported_lreq.csv` (containing 11,189 AuthRequests from 10,931 websites). To create a list of unique websites responsible for these PLs, we execute the following command:
`cat exported_lreq.csv | awk -F',' '{print $1}' | sed 's/"//' | sort | uniq > 00_domains_pl.csv` (check: `cat 00_domains_pl.csv | wc -l` --> 10,931 websites).

We do the same for the other two SSO leak types:
- FL (2,947): `cat exported_fleaks.csv | awk -F',' '{print $1}' | sed 's/"//' | sort | uniq > 00_domains_fl.csv` (check: `cat 00_domains_fl.csv | wc -l` --> 2,947 websites)
- EL (6): `cat exported_eleaks.csv | awk -F',' '{print $1}' | sed 's/"//' | sort | uniq > 00_domains_el.csv` (check: `cat 00_domains_el.csv | wc -l` --> 6 websites)
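
The three extraction commands above follow one pattern, which could be wrapped in a small function. The following is a minimal sketch, not part of the original tooling: the function name `unique_domains` is hypothetical, and it assumes the first CSV field never contains a comma. Unlike the `sed` call above, it strips every double quote from the field, not only the first one.

```bash
#!/usr/bin/env bash
# Sketch: extract the unique first-column values (domains) from an export file.
# Assumption: the first CSV field never contains a comma.
unique_domains() {
    local infile="$1"
    # cut takes the first comma-separated field, tr removes all double quotes,
    # sort -u sorts and deduplicates in one step.
    cut -d',' -f1 "$infile" | tr -d '"' | sort -u
}

# Process each leak type whose export file is present in the current directory.
for pair in lreq:pl fleaks:fl eleaks:el; do
    src="exported_${pair%%:*}.csv"
    dst="00_domains_${pair##*:}.csv"
    if [ -f "$src" ]; then
        unique_domains "$src" > "$dst"
        echo "$dst: $(wc -l < "$dst") unique domains"
    fi
done
```

The printed line counts can be compared against the expected values listed above (10,931, 2,947, and 6).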
## Step 2: Categorize all websites using trustedsource.org
The script `url_categorization.sh` takes a file containing a list of websites as its first parameter and queries trustedsource.org (from Trellix) for each website's categorization and reputation information. The script limits the number of HTTP requests to 45-50 per minute to avoid overloading trustedsource.org. Each raw HTTP response is saved as a file (`domain.log`) in the subfolder `data/` for later analysis.

```bash
$ time ./url_categorization.sh 00_domains_pl.csv >> url_categorization.csv

real    256m59.515s
user    8m32.623s
sys     3m31.879s
```
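
The rate limiting described above can be illustrated with a simple loop that pauses between requests. This is a sketch under stated assumptions, not the contents of `url_categorization.sh`: `fetch_domain` is a hypothetical placeholder for the real HTTP request, `categorize_all` is a hypothetical name, and the log file is assumed to be named after the domain.

```bash
#!/usr/bin/env bash
# Sketch of a rate-limited lookup loop; NOT the original url_categorization.sh.

# Hypothetical placeholder: replace the body with the real HTTP request.
fetch_domain() {
    printf 'placeholder response for %s\n' "$1"
}

# categorize_all FILE [DELAY_SECONDS]
# Calls fetch_domain once per non-empty line of FILE, stores each response
# as data/<domain>.log, and sleeps DELAY_SECONDS after every request.
categorize_all() {
    local list="$1" delay="${2:-1.3}" domain
    mkdir -p data
    while IFS= read -r domain; do
        [ -z "$domain" ] && continue       # skip empty lines
        fetch_domain "$domain" > "data/${domain}.log"
        sleep "$delay"                     # a 1.3 s pause caps the rate at about 46 requests/min
    done < "$list"
}
```

With the default pause of 1.3 seconds, 10,931 domains need at least about 237 minutes of waiting alone, which is of the same order as the 257 minutes of wall-clock time measured above.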
## Step 3: Analysis
The script `url_analysis.sh` takes a file with a list of websites as its first parameter. It uses the files created by `url_categorization.sh` in Step 2 to output a CSV file with the analysis data. Each row has the structure `url,status,categorization,reputation`.

- PL: `./url_analysis.sh 00_domains_pl.csv > url_categorization_pl.csv`
- FL: `./url_analysis.sh 00_domains_fl.csv > url_categorization_fl.csv`
- EL: `./url_analysis.sh 00_domains_el.csv > url_categorization_el.csv`
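
Given the column layout `url,status,categorization,reputation`, every tally in this step counts the distinct values of one column. A minimal sketch of that pattern as a reusable function follows; the name `tally_column` is hypothetical, and the sketch assumes that no field contains a comma.

```bash
# tally_column FILE N
# Counts how often each distinct value occurs in column N of a
# comma-separated file and lists the values from most to least frequent.
# Assumption: fields do not contain commas.
tally_column() {
    local file="$1" col="$2"
    awk -F',' -v c="$col" '{ print $c }' "$file" | sort | uniq -c | sort -rn
}
```

For example, `tally_column url_categorization_pl.csv 2` tallies the status column, `3` the category column, and `4` the reputation column.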

#### Number of categorized URLs
```bash
$ cat url_categorization_pl.csv | awk -F ',' '{print $2}' | sort | uniq -c | sort -rn
  10323 Categorized URL
    608 Uncategorized URL
$ cat url_categorization_fl.csv | awk -F ',' '{print $2}' | sort | uniq -c | sort -rn
   2901 Categorized URL
     46 Uncategorized URL
$ cat url_categorization_el.csv | awk -F ',' '{print $2}' | sort | uniq -c | sort -rn
      6 Categorized URL
```
#### Reputation Analysis
```bash
$ cat url_categorization_pl.csv | awk -F ',' '{print $4}' | sort | uniq -c | sort -rn
  10107 Minimal Risk
    712 Unverified
     89 Medium Risk
     23 High Risk
$ cat url_categorization_fl.csv | awk -F ',' '{print $4}' | sort | uniq -c | sort -rn
   2857 Minimal Risk
     69 Unverified
     19 Medium Risk
      2 High Risk
$ cat url_categorization_el.csv | awk -F ',' '{print $4}' | sort | uniq -c | sort -rn
      6 Minimal Risk
```
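
The absolute counts above are easier to compare across leak types when expressed as shares of all rows. The following sketch prints the count, the percentage, and the value for each distinct entry of a column; the function name `share_column` is hypothetical and not part of the original scripts.

```bash
# share_column FILE N
# For each distinct value in column N of a comma-separated file, prints
# "<count><TAB><percentage of all rows><TAB><value>", highest count first.
share_column() {
    local file="$1" col="$2"
    awk -F',' -v c="$col" '
        { count[$c]++; total++ }
        END {
            for (v in count)
                printf "%d\t%.2f%%\t%s\n", count[v], 100 * count[v] / total, v
        }
    ' "$file" | sort -t $'\t' -k1,1nr
}
```

Calling `share_column url_categorization_pl.csv 4` would list the reputation classes of the PL websites together with their relative shares.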
#### Category Analysis
```bash
$ cat url_categorization_pl.csv | awk -F ',' '{print $3}' | sort | uniq -c | sort -rn
   1449 Online Shopping
   1391 Business
    771 Entertainment
    727 General News
    669 Travel
    608 unknown
    507 Internet Services
    467 Marketing/Merchandising
    394 Education/Reference
    284 Games
    279 Gambling
    256 Forum/Bulletin Boards
    239 Fashion/Beauty
    237 Sports
    207 Blogs/Wiki
    201 Finance/Banking
    181 Job Search
    167 Health
    157 Real Estate
    147 Software/Hardware
    130 Public Information
    102 Recreation/Hobbies
    101 Motor Vehicles
     95 Portal Sites
     78 Non-Profit/Advocacy/NGO
     76 Auctions/Classifieds
     68 Restaurants
     66 Pornography
     62 Interactive Web Applications
     60 Technical/Business Forums
     56 Content Server
     47 Art/Culture/Heritage
     46 Dating/Personals
     45 Government/Military
     42 Social Networking
     37 Weapons
     33 Potential Illegal Software
     29 Internet Radio/TV
     28 Search Engines
     26 Politics/Opinion
     25 PUPs (potentially unwanted programs)
     25 Alcohol
     23 Religion/Ideologies
     20 Stock Trading
     20 Shareware/Freeware
     18 Technical Information
     17 Personal Pages
     17 Personal Network Storage
     17 Media Sharing
     15 Web Mail
     15 Humor/Comics
     14 Major Global Religions
     13 Streaming Media
     13 Provocative Attire
     13 Chat
     12 Malicious Sites
     11 Parked Domain
      9 Pharmacy
      8 Professional Networking
      7 Tobacco
      7 Mobile Phone
      5 Sexual Materials
      4 Web Ads
      4 Media Downloads
      4 Anonymizers
      3 School Cheating Information
      3 Resource Sharing
      3 Gambling Related
      3 Digital Postcards
      2 Web Phone
      2 Text Translators
      2 Nudity
      2 Discrimination
      1 Visual Search Engine
      1 Spyware/Adware/Keyloggers
      1 Spam URLs
      1 Profanity
      1 P2P/File Sharing
      1 Instant Messaging
      1 Information Security
      1 Game/Cartoon Violence
      1 Drugs
      1 Anonymizing Utilities
$ cat url_categorization_fl.csv | awk -F ',' '{print $3}' | sort | uniq -c | sort -rn
    417 Online Shopping
    317 Entertainment
    226 Business
    218 Travel
    179 Marketing/Merchandising
    156 General News
    152 Games
    112 Education/Reference
    102 Blogs/Wiki
     98 Internet Services
     92 Fashion/Beauty
     67 Sports
     57 Public Information
     52 Portal Sites
     50 Recreation/Hobbies
     50 Health
     46 unknown
     44 Gambling
     39 Forum/Bulletin Boards
     37 Software/Hardware
     33 Real Estate
     32 Finance/Banking
     30 Non-Profit/Advocacy/NGO
     27 Dating/Personals
     25 Social Networking
     23 Technical/Business Forums
     20 Restaurants
     20 Art/Culture/Heritage
     19 Auctions/Classifieds
     16 Job Search
     13 Government/Military
     12 Politics/Opinion
     11 Internet Radio/TV
     10 Media Sharing
     10 Humor/Comics
      9 Weapons
      9 Shareware/Freeware
      8 Religion/Ideologies
      8 Motor Vehicles
      6 Provocative Attire
      6 Professional Networking
      6 Pornography
      6 Major Global Religions
      6 Chat
      6 Alcohol
      5 Streaming Media
      5 Potential Illegal Software
      5 Personal Pages
      5 Interactive Web Applications
      4 Technical Information
      4 Stock Trading
      4 Personal Network Storage
      4 Content Server
      3 Sexual Materials
      3 Mobile Phone
      3 Gambling Related
      2 Web Mail
      2 Search Engines
      2 Pharmacy
      2 PUPs (potentially unwanted programs)
      2 Digital Postcards
      1 Web Phone
      1 Visual Search Engine
      1 Spyware/Adware/Keyloggers
      1 School Cheating Information
      1 Resource Sharing
      1 Profanity
      1 Media Downloads
      1 Discrimination
      1 Anonymizing Utilities
      1 Anonymizers
$ cat url_categorization_el.csv | awk -F ',' '{print $3}' | sort | uniq -c | sort -rn
      3 Job Search
      1 Sports
      1 Online Shopping
      1 Business
```
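
In the results above, the number of `unknown` categories equals the number of `Uncategorized URL` statuses for both PL (608) and FL (46). A small consistency check along these lines could be sketched as follows; the function name `check_unknown_matches` is hypothetical.

```bash
# check_unknown_matches FILE
# Succeeds (exit status 0) if the number of rows with status
# "Uncategorized URL" (column 2) equals the number of rows with
# category "unknown" (column 3); prints both counts either way.
check_unknown_matches() {
    awk -F',' '
        $2 == "Uncategorized URL" { u++ }
        $3 == "unknown"           { k++ }
        END {
            printf "uncategorized=%d unknown=%d\n", u, k
            exit (u == k ? 0 : 1)
        }
    ' "$1"
}
```

A non-zero exit status from `check_unknown_matches url_categorization_pl.csv` would indicate rows whose status and category disagree.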

Table 5 in the paper was created based on these results.
