| Title: | Get the Category of Content Hosted by a Domain |
|---|---|
| Description: | Get the category of content hosted by a domain. Use Shallalist <http://shalla.de/>, Virustotal (which provides access to lots of services) <https://www.virustotal.com/>, Alexa <https://aws.amazon.com/awis/>, DMOZ <https://curlie.org/>, University Domain list <https://github.com/Hipo/university-domains-list>, or validated machine learning classifiers based on Shallalist data to learn about the kind of content hosted by a domain. |
| Authors: | Gaurav Sood [aut, cre] |
| Maintainer: | Gaurav Sood <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1 |
| Built: | 2024-11-17 04:11:12 UTC |
| Source: | https://github.com/cran/rdomains |
Want to know what kind of content is hosted on a domain? Get the results quickly using rdomains. The package provides access to the Virustotal API, Shallalist, Brightcloud, Alexa (via AWS), and a validated ML model based on Shallalist data to predict the content hosted by a domain.
To learn how to use rdomains, see this vignette: ../doc/rdomains.html.
Gaurav Sood
Uses a validated ML model based on keywords in the domain name and suffix to predict the probability that the domain hosts adult content. For more information, see https://github.com/themains/keyword_porn
adult_ml1_cat(domains = NULL)
domains: required; string; vector of domain names
data.frame with original list and content category of the domains
## Not run: adult_ml1_cat("http://www.google.com") ## End(Not run)
To learn how to get the Access Key ID and Secret Access Key, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html, or click on your username in the AWS console followed by "Security credentials". Either pass the access key and secret to the function or set the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. These environment variables persist within an R session.
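For instance, a minimal sketch (with placeholder credentials) of setting the environment variables once so that alexa_cat() can be called without passing key and secret:

# Set AWS credentials for the current R session (placeholder values)
Sys.setenv(AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID",
           AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY")
# The credentials are now picked up from the environment
alexa_cat(domain = "http://www.google.com")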
alexa_cat(domain = NULL, key = NULL, secret = NULL)
domain: domain name
key: Alexa Access Key ID
secret: Alexa Secret Access Key
data.frame with 2 columns: Title and AbsolutePath
https://docs.aws.amazon.com/AlexaWebInfoService/latest/
## Not run: alexa_cat(domain = "http://www.google.com") ## End(Not run)
Returns the category of content from Brightcloud.
brightcloud_cat(domain = NULL, key = NULL, secret = NULL)
domain: domain name
key: Brightcloud API consumer key
secret: Brightcloud API consumer secret
Get the API Consumer Key and Secret from http://www.brightcloud.com/.
named list
## Not run: brightcloud_cat("http://www.google.com", key = "XXXX", secret = "XXXX") ## End(Not run)
Fetches the category (or categories) of content hosted by a domain according to DMOZ. The function checks whether a path to the DMOZ file is provided by the user. If not, it looks for dmoz_domain_category.csv in the working directory. It also returns results for prominent subdomains.
dmoz_cat(domains = NULL, use_file = NULL)
domains: vector of domain names
use_file: path to the DMOZ file, which can be downloaded using get_dmoz_data
data.frame with original list and content category of the domain
## Not run: 
dmoz_cat(domains = "http://www.google.com")
dmoz_cat(domains = c("http://www.google.com", "http://plus.google.com"))
## End(Not run)
Get the Alexa Top 1M most-visited domains list. These data can be used to weight the classification error.
get_alexa_data(outdir = ".", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
https://aws.amazon.com/marketplace/pp/B07QK2XWNV
## Not run: get_alexa_data() ## End(Not run)
Downloads, unzips, and saves an archived version of the DMOZ data. For more details, see: https://github.com/themains/rdomains/tree/master/data-raw/dmoz/
get_dmoz_data(outdir = ".", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
## Not run: get_dmoz_data() ## End(Not run)
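A minimal workflow sketch combining the two functions, assuming get_dmoz_data() writes dmoz_domain_category.csv to the working directory (the default file that dmoz_cat() looks for):

# Download and unzip the archived DMOZ data into the working directory
get_dmoz_data()
# Classify domains against the local file
dmoz_cat(domains = c("http://www.google.com", "http://plus.google.com"),
         use_file = "dmoz_domain_category.csv")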
The Shalla service has been discontinued; we downloaded the last copy (1/14/22). For more information, see the data-raw folder on GitHub. Downloads, unzips, and saves the last available version of the Shallalist data. By default, the data are saved as shalla_domain_category.csv.
get_shalla_data(outdir = "./", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
## Not run: get_shalla_data() ## End(Not run)
ML Model
glm_shalla
A list
Gaurav Sood
ML model based on Shallalist data, using keywords and domain suffixes.
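A minimal sketch of loading and inspecting the bundled model object; since the exact structure of the list is not documented here, str() is used to explore it:

# Load the model object shipped with the package
data("glm_shalla", package = "rdomains")
# Inspect the top level of the list
str(glm_shalla, max.level = 1)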
Based on a slightly amended version of the regular expression used to classify news and non-news in "Exposure to ideologically diverse news and opinion on Facebook" by Bakshy, Messing, and Adamic. Science. 2015.
not_news(url_list = NULL)
url_list: vector of URLs
Amendment: sport rather than sports
A URL containing any of the following words is classified as soft news: "sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget"
A URL containing any of the following words is classified as hard news: "politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration"
Note that the classification is based on patterns observed in a small set of domains. See the paper for details.
data.frame with 3 columns: url, not_news, news
https://www.science.org/doi/10.1126/science.aaa1160
## Not run: 
not_news("http://www.bbc.com/sport")
not_news(c("http://www.bbc.com/sport", "http://www.washingtontimes.com/news/politics/"))
## End(Not run)
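The classification reduces to matching the URL against the two patterns quoted above. A minimal sketch of the underlying idea (not necessarily the package's exact implementation):

soft_pattern <- "sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget"
hard_pattern <- "politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration"
urls <- c("http://www.bbc.com/sport",
          "http://www.washingtontimes.com/news/politics/")
# A URL matching the soft-news pattern is treated as not news;
# one matching the hard-news pattern is treated as news
data.frame(url      = urls,
           not_news = grepl(soft_pattern, urls),
           news     = grepl(hard_pattern, urls))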
Fetches the category of content hosted by a domain according to Shallalist. The function checks whether a path to the Shalla file is provided by the user. If not, it looks for shalla_domain_category.csv in the working directory.
shalla_cat(domains = NULL, use_file = NULL)
domains: vector of domain names
use_file: path to the latest Shallalist file, downloaded using get_shalla_data
data.frame with original list and content category of the domain
## Not run: shalla_cat(domains = "http://www.google.com") ## End(Not run)
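A minimal workflow sketch, assuming get_shalla_data() writes shalla_domain_category.csv to the working directory (the default file that shalla_cat() looks for):

# Download and unzip the archived Shallalist data into the working directory
get_shalla_data()
# Classify a domain against the local file
shalla_cat(domains = "http://www.google.com",
           use_file = "shalla_domain_category.csv")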
Fetches the university domain JSON from: https://raw.githubusercontent.com/Hipo/university-domains-list/master/world_universities_and_domains.json
uni_cat(domains = NULL)
domains: vector of domain names
data.frame with the original list and all the other columns from the university JSON
## Not run: uni_cat(domains = "http://www.google.com") ## End(Not run)
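For reference, a minimal sketch of fetching the same university domains JSON directly with the jsonlite package (an illustration of the data source, not necessarily how uni_cat() is implemented):

library(jsonlite)
# Read the world universities and domains list into a data.frame
unis <- fromJSON("https://raw.githubusercontent.com/Hipo/university-domains-list/master/world_universities_and_domains.json")
# Peek at a few university names
head(unis$name)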
Returns the category of content from 6 major services: BitDefender, Dr. Web, Alexa (DMOZ), Google, Websense, and Trendmicro. Not all services will have categories for all domains. When a category is not returned for a particular domain, NA is returned.
virustotal_cat(domain = NULL, apikey = NULL)
domain: domain name
apikey: Virustotal API key
Get the API Access Key from http://www.virustotal.com/. Either pass the API key to the function or set the environment variable VirustotalToken. Environment variables persist within an R session.
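A minimal sketch (with a placeholder key) of setting the environment variable once per session so that virustotal_cat() can be called without passing apikey:

# Set the Virustotal API key for the current R session (placeholder value)
Sys.setenv(VirustotalToken = "YOUR_VIRUSTOTAL_API_KEY")
# The key is now picked up from the environment
virustotal_cat("http://www.google.com")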
data.frame with 7 columns: domain, bitdefender, dr_web, alexa, google, websense, trendmicro
https://developers.virustotal.com/v2.0/reference
## Not run: virustotal_cat("http://www.google.com") ## End(Not run)