| Title: | Get the Category of Content Hosted by a Domain |
|---|---|
| Description: | Get the category of content hosted by a domain. Use Shallalist <http://shalla.de/>, Virustotal (which provides access to lots of services) <https://www.virustotal.com/>, Alexa <https://aws.amazon.com/awis/>, DMOZ <https://curlie.org/>, University Domain list <https://github.com/Hipo/university-domains-list>, or validated machine learning classifiers based on Shallalist data to learn about the kind of content hosted by a domain. |
| Authors: | Gaurav Sood [aut, cre] |
| Maintainer: | Gaurav Sood <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.1 |
| Built: | 2024-11-17 04:11:12 UTC |
| Source: | https://github.com/cran/rdomains |
Want to know what kind of content is hosted on a domain? Get the results quickly using rdomains. The package provides access to the Virustotal API, Shallalist, Brightcloud, Alexa (via AWS), and a validated ML model based on Shallalist data to predict the content hosted by a domain.
To learn how to use rdomains, see this vignette: ../doc/rdomains.html.
Gaurav Sood
Uses a validated ML model based on keywords in the domain name and suffix to predict the probability that the domain hosts adult content. For more information, see https://github.com/themains/keyword_porn
adult_ml1_cat(domains = NULL)
domains: required; string; vector of domain names
data.frame with original list and content category of the domains
## Not run: adult_ml1_cat("http://www.google.com") ## End(Not run)
To learn how to get the Access Key ID and Secret Access Key, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html, or click on your username in the AWS console followed by "Security credentials". Either pass the access key and secret to the function or set the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. These environment variables persist within an R session.
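For instance, a minimal sketch (with placeholder credentials) of setting the environment variables once so that alexa_cat() can be called without passing key and secret:

# Set AWS credentials for the current R session (placeholder values)
Sys.setenv(AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID",
           AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY")
# The credentials are now picked up from the environment
alexa_cat(domain = "http://www.google.com")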
alexa_cat(domain = NULL, key = NULL, secret = NULL)
domain: domain name
key: Alexa Access Key ID
secret: Alexa Secret Access Key
data.frame with 2 columns: Title and AbsolutePath
https://docs.aws.amazon.com/AlexaWebInfoService/latest/
## Not run: alexa_cat(domain = "http://www.google.com") ## End(Not run)
Returns the category of content from Brightcloud.
brightcloud_cat(domain = NULL, key = NULL, secret = NULL)
domain: domain name
key: Brightcloud API consumer key
secret: Brightcloud API consumer secret
Get the API Consumer Key and Secret from http://www.brightcloud.com/.
named list
## Not run: brightcloud_cat("http://www.google.com", key = "XXXX", secret = "XXXX") ## End(Not run)
Fetches the category (or categories) of content hosted by a domain according to DMOZ. The function checks whether a path to the DMOZ file is provided by the user. If not, it looks for dmoz_domain_category.csv in the working directory. It also returns results for prominent subdomains.
dmoz_cat(domains = NULL, use_file = NULL)
domains: vector of domain names
use_file: path to the DMOZ file, which can be downloaded using get_dmoz_data
data.frame with original list and content category of the domain
## Not run: 
dmoz_cat(domains = "http://www.google.com")
dmoz_cat(domains = c("http://www.google.com", "http://plus.google.com"))
## End(Not run)
Get the Alexa Top 1M most-visited domains list. These data can be used to weight the classification error.
get_alexa_data(outdir = ".", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
https://aws.amazon.com/marketplace/pp/B07QK2XWNV
## Not run: get_alexa_data() ## End(Not run)
Downloads, unzips, and saves an archived version of the DMOZ data. For more details, see: https://github.com/themains/rdomains/tree/master/data-raw/dmoz/
get_dmoz_data(outdir = ".", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
## Not run: get_dmoz_data() ## End(Not run)
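A minimal workflow sketch combining the two functions, assuming get_dmoz_data() writes dmoz_domain_category.csv to the working directory (the default file that dmoz_cat() looks for):

# Download and unzip the archived DMOZ data into the working directory
get_dmoz_data()
# Classify domains against the local file
dmoz_cat(domains = c("http://www.google.com", "http://plus.google.com"),
         use_file = "dmoz_domain_category.csv")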
The Shalla service has been discontinued; we downloaded the last copy (1/14/22). For more information, see the data-raw folder on GitHub. Downloads, unzips, and saves the last available version of the Shallalist data. By default, the data are saved as shalla_domain_category.csv.
get_shalla_data(outdir = "./", overwrite = FALSE)
outdir: Optional; folder to which you want to save the file; default is the current folder
overwrite: Optional; default is FALSE. If TRUE, the file is overwritten.
## Not run: get_shalla_data() ## End(Not run)
ML Model
glm_shalla
A list
Gaurav Sood
ML model based on Shallalist data, using keywords and domain suffixes.
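A minimal sketch of loading and inspecting the bundled model object; since the exact structure of the list is not documented here, str() is used to explore it:

# Load the model object shipped with the package
data("glm_shalla", package = "rdomains")
# Inspect the top level of the list
str(glm_shalla, max.level = 1)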
Based on a slightly amended version of the regular expression used to classify news and non-news in "Exposure to ideologically diverse news and opinion on Facebook" by Bakshy, Messing, and Adamic. Science. 2015.
not_news(url_list = NULL)
url_list: vector of URLs
Amendment: sport rather than sports
A URL containing any of the following words is classified as soft news: "sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget"
A URL containing any of the following words is classified as hard news: "politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration"
Note that the classification is based on patterns observed in a small set of domains. See the paper for details.
data.frame with 3 columns: url, not_news, news
https://www.science.org/doi/10.1126/science.aaa1160
## Not run: 
not_news("http://www.bbc.com/sport")
not_news(c("http://www.bbc.com/sport", "http://www.washingtontimes.com/news/politics/"))
## End(Not run)
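The classification reduces to matching the URL against the two patterns quoted above. A minimal sketch of the underlying idea (not necessarily the package's exact implementation):

soft_pattern <- "sport|entertainment|arts|fashion|style|lifestyle|leisure|celeb|movie|music|gossip|food|travel|horoscope|weather|gadget"
hard_pattern <- "politi|usnews|world|national|state|elect|vote|govern|campaign|war|polic|econ|unemploy|racis|energy|abortion|educa|healthcare|immigration"
urls <- c("http://www.bbc.com/sport",
          "http://www.washingtontimes.com/news/politics/")
# A URL matching the soft-news pattern is treated as not news;
# one matching the hard-news pattern is treated as news
data.frame(url      = urls,
           not_news = grepl(soft_pattern, urls),
           news     = grepl(hard_pattern, urls))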
Fetches the category of content hosted by a domain according to Shallalist. The function checks whether a path to the Shalla file is provided by the user. If not, it looks for shalla_domain_category.csv in the working directory.
shalla_cat(domains = NULL, use_file = NULL)
domains: vector of domain names
use_file: path to the latest Shallalist file, downloaded using get_shalla_data
data.frame with original list and content category of the domain
## Not run: shalla_cat(domains = "http://www.google.com") ## End(Not run)
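A minimal workflow sketch, assuming get_shalla_data() writes shalla_domain_category.csv to the working directory (the default file that shalla_cat() looks for):

# Download and unzip the archived Shallalist data into the working directory
get_shalla_data()
# Classify a domain against the local file
shalla_cat(domains = "http://www.google.com",
           use_file = "shalla_domain_category.csv")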
Fetches the university domain JSON from: https://raw.githubusercontent.com/Hipo/university-domains-list/master/world_universities_and_domains.json
uni_cat(domains = NULL)
domains: vector of domain names
data.frame with the original list and all the other columns from the university JSON
## Not run: uni_cat(domains = "http://www.google.com") ## End(Not run)
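For reference, a minimal sketch of fetching the same university domains JSON directly with the jsonlite package (an illustration of the data source, not necessarily how uni_cat() is implemented):

library(jsonlite)
# Read the world universities and domains list into a data.frame
unis <- fromJSON("https://raw.githubusercontent.com/Hipo/university-domains-list/master/world_universities_and_domains.json")
# Peek at a few university names
head(unis$name)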
Returns the category of content from 6 major services: BitDefender, Dr. Web, Alexa (DMOZ), Google, Websense, and Trendmicro. Not all services will have categories for all domains. When a category is not returned for a particular domain, NA is returned.
virustotal_cat(domain = NULL, apikey = NULL)
domain: domain name
apikey: Virustotal API key
Get the API Access Key from http://www.virustotal.com/. Either pass the API key to the function or set the environment variable VirustotalToken. Environment variables persist within an R session.
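A minimal sketch (with a placeholder key) of setting the environment variable once per session so that virustotal_cat() can be called without passing apikey:

# Set the Virustotal API key for the current R session (placeholder value)
Sys.setenv(VirustotalToken = "YOUR_VIRUSTOTAL_API_KEY")
# The key is now picked up from the environment
virustotal_cat("http://www.google.com")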
data.frame with 7 columns: domain, bitdefender, dr_web, alexa, google, websense, trendmicro
https://developers.virustotal.com/v2.0/reference
## Not run: virustotal_cat("http://www.google.com") ## End(Not run)