
Big data set - 3.5 billion web pages - made available for all of us

This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.

We hope that the graph will be useful for researchers who

  • develop search algorithms that rank results based on the hyperlinks between pages.
  • develop spam detection methods that identify networks of web pages published in order to trick search engines.
  • develop graph analysis algorithms and want to use the hyperlink graph to test the scalability and performance of their tools.
  • work in Web Science and want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.


1. Levels of Aggregation

We provide the hyperlink graph on four different levels of aggregation:

  • Page-Level Graph - This version of the graph contains all details, with each node representing a single web page and each arc a hyperlink between two pages.
  • Subdomain-Level Graph - This graph aggregates the page graph by subdomain. Each node in the graph represents a specific subdomain (like research.dws.uni-mannheim.de), and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains. Note that subdomains can be of arbitrary depth.
  • First-Level-Subdomain Graph - Each node represents a first-level subdomain (like dws.uni-mannheim.de), with all subdomains below it aggregated into this node.
  • Pay-Level-Domain Graph - Each node represents a pay-level domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level domains.
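
To make the aggregation idea concrete, here is a minimal Python sketch (not the code used to build the published graphs). It collapses page-level arcs into arcs between hosts and shows how a host name can be shortened to a coarser level. The helper names `host_of`, `last_labels` and `aggregate_arcs` are hypothetical, the sample URLs are invented, and dropping links that stay inside one group is an assumption, since the description above does not say how such links are handled. Finding the real pay-level domain of an arbitrary host needs a public suffix list, so the sketch simply takes the number of labels to keep as an input.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host (subdomain-level node name) of a page URL."""
    return urlsplit(url).hostname or ""


def last_labels(host, n):
    """Keep only the last n dot-separated labels of a host name.

    Assumption: the caller knows how many labels to keep. For hosts under
    uni-mannheim.de, n=2 gives the pay-level domain and n=3 the first-level
    subdomain; suffixes such as co.uk would need a public suffix list.
    """
    return ".".join(host.split(".")[-n:])


def aggregate_arcs(page_arcs, key, keep_self_loops=False):
    """Collapse page-level arcs into a set of arcs between groups.

    An aggregated arc exists if at least one page-level arc connects the
    two groups. Dropping self-loops by default is an assumption.
    """
    arcs = set()
    for source, target in page_arcs:
        a, b = key(source), key(target)
        if a != b or keep_self_loops:
            arcs.add((a, b))
    return arcs


# Invented sample data for illustration only.
page_arcs = [
    ("http://research.dws.uni-mannheim.de/a", "http://www.example.org/x"),
    ("http://research.dws.uni-mannheim.de/b", "http://www.example.org/y"),
    ("http://research.dws.uni-mannheim.de/a", "http://research.dws.uni-mannheim.de/b"),
]

host_arcs = aggregate_arcs(page_arcs, host_of)
print(host_arcs)
print(last_labels("research.dws.uni-mannheim.de", 3))
print(last_labels("research.dws.uni-mannheim.de", 2))
```

The two page-level links to www.example.org collapse into a single host-level arc, which is why the aggregated graphs are so much smaller than the page graph.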

The table below gives an overview of the size of the different graphs:

Graph                        #Nodes           #Arcs
Page Graph                   3,563 million    128,736 million
Subdomain Graph              101 million      2,043 million
1st Level Subdomain Graph    95 million       1,937 million
PLD Graph                    43 million       623 million
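
As a quick way to compare the four graphs, the short Python sketch below divides the arc count by the node count for each row of the table. The figures are copied from the table (in millions); the variable names are just illustrative, and the ratio is a plain average that says nothing about how the arcs are distributed over nodes.

```python
# (nodes, arcs) in millions, copied from the table above.
sizes_millions = {
    "Page Graph": (3563, 128736),
    "Subdomain Graph": (101, 2043),
    "1st Level Subdomain Graph": (95, 1937),
    "PLD Graph": (43, 623),
}

# Average number of arcs per node for each aggregation level.
avg_arcs_per_node = {
    name: arcs / nodes for name, (nodes, arcs) in sizes_millions.items()
}

for name, avg in avg_arcs_per_node.items():
    print(f"{name}: {avg:.1f} arcs per node")
```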

2. Data Formats and Download

We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graph in the format used by the WebGraph library and the PLD graph in the format used by Pajek. The page graphs are hosted on Amazon S3. The aggregated graphs are provided for download via a server in Mannheim, Germany.


2.1 Index/Arc Format

The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
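
The layout described above can be read with a few lines of Python. The following is a minimal sketch, assuming plain tab-separated text with one record per line; the function names and the tiny in-memory sample are invented for illustration and are not part of the published files. Whether node indexes start at 0 or 1 is also an assumption here. For the real files, which are many gigabytes, you would pass an open (and possibly decompressed) file object instead of a `StringIO`.

```python
import io
from collections import defaultdict


def read_index(lines):
    """Parse 'name<TAB>index' lines into a dict of index -> node name."""
    index_to_name = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, idx = line.split("\t")
        index_to_name[int(idx)] = name
    return index_to_name


def read_arcs(lines):
    """Yield (origin, target) index pairs from 'origin<TAB>target' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        origin, target = line.split("\t")
        yield int(origin), int(target)


def out_degrees(arcs):
    """Count outgoing arcs per origin node."""
    degrees = defaultdict(int)
    for origin, _target in arcs:
        degrees[origin] += 1
    return dict(degrees)


# Invented sample in the index/arc layout (not taken from the real data).
index_file = io.StringIO("a.example\t0\nb.example\t1\nc.example\t2\n")
arc_file = io.StringIO("0\t1\n0\t2\n1\t2\n")

names = read_index(index_file)
degrees = out_degrees(read_arcs(arc_file))

for idx, degree in sorted(degrees.items()):
    print(names[idx], degree)
```

Because `read_arcs` is a generator, the arc file is streamed line by line rather than loaded into memory, which matters for files of this size.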

The following table contains the links for downloading the graphs.

Data Set                     Index File            Arc File
Page Graph                   see below (45 GB)     see below (331 GB)
Subdomain Graph              download (832 MB)     download (9.2 GB)
1st Level Subdomain Graph    download (757 MB)     download (8.7 GB)
PLD Graph                    download (297 MB)     download (2.8 GB)

Get all the data, with instructions, at http://webdatacommons.org/hyperlinkgraph/.
