WHAD - Wikipedia Historical Attributes Data



Developed as part of the RENDER project.

Available for download at the Wikimedia Deutschland Toolserver and here.

Many Wikipedia entries contain so-called infoboxes: tabular information encoded as (attribute, value) pairs that summarize key facts about a given article. One successful use of infoboxes is distant supervision, i.e. using data obtained from infoboxes to semi-automatically annotate a dataset that can then serve as a training set for a supervised machine learning (ML) algorithm.

The values of some attributes vary over time, and for these it is useful to know the time interval during which a given value applied. Examples include the spouse relation for people, the population of a country, and the number of employees of a company. For such attributes, we believe that the edit history of a knowledge base such as Wikipedia will provide insight into which attributes change most often for different entity types, and into how their values have changed. We think that this will also have applications in distant-supervision systems that try to extract these time-varying attributes.

Infobox update extraction

To extract infobox updates, we follow an approach similar to that of Auer and Lehmann (2007), outlined as follows:

  1. Parse the MediaWiki mark-up to identify the infoboxes in every revision of each entry.
  2. Get the infobox type and all the (attribute name, attribute value) pairs contained in it.
  3. Keep some of the mark-up, such as hyperlinks to other Wikipedia entries (e.g. when the value of an attribute is the title of a different entry), together with the canonical name of the landing page. If, for instance, a link points to a redirect page, the canonical landing page is obtained by resolving the redirect.
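
The three steps above can be illustrated with a minimal sketch, which is not the code used to build WHAD. It assumes the infobox template name starts with "Infobox", and the helper names (extract_infobox, split_top_level, link_targets) and the redirects table are hypothetical. A production parser would also need to handle comments, nowiki sections and other MediaWiki constructs.

```python
import re

def split_top_level(body):
    """Split template text on '|' characters not nested in {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts

def extract_infobox(wikitext):
    """Steps 1 and 2: return (infobox_type, {attribute: value}) or None."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return None
    depth, i, end = 0, start, None
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                end = i  # position just past the matching "}}"
                break
        else:
            i += 1
    if end is None:
        return None  # unbalanced braces: give up on this revision
    parts = split_top_level(wikitext[start + 2:end - 2])
    infobox_type = parts[0].strip().replace(" ", "_")
    attributes = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:
            attributes[name.strip()] = value.strip()
    return infobox_type, attributes

# Captures the target of [[target]], [[target|label]] or [[target#section|label]].
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def link_targets(value, redirects):
    """Step 3: map each link target to its canonical page via a redirect table."""
    targets = [t.strip() for t in LINK.findall(value)]
    return [redirects.get(t, t) for t in targets]
```

For example, calling extract_infobox on the mark-up of one revision gives the infobox type and an attribute dictionary whose values still contain their link mark-up, which link_targets can then resolve against a redirect table built separately.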

For each entry and revision timestamp, we store the infobox instance extracted from that revision as tuples of the following form:

     (attribute; previous-value; current-value; timestamp)
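
As a rough illustration of how such tuples could be derived, the sketch below compares the attribute dictionaries of two consecutive revisions. The names Update and diff_infoboxes are hypothetical, and using None for an attribute that is absent from one of the revisions is an assumption, not something the dataset description specifies.

```python
from typing import NamedTuple, Optional

class Update(NamedTuple):
    attribute: str
    previous_value: Optional[str]  # None if the attribute was just added (assumption)
    current_value: Optional[str]   # None if the attribute was removed (assumption)
    timestamp: int                 # Unix time of the newer revision

def diff_infoboxes(previous, current, timestamp):
    """Return one Update per attribute whose value differs between revisions."""
    updates = []
    for attribute in sorted(set(previous) | set(current)):
        old = previous.get(attribute)
        new = current.get(attribute)
        if old != new:
            updates.append(Update(attribute, old, new, timestamp))
    return updates

older = {"GDP_PPP": "$1.744 trillion", "GDP_PPP_year": "2004"}
newer = {"GDP_PPP": "$1.816 trillion", "GDP_PPP_year": "2005"}
for update in diff_infoboxes(older, newer, 1142610920):
    print(update)
```

Attributes whose value is unchanged between the two revisions produce no tuple, so only actual updates are recorded.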
   

Dataset

We are releasing the full, up-to-date dataset of Wikipedia infobox attribute updates, WHAD2012, for further research. The dataset described in this article will be made available through Wikimedia Deutschland, which proposed to distribute it from its Wikimedia Toolserver download page, under the Creative Commons license that covers Wikipedia. It can also be downloaded here.

The output is in JSON format. The following is an example, in which lines starting with # mark omitted content and are not part of the data:

{
  # [... excerpted ...]

  "article_title": "France",
  "attribute": [ {
    "title": "France",
    "timestamp": 1129500148,
    "contributor_ip": "Golbez",
    "key": "GDP_PPP",
    "newvalue": "$1.744 trillion",
    "infobox_name": "Infobox_Country",
    "id": 25688496,
    "comment": "substing infobox"
  }, {
    "title": "France",
    "timestamp": 1129500148,
    "contributor_ip": "Golbez",
    "key": "GDP_PPP_year",
    "newvalue": "2004",
    "infobox_name": "Infobox_Country",
    "id": 25688496,
    "comment": "substing infobox"
  },

  # [... excerpted ...]

  {
    "title": "France",
    "timestamp": 1142610920,
    "contributor_ip": "MJCdetroit",
    "key": "GDP_PPP_year",
    "newvalue": "2005",
    "infobox_name": "Infobox_Country",
    "id": 44224442,
    "comment": "Reformated infobox  & updated it; also expanded the Geography section"
  }, {
    "title": "France",
    "timestamp": 1142610920,
    "contributor_ip": "MJCdetroit",
    "key": "GDP_PPP",
    "newvalue": "$1.816 trillion",
    "infobox_name": "Infobox_Country",
    "id": 44224442,
    "comment": "Reformated infobox  & updated it; also expanded the Geography section"
  }

  # [...]

}
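
Records of this shape can be turned into the time intervals during which each value applied. The sketch below is one possible way to do so, assuming each article has been loaded as a strict JSON object (without the # excerpt markers) that has an "attribute" list as in the example. The function name value_intervals is hypothetical, and the sample keeps only the fields the sketch needs.

```python
import json
from collections import defaultdict

def value_intervals(article):
    """Map each key to a list of (value, start_timestamp, end_timestamp).

    end_timestamp is None for the most recent value of a key.
    """
    by_key = defaultdict(list)
    for record in article["attribute"]:
        by_key[record["key"]].append((record["timestamp"], record["newvalue"]))
    intervals = {}
    for key, changes in by_key.items():
        changes.sort(key=lambda change: change[0])  # oldest update first
        spans = []
        for index, (start, value) in enumerate(changes):
            has_next = index + 1 < len(changes)
            end = changes[index + 1][0] if has_next else None
            spans.append((value, start, end))
        intervals[key] = spans
    return intervals

# A cut-down version of the France example above.
sample = json.loads("""
{"article_title": "France",
 "attribute": [
   {"timestamp": 1129500148, "key": "GDP_PPP", "newvalue": "$1.744 trillion"},
   {"timestamp": 1129500148, "key": "GDP_PPP_year", "newvalue": "2004"},
   {"timestamp": 1142610920, "key": "GDP_PPP_year", "newvalue": "2005"},
   {"timestamp": 1142610920, "key": "GDP_PPP", "newvalue": "$1.816 trillion"}
 ]}
""")

for key, spans in value_intervals(sample).items():
    print(key, spans)
```

The timestamps are kept as the integers found in the data. Treating the next update of the same key as the end of the previous value's interval is a simplification, since it ignores reverts and revisions in which the infobox was temporarily removed.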
   

Feedback

If you notice any technical or quality issues with the data, please contact us:

  • ggarrido at lsi.uned.es
  • ealfonseca at google.com