Many Wikipedia entries contain so-called infoboxes: tabular
information encoded as (attribute, value) pairs that summarize key
information about a given article. A successful use of infoboxes is
distant supervision, i.e. using data obtained from infoboxes to
semi-automatically annotate a dataset that can be used as a training set for
a supervised machine learning (ML) algorithm.
The value of some attributes varies over time, and for these it is
useful to know the time interval when a given value was applicable.
Some examples are the spouse relation for people, the
population attribute for a country or the number of
employees for a company. For these attributes, we believe that the
availability of the edit history of a knowledge base such as Wikipedia
will provide insight into which attributes change most often for different
entity types, and how values have changed. We think that this will
also have applications in distant-supervision systems trying to
extract these temporally-changing attributes.
Infobox update extraction
To extract infobox updates, we follow an approach similar to that of Auer
and Lehmann (2007), outlined as follows:
- Parse the MediaWiki mark-up to identify infoboxes in all the different revisions for each entry.
- Get the infobox type and all the (attribute name, attribute values) pairs contained in it.
- Keep some of the mark-up, such as hyperlinks to other entities in Wikipedia (e.g. when the value of an attribute is the title of a different entry), together with the canonical name of the landing page. If, for instance, the link points to a redirect page, the canonical landing page is obtained by resolving the redirect.
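The steps above can be sketched roughly as follows. This is a minimal illustration, not the extraction code used to build the dataset: the function names (`find_infobox`, `split_top_level`, `parse_infobox`, `link_targets`) and the `redirects` mapping are hypothetical, and the sketch ignores many cases found in real MediaWiki mark-up (comments, `nowiki` sections, triple-brace parameters, title normalization).

```python
import re


def find_infobox(wikitext):
    """Return the text of the first {{Infobox ...}} template, or None."""
    start = re.search(r"\{\{\s*Infobox", wikitext, flags=re.IGNORECASE)
    if start is None:
        return None
    depth = 0
    i = start.start()
    # Count nested {{ ... }} pairs until the opening template is closed.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start.start():i]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on '|' characters that are not inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext):
    """Return (infobox type, {attribute name: raw value}) or None."""
    template = find_infobox(wikitext)
    if template is None:
        return None
    parts = split_top_level(template[2:-2])  # drop the outer {{ and }}
    attributes = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip()] = value.strip()
    return parts[0].strip(), attributes


# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def link_targets(value, redirects):
    """Return link targets in a value, resolved through a redirect table.

    `redirects` is an assumed {redirect title: canonical title} mapping.
    """
    targets = []
    for match in LINK_RE.finditer(value):
        title = match.group(1).strip()
        targets.append(redirects.get(title, title))
    return targets
```

Applying `parse_infobox` to every revision of an entry gives one attribute dictionary per revision timestamp; the raw values keep their links, so `link_targets` can be run on them afterwards.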
For each entry and revision-timestamp we store an infobox instance
extracted for that revision, containing tuples of the following form:
(attribute; previous-value; current-value; timestamp)
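As an illustration of how such tuples could be derived, the sketch below compares the attribute dictionaries of two consecutive revisions. The function name `diff_infoboxes` is hypothetical, and representing an added or removed attribute with `None` is an assumption; the source does not say how those cases are encoded.

```python
def diff_infoboxes(previous, current, timestamp):
    """Return (attribute, previous-value, current-value, timestamp) tuples
    for every attribute whose value differs between two revisions.

    `previous` and `current` map attribute names to values; `timestamp`
    is the timestamp of the newer revision.
    """
    updates = []
    for attribute in sorted(set(previous) | set(current)):
        old = previous.get(attribute)  # None if the attribute was just added
        new = current.get(attribute)   # None if the attribute was removed
        if old != new:
            updates.append((attribute, old, new, timestamp))
    return updates
```

Unchanged attributes produce no tuple, so a revision that does not touch the infobox yields an empty list.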
Dataset
We are releasing the full, up-to-date dataset of Wikipedia infobox
attribute updates, WHAD2012, for further research. The dataset described
in this article will be made available through Wikimedia Deutschland,
which proposed to distribute it from its Wikimedia Toolserver download
page, under the Creative Commons license that covers Wikipedia.
It can also be downloaded here.
The output is in JSON format; the following is an example:
{
  # [... excerpted ...]
  "article_title": "France",
  "attribute": [ {
      "title": "France",
      "timestamp": 1129500148,
      "contributor_ip": "Golbez",
      "key": "GDP_PPP",
      "newvalue": "$1.744 trillion",
      "infobox_name": "Infobox_Country",
      "id": 25688496,
      "comment": "substing infobox"
    }, {
      "title": "France",
      "timestamp": 1129500148,
      "contributor_ip": "Golbez",
      "key": "GDP_PPP_year",
      "newvalue": "2004",
      "infobox_name": "Infobox_Country",
      "id": 25688496,
      "comment": "substing infobox"
    },
    # [... excerpted ...]
    {
      "title": "France",
      "timestamp": 1142610920,
      "contributor_ip": "MJCdetroit",
      "key": "GDP_PPP_year",
      "newvalue": "2005",
      "infobox_name": "Infobox_Country",
      "id": 44224442,
      "comment": "Reformated infobox & updated it; also expanded the Geography section"
    }, {
      "title": "France",
      "timestamp": 1142610920,
      "contributor_ip": "MJCdetroit",
      "key": "GDP_PPP",
      "newvalue": "$1.816 trillion",
      "infobox_name": "Infobox_Country",
      "id": 44224442,
      "comment": "Reformated infobox & updated it; also expanded the Geography section"
    }
    # [...]
}
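A minimal sketch of how records shaped like the example could be turned into a per-attribute history is given below. The function names `validity_intervals` and `to_utc` are hypothetical, and reading `timestamp` as Unix epoch seconds is an assumption based on the magnitude of the values shown. The excerpt markers (`# [...]`) in the example are not valid JSON, so the sketch takes already-parsed records rather than reading the excerpt itself.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert a timestamp to an aware datetime (assumes Unix epoch seconds)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def validity_intervals(records, key):
    """Return (value, start, end) triples for one attribute key.

    `records` is a list of dicts with "key", "newvalue" and "timestamp"
    fields, as in the example. `end` is the timestamp of the next edit to
    the same key, or None for the latest value. These are edit times, which
    need not coincide with when the real-world value actually changed.
    """
    changes = sorted(
        (r["timestamp"], r["newvalue"]) for r in records if r["key"] == key
    )
    intervals = []
    for index, (start, value) in enumerate(changes):
        end = changes[index + 1][0] if index + 1 < len(changes) else None
        intervals.append((value, start, end))
    return intervals


# Records reduced to the three fields used here, taken from the example.
records = [
    {"key": "GDP_PPP", "newvalue": "$1.744 trillion", "timestamp": 1129500148},
    {"key": "GDP_PPP_year", "newvalue": "2004", "timestamp": 1129500148},
    {"key": "GDP_PPP_year", "newvalue": "2005", "timestamp": 1142610920},
    {"key": "GDP_PPP", "newvalue": "$1.816 trillion", "timestamp": 1142610920},
]

for value, start, end in validity_intervals(records, "GDP_PPP"):
    print(value, to_utc(start).date(), to_utc(end).date() if end else "current")
```

Grouping by `key` in this way gives, for each attribute, the sequence of values and the edit interval during which each one was displayed in the infobox.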
Feedback
If you notice any technical or quality issue with the data, please
contact us:
- ggarrido at lsi.uned.es
- ealfonseca at google.com