User:Joal/JanusGraph
This page is my work-log of experiments with JanusGraph.
Links
WDQS
- Wikidata query service
- Wikidata query service/ScalingStrategy
- Talk:Wikidata query service/ScalingStrategy
Wikidata
Janus/Gremlin/Tinkerpop
- https://docs.janusgraph.org/
- https://tinkerpop.apache.org/docs/3.4.1/reference/#_tinkerpop_documentation
- https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin
2019-09-06 - Install and tests on Cloud VPS
I already installed JanusGraph on Cloud VPS once, but that was almost a year ago at All-Hands. Starting fresh :)
I'm using Debian 9.9 Stretch with OpenJDK 8 (JanusGraph needs Java 1.8) and JanusGraph 0.4.0 (latest as of 2019-09-06)
Install and test
- I created the janus1-1 large instance using Debian 9.9 Stretch (java 8 needed) in the cloud-VPS analytics project with Horizon
- I followed the introduction section of https://docs.janusgraph.org/, changing ElasticSearch index-backend to Lucene (single node test).
Install
ssh janus1-1.analytics.eqiad.wmflabs
sudo apt-get install unzip openjdk-8-jre
wget https://github.com/JanusGraph/janusgraph/releases/download/v0.4.0/janusgraph-0.4.0-hadoop2.zip
unzip janusgraph-0.4.0-hadoop2.zip
cd janusgraph-0.4.0-hadoop2
./bin/gremlin.sh
Test
/**********************************************
Configure and load graph
**********************************************/
// Create graph with updated configuration (Lucene instead of ES)
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')
// Load graph example
GraphOfTheGodsFactory.load(graph)
// Create graph traversal object
g = graph.traversal()
/**********************************************
Test graph traversal
**********************************************/
// Create a pointer to the Saturn node using index on name
saturn = g.V().has('name', 'saturn').next()
// Show the Saturn node pointer values ([name:[saturn], age:[10000]])
g.V(saturn).valueMap()
// Use the Saturn node pointer to find Saturn grand-child name (hercules)
g.V(saturn).in('father').in('father').values('name')
==>hercules
// Use geo index to find edges having a place property within 50km of Athens (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50)))
// Find nodes connected to the edges found by geo-index query and show their names (2 results)
g.E().has('place', geoWithin(Geoshape.circle(37.97, 23.72, 50))).
as('source').inV().as('god2').
select('source').outV().as('god1').
select('god1', 'god2').by('name')
2019-09-16 - Analyze and prepare Wikidata-truthy for loading
(started during the 2019-09-06 session)
Load dump
import org.apache.spark.sql.functions._
val dump_path = "/user/joal/wmf/data/raw/mediawiki/wikidata/truthy_ntdumps/20190904"
val df = spark.read.format("csv").
option("mode", "FAILFAST").
option("delimiter", " ").
load(dump_path).
withColumnRenamed("_c0", "origin").
withColumnRenamed("_c1", "link").
withColumnRenamed("_c2", "dest").
drop("_c3").
cache()
df.count()
// 4139056936 - Wow!!!
df.where("origin is null or link is null or dest is null").count()
// 0 - \o/ well-formed data
df.select("origin").distinct().count()
// 124151595
df.select("dest").distinct().count()
// 685067856
df.select("link").distinct().count()
// 6516
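To make the column mapping explicit, here is a minimal sketch in plain Scala (no Spark) of how one line of the N-Triples dump maps onto the origin, link and dest columns. Triple and parseTriple are hypothetical names used for illustration only, the sample line is an illustrative statement rather than one copied from the dump, and this is not a description of how the CSV reader works internally.

```scala
// Hypothetical helper, not part of the work-log: splits one N-Triples line
// into (origin, link, dest), mirroring the three columns kept above.
final case class Triple(origin: String, link: String, dest: String)

def parseTriple(line: String): Option[Triple] = {
  // Subject and predicate never contain spaces; the object may (literals),
  // so split into at most three parts.
  line.trim.split(" ", 3) match {
    case Array(origin, link, rest) if rest.endsWith(".") =>
      // Drop the terminating "." of the statement and the space before it.
      Some(Triple(origin, link, rest.dropRight(1).trim))
    case _ => None
  }
}

// Illustrative statement (assumption, not copied from the dump).
val sample = "<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> ."
val parsed = parseTriple(sample)
```

The sketch is written for the Scala REPL or spark-shell; the trailing "." plays the role of the dropped _c3 column.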
Analyze and filter links
// Check http://www.wikidata.org/prop links
df.where("link like '<http://www.wikidata.org/prop%'").select("link").distinct.count
// 6486
df.where("link like '<http://www.wikidata.org/prop/direct/%'").select("link").distinct.count
// 6351
df.where("link like '<http://www.wikidata.org/prop/direct-normalized/%'").select("link").distinct.count
// 135 -- 6351 + 135 = 6486, so all prop links are direct or direct-normalized - GOOD :)
// Check other link types and evaluate whether to keep them or not
df.where("link not like '<http://www.wikidata.org/prop%'").groupBy("link").count.sort(desc("count")).show(100, false)
/*
+----------------------------------------------------+----------+
|link |count |
+----------------------------------------------------+----------+
** To Keep (in addition to direct and direct-normalized links):
|<http://www.w3.org/2002/07/owl#sameAs> |2464024 |
** To remove:
** We drop all language related classes
|<http://schema.org/name> |322876582 |
|<http://schema.org/description> |2014877520|
|<http://www.w3.org/2004/02/skos/core#prefLabel> |322876582 |
|<http://www.w3.org/2000/01/rdf-schema#label> |322876582 |
|<http://www.w3.org/2004/02/skos/core#altLabel> |67929447 |
** We drop metadata
|<http://schema.org/dateModified> |62033634 |
|<http://schema.org/version> |62033306 |
|<http://schema.org/about> |62033306 |
** We drop redundant info (this is described as link-property)
|<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> |121788917 |
// Origin is PXXX and dest is a derivative of PXXX without other usage (origin or dest)
|<http://wikiba.se/ontology#qualifier> |6595 |
|<http://www.w3.org/2002/07/owl#someValuesFrom> |6595 |
|<http://wikiba.se/ontology#claim> |6595 |
|<http://wikiba.se/ontology#statementProperty> |6595 |
|<http://www.w3.org/2002/07/owl#onProperty> |6595 |
|<http://wikiba.se/ontology#referenceValue> |6595 |
|<http://wikiba.se/ontology#reference> |6595 |
|<http://wikiba.se/ontology#directClaim> |6595 |
|<http://wikiba.se/ontology#statementValue> |6595 |
|<http://wikiba.se/ontology#qualifierValue> |6595 |
|<http://wikiba.se/ontology#directClaimNormalized> |4758 |
|<http://wikiba.se/ontology#referenceValueNormalized>|4758 |
|<http://wikiba.se/ontology#statementValueNormalized>|4758 |
|<http://wikiba.se/ontology#qualifierValueNormalized>|4758 |
** Used for dump info only (many identical rows ... weird)
|<http://www.w3.org/2002/07/owl#imports> |328 |
|<http://schema.org/softwareVersion> |328 |
|<http://creativecommons.org/ns#license> |328 |
// Interesting for value interpretation (kept in own dataset)
|<http://wikiba.se/ontology#propertyType> |6595 |
// Values from link is also used as origin -- Seems not used in truthy
|<http://wikiba.se/ontology#novalue> |6595 |
// Used with previous -^
|<http://www.w3.org/2002/07/owl#complementOf> |6595 |
+----------------------------------------------------+----------+
Checking code samples (to be updated for each link type and format):
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").show(20, false)
df.where("link = '<http://www.w3.org/2002/07/owl#someValuesFrom>'").selectExpr("split(origin, '/')[4] as o", "split(dest, '/')[4] as d").where("o <> d").count
df.where("""
origin like '<http://www.wikidata.org/prop/P%'
AND link != '<http://www.w3.org/2002/07/owl#someValuesFrom>'""").show(20, false)
*/
val fdf = df.where("""
-- Dropping descriptions, labels, versions...
link NOT IN (
'<http://schema.org/name>',
'<http://schema.org/description>',
'<http://www.w3.org/2004/02/skos/core#prefLabel>',
'<http://www.w3.org/2000/01/rdf-schema#label>',
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
'<http://www.w3.org/2004/02/skos/core#altLabel>',
'<http://schema.org/dateModified>',
'<http://schema.org/about>',
'<http://schema.org/version>',
'<http://wikiba.se/ontology#claim>',
'<http://wikiba.se/ontology#statementProperty>',
'<http://wikiba.se/ontology#qualifier>',
'<http://wikiba.se/ontology#directClaim>',
'<http://wikiba.se/ontology#statementValue>',
'<http://wikiba.se/ontology#qualifierValue>',
'<http://wikiba.se/ontology#reference>',
'<http://www.w3.org/2002/07/owl#onProperty>',
'<http://wikiba.se/ontology#referenceValue>',
'<http://wikiba.se/ontology#statementValueNormalized>',
'<http://wikiba.se/ontology#referenceValueNormalized>',
'<http://wikiba.se/ontology#directClaimNormalized>',
'<http://wikiba.se/ontology#qualifierValueNormalized>',
'<http://www.w3.org/2002/07/owl#someValuesFrom>',
'<http://wikiba.se/ontology#novalue>',
'<http://www.w3.org/2002/07/owl#complementOf>',
'<http://wikiba.se/ontology#propertyType>'
) AND origin != '<http://wikiba.se/ontology#Dump>'
""").cache()
fdf.count()
// 825663549 -- Better - Need some naming effort
// checking and defining property types
df.where("link = '<http://wikiba.se/ontology#propertyType>' and origin not like '<http://www.wikidata.org/entity/P%'").count
// 0
val propertyTypes = df.
where("link = '<http://wikiba.se/ontology#propertyType>'").
selectExpr("replace(split(origin, '/')[4], '>', '') AS property", "replace(split(dest, '#')[1], '>', '') as propertyType").
cache()
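For reference, a small sketch in plain Scala of what the two selectExpr expressions above do to a single propertyType row. entityId and fragment are hypothetical helper names, and the P31 row is an illustrative example rather than output copied from the dataset.

```scala
// Hypothetical helpers mirroring the SQL expressions above in plain Scala.

// replace(split(uri, '/')[4], '>', ''): keep the last path segment of an entity URI.
def entityId(uri: String): String = uri.split("/")(4).replace(">", "")

// replace(split(uri, '#')[1], '>', ''): keep the fragment of an ontology URI.
def fragment(uri: String): String = uri.split("#")(1).replace(">", "")

// Illustrative row (assumption): origin and dest of a propertyType statement.
val prop = entityId("<http://www.wikidata.org/entity/P31>")
val propType = fragment("<http://wikiba.se/ontology#WikibaseItem>")
```

Index 4 works because splitting on "/" yields "<http:", "", "www.wikidata.org", "entity" and then the id with its closing ">".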
Rename values for simplicity and size, and pivot some data (names and property-types)
// Check origin values
fdf.where("origin not like '<http://www.wikidata.org/entity/%'").count
// 0 - We have the scheme :)
val fdfr1 = fdf.selectExpr(
"replace(split(origin, '/')[4], '>', '') AS origin",
"""CASE
-- Dropping difference between direct and direct-normalized (only used for ExternalId)
WHEN link like '<http://www.wikidata.org/prop/%' THEN replace(split(link, '/')[5], '>', '')
WHEN link = '<http://www.w3.org/2002/07/owl#sameAs>' THEN 'SameAs'
ELSE link
END as link""",
"""CASE
WHEN link = '<http://schema.org/name>' THEN replace(dest, '@en', '')
ELSE dest
END as dest"""
).cache()
val fdfj1 = fdfr1.join(propertyTypes, col("link") === col("property"), "left").drop("property").cache
fdfj1.groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+----------------+---------+
|propertyType |count |
+----------------+---------+
|WikibaseItem |387850621|
|String |169523450|
|ExternalId |135193147|
|Time |32810131 |
|Monolingualtext |27807321 |
|Quantity |9643796 |
|GlobeCoordinate |7603827 |
|CommonsMedia |3651002 |
|Url |3045010 |
|null |2464024 |
|WikibaseProperty|24410 |
|Math |4105 |
|GeoShape |2844 |
|WikibaseLexeme |1299 |
|MusicalNotation |291 |
|TabularData |16 |
|WikibaseSense |13 |
|WikibaseForm |2 |
+----------------+---------+
*/
// Looking for dest renaming scheme
fdfj1.where("propertyType is null").select("link").distinct.show(20, false)
/*
+------+
|link |
+------+
|SameAs|
+------+
*/
fdfj1.where("propertyType = 'ExternalId'").show(20, false)
fdfj1.where("link = 'SameAs' and dest not like '<http://www.wikidata.org/entity/Q%'").count
// 0
fdfj1.where("""propertyType = 'WikibaseProperty'
AND dest not like '<http://www.wikidata.org/entity/P%'
AND dest not like '_:genid%'""").count
// 0
fdfj1.where("""propertyType = 'WikibaseLexeme'
AND dest not like '<http://www.wikidata.org/entity/L%'
""").count
// 0
fdfj1.where("""propertyType = 'WikibaseSense'
AND dest not like '<http://www.wikidata.org/entity/L%'
""").count
// 0
fdfj1.where("""propertyType = 'WikibaseForm'
AND dest not like '<http://www.wikidata.org/entity/L%'
""").count
fdfj1.where("dest like '%^^<%'").groupBy("propertyType").count().sort(desc("count")).show(100, false)
/*
+-------------------------------------------+--------+
|linkPropType |count |
+-------------------------------------------+--------+
|<http://wikiba.se/ontology#Time> |32779728|
|<http://wikiba.se/ontology#Quantity> |9643081 |
|<http://wikiba.se/ontology#GlobeCoordinate>|7602994 |
|<http://wikiba.se/ontology#Math> |4105 |
+-------------------------------------------+--------+
*/
fdfj1.where("dest like '%^^<%'").selectExpr("split(replace(dest, '^^', ';;'), ';;')[1] as typ", "propertyType").groupBy("typ", "propertyType").count().sort(desc("count")).show(100, false)
/*
+-------------------------------------------------+---------------+--------+
|typ |propertyType |count |
+-------------------------------------------------+---------------+--------+
|<http://www.w3.org/2001/XMLSchema#dateTime> |Time |32779728|
|<http://www.w3.org/2001/XMLSchema#decimal> |Quantity |9643081 |
|<http://www.opengis.net/ont/geosparql#wktLiteral>|GlobeCoordinate|7602994 |
|<http://www.w3.org/1998/Math/MathML> |Math |4105 |
+-------------------------------------------------+---------------+--------+
We can get rid of the inner-value type :)
*/
fdfj1.where("""propertyType = 'WikibaseItem'
AND dest not like '<http://www.wikidata.org/entity/Q%'
AND dest not like '_:genid%'""").count
// 0 - \o/ only origin values :)
// Renaming values in 2 stages to remove double-quotes
val fdfr2 = fdfj1.selectExpr(
"origin",
"link",
"""CASE
WHEN link = 'SameAs' THEN replace(split(dest, '/')[4], '>', '')
WHEN propertyType = 'WikibaseItem' AND dest like '<http://www.wikidata.org/entity/Q%'
THEN replace(split(dest, '/')[4], '>', '')
WHEN propertyType = 'WikibaseProperty' AND dest like '<http://www.wikidata.org/entity/P%'
THEN replace(split(dest, '/')[4], '>', '')
WHEN propertyType IN ('WikibaseLexeme', 'WikibaseSense', 'WikibaseForm')
THEN replace(split(dest, '/')[4], '>', '')
WHEN propertyType IN ('Time', 'Quantity', 'GlobeCoordinate', 'Math') and dest like '%^^<%' THEN split(replace(dest, '^^', ';;'), ';;')[0]
ELSE dest
END as dest""").cache()
val fdfr3 = fdfr2.selectExpr(
"origin",
"link",
"""CASE
WHEN dest rlike '^"[^"]*"$' THEN replace(dest, '"', '')
ELSE dest
END as dest""").cache()
fdfr3.repartition(8).write.mode("overwrite").option("compression", "gzip").json("/user/joal/test_wdqs/truthy_20190916")
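The two-stage cleaning of dest can be sketched on single values in plain Scala. This is a simplified illustration under stated assumptions: stripDatatype, stripQuotes and cleanDest are hypothetical names, the propertyType conditions of the first stage are ignored, and the sample values are illustrative rather than taken from the dump.

```scala
// Stage 1 (as in fdfr2): drop the "^^<datatype>" suffix of typed literals.
def stripDatatype(dest: String): String =
  if (dest.contains("^^<")) dest.replace("^^", ";;").split(";;")(0) else dest

// Stage 2 (as in fdfr3): drop double quotes when the whole value is one quoted string.
def stripQuotes(dest: String): String =
  if (dest.matches("^\"[^\"]*\"$")) dest.replace("\"", "") else dest

// Both stages in order; the quote regex only matches once the suffix is gone.
def cleanDest(dest: String): String = stripQuotes(stripDatatype(dest))

// Illustrative typed literal (assumption, not copied from the dump).
val cleaned = cleanDest("\"2001-05-11T00:00:00Z\"^^<http://www.w3.org/2001/XMLSchema#dateTime>")
```

One thing the sketch makes visible: a language-tagged value such as "Douglas Adams"@en does not match the quote regex, so it keeps its quotes and language tag.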