Data Platform/Systems/Hive/Avro
Appearance
This describes how to do evolution of avro schemas in hive.
Intro
Despite of what you might read here: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe we have found that avro in hive has several gotchas. Thus we are documenting what has worked for us.
Create tables with avro schema
We have found that easies is to use the avro.schema.literal. property.
Example of how to create a table (if schema is too large you might need to remove all whitespace)
CREATE TABLE nuria_avro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='
{
"type": "record",
"namespace": "org.wikimedia.analytics.schemas",
"name": "testTable",
"fields": [
{"name": "userName", "type": "string", "default":""},
{"name": "favouriteNumber", "type": "int", "default":1},
{"name": "interests", "type": {"type": "array", "items": "string"}, "default":[]}
]
}');
Update schema
Updating schema requires updating the TBLPROPERTY. Please note that optional fields in avro have to be defined a certain way using unions, see specifications for details: [1]
ALTER TABLE nuria_avro SET TBLPROPERTIES ('avro.schema.literal'='
{
"type": "record",
"namespace": "org.wikimedia.analytics.schemas",
"name": "testTable",
"fields": [
{"name": "userName", "type": "string", "default":""},
{"name":"lastName", "type":["string", "null"], "default":""},
{"name": "favouriteNumber", "type": "int", "default":1},
{"name": "interests", "type": {"type": "array", "items": "string"}, "default":[]}
]
}');
External Tables
When interacting with data that is not created within hive, an external table needs to be created to point at that data.
CREATE EXTERNAL TABLE `avro_test`
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/mediawiki_testTable/hourly'
TBLPROPERTIES (
'avro.schema.literal'='{
"type": "record",
"namespace": "org.wikimedia.analytics.schemas",
"name": "testTable",
"fields": [
{"name": "userName", "type": "string", "default":""},
{"name":"lastName", "type":["string", "null"], "default":""},
{"name": "favouriteNumber", "type": "int", "default":1}
]
}'
)