Data Science Meetup
CCRi was delighted to host the second meeting of the Cville Data Science group earlier this month. A full house packed our conference room, and a good time was had by all. The lineup for the talks included three CCRi speakers:
- Jim Hughes – GeoMesa – Spatio-Temporal Indexing in Accumulo
- Andrew Cassidy – Learning From Data Streams Using Online Random Forest
- Nick Hamblet – Semantic Vector Space Embeddings
Slides for these talks should be available on SlideShare, but while you’re here, here’s a quick summary:
GeoMesa – Spatio-Temporal Indexing in Accumulo
Jim talked about CCRi’s open-source GeoMesa project, which aims to be for Accumulo what PostGIS is for PostgreSQL. By sharding data across tablet servers and exploiting space-filling curves, we are able to query 5 billion points every second to show near real-time flight data in a web app. Because GeoMesa integrates easily with the GeoServer stack, setting up the web app is as simple as pointing it at a GeoServer instance.
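To give a feel for why space-filling curves help here, below is a minimal Python sketch of one such curve, the Z-order (Morton) curve. This summary does not say which curve or key layout GeoMesa actually uses, so the choice of curve, the function names, and the 16-bit resolution are all assumptions made for illustration.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_index(lon: float, lat: float, bits: int = 16) -> int:
    """Map a longitude/latitude pair to a single sortable integer.

    Hypothetical helper: scales each coordinate onto a grid of
    2**bits cells per axis, then interleaves the two cell numbers.
    """
    max_cell = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * max_cell)
    y = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(x, y, bits)


if __name__ == "__main__":
    # Two points in Charlottesville and one far away.
    print(z_index(-78.48, 38.03))
    print(z_index(-78.47, 38.04))
    print(z_index(139.69, 35.68))
```

Because the two coordinates are folded into one number, points that are close on the map tend to get values that share high-order bits, so a key-sorted store can answer many spatial queries with a handful of range scans.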
Learning From Data Streams Using Online Random Forest
At CCRi, we have written our own online random forest implementation in Scala, using Akka actors so that the computations can be efficiently distributed. Andrew presented an investigation into detecting feature importance with these models. He pointed out how our online forests can detect changes in the underlying data distributions over time, and how we can visualize the associated change in feature importance. This is especially important for our spatio-temporal incident prediction models, where the forces that lead to events are dynamic.
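The idea of feature importance shifting as a stream drifts can be illustrated with a much simpler stand-in. The Python sketch below is not an online random forest and is not CCRi's implementation: it swaps in a sliding-window absolute Pearson correlation as a crude importance score, and the class name and window size are hypothetical.

```python
import random
from collections import deque
from statistics import mean


def _abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / (vx * vy) ** 0.5)


class WindowedImportance:
    """Scores each feature against the label over the most recent examples."""

    def __init__(self, n_features, window=50):
        self.n_features = n_features
        self.rows = deque(maxlen=window)  # old examples fall off automatically

    def update(self, features, label):
        self.rows.append((tuple(features), label))

    def importance(self):
        if len(self.rows) < 2:
            return [0.0] * self.n_features
        ys = [y for _, y in self.rows]
        return [
            _abs_corr([x[j] for x, _ in self.rows], ys)
            for j in range(self.n_features)
        ]


if __name__ == "__main__":
    rng = random.Random(0)
    tracker = WindowedImportance(n_features=2)
    for step in range(200):
        f = (rng.random(), rng.random())
        # Synthetic drift: the label follows feature 0, then feature 1.
        label = f[0] if step < 100 else f[1]
        tracker.update(f, label)
        if step in (99, 199):
            print(step, tracker.importance())
```

Plotting the scores over time would show the importance moving from the first feature to the second once the synthetic drift kicks in, which is the kind of change the talk described visualizing.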
Semantic Vector Space Embeddings
Finally, Nick presented a literature review of some recent research along the common thread of embedding models. The basic idea is that entities of various types, be they words, sentences, pictures, or semantic triples, can be embedded into a vector space. Relations between the entities can then be detected as transformations of that space, and similarity is captured by proximity. Additionally, these models can be merged so that pictures and words, for example, are placed in a common space, and unlabeled images can be automatically tagged with impressive accuracy, even with words for which there were no training images!
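Here is a toy Python sketch of those two ideas, proximity as similarity and a relation as a transformation (in this case a simple translation). The two-dimensional vectors are hand-picked for illustration rather than learned by any model, and the function names are hypothetical.

```python
import math

# Made-up 2-D "embeddings": the first axis loosely encodes royalty,
# the second gender. Real models learn hundreds of dimensions from data.
EMBEDDINGS = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (-2.0, 0.5),
}


def nearest(vec, exclude=()):
    """Return the word whose embedding is closest (Euclidean) to vec."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], vec))


def analogy(a, b, c):
    """Answer "a is to b as c is to ?" by reusing the offset b - a."""
    va, vb, vc = EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c]
    target = tuple(ci + (bi - ai) for ai, bi, ci in zip(va, vb, vc))
    return nearest(target, exclude={a, b, c})


if __name__ == "__main__":
    print(nearest(EMBEDDINGS["king"], exclude={"king"}))
    print(analogy("man", "woman", "king"))
```

Placing images and words in one shared space works along the same lines: an image's vector is compared against word vectors, and the closest words become candidate tags.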