Overview

Accumulo can generate summary statistics about the data in a table using user-defined functions. Currently, these statistics are only generated for data written to files. Data recently written to Accumulo that is still in memory does not contribute to summary statistics.

This feature can be used to inform a user about what data is in their table. Summary statistics can also be used by compaction strategies to make decisions about which files to compact.

Summary data is stored in each file Accumulo produces. Accumulo can gather summary information from across a cluster, merging it along the way. For this to be fast, the summary information should fit in cache. Each tserver has a dedicated cache for summary data with a configurable size, so summary data should be kept small enough to fit in it.
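The merging step can be pictured with a small model. This is a minimal sketch in Python, not Accumulo's implementation: it assumes each file's summary is a map of statistic name to count and that merging means summing values key by key, which suits counting-style statistics. The function name merge_summaries and the two sample files are hypothetical.

```python
from collections import Counter

def merge_summaries(per_file_summaries):
    """Combine per-file counter maps into one summary by summing values per key."""
    merged = Counter()
    for summary in per_file_summaries:
        merged.update(summary)  # adds each file's counts to the running totals
    return dict(merged)

# Hypothetical summaries read from two files of the same table.
file_a = {"c:PI": 1, "c:PI&GEO": 2, "seen": 3}
file_b = {"c:PI": 4, "c:PI&TIME": 3, "seen": 7}

table_summary = merge_summaries([file_a, file_b])
print(table_summary)
```

Because the merged result is only as large as the set of distinct keys, keeping the number of keys per file small keeps both the cached data and the merged result small.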

For information on writing a custom summarizer see the javadoc of the Summarizer class. The package org.apache.accumulo.core.client.summary.summarizers contains summarizer implementations that ship with Accumulo and can be configured for use.

Inaccuracies

Summary data can be inaccurate when files are missing summary data or when files have extra summary data. Files can contain data outside of a tablet's boundaries. This can happen as a result of bulk imported files and tablet splits. When this happens, those files could contain extra summary information. Accumulo partially offsets this by storing summary information for multiple row ranges per file. However, the ranges are not granular enough to completely offset extra data.
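To see why coarse row ranges still leave extra data, consider a simplified model. The sketch below is an assumption for illustration and not how Accumulo stores or reads summaries: a file keeps one entry count per row range, a tablet covers rows in (tablet_start, tablet_end], and any stored range that overlaps the tablet contributes its whole count. The function name and the sample ranges are hypothetical.

```python
def summarize_for_tablet(chunks, tablet_start, tablet_end):
    """Sum the counts of stored row ranges that overlap a tablet.

    chunks: list of (first_row, last_row, entry_count), both rows inclusive.
    The tablet covers rows in (tablet_start, tablet_end].
    Returns (count, has_extra); has_extra is True when a contributing
    range reaches outside the tablet, so the count may be too high.
    """
    total = 0
    has_extra = False
    for first_row, last_row, count in chunks:
        overlaps = last_row > tablet_start and first_row <= tablet_end
        if not overlaps:
            continue
        total += count
        # A range that starts at or before the tablet's exclusive start row,
        # or ends after its end row, includes rows outside the tablet.
        if first_row <= tablet_start or last_row > tablet_end:
            has_extra = True
    return total, has_extra

# Hypothetical file with three stored row ranges of 100 entries each.
chunks = [("a", "f", 100), ("g", "m", 100), ("n", "z", 100)]

aligned = summarize_for_tablet(chunks, "f", "m")     # tablet matches one range
misaligned = summarize_for_tablet(chunks, "h", "z")  # tablet cuts through a range
print(aligned, misaligned)
```

In this model the misaligned tablet is charged for all 100 entries of the middle range even though only part of that range belongs to it, which is the kind of overcount the extra figure reports.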

Any source of inaccuracy is reported when summary information is requested. In the shell examples below, this can be seen on the File Statistics line. For files missing summary information, the compact command in the shell has a --sf-no-summary option. This option compacts files that do not have the summary information configured for the table. The compact command also has a --sf-extra-summary option, which compacts files with extra summary information.

Configuring

The following tablet server and table properties configure summarization.

Permissions

Because summary data may be derived from sensitive data, requesting summary data requires a special permission. Users must have the table permission GET_SUMMARIES in order to retrieve summary data.

Bulk import

When generating RFiles to bulk import into Accumulo, those RFiles can contain summary data. To use this feature, look at the javadoc of summarizers() in the configure() method of AccumuloFileOutputFormat. Also, the RFile class has options for creating RFiles with embedded summary data.

Examples

This example walks through using summarizers in the Accumulo shell. Below, a table is created and some data is inserted to summarize.

  root@uno> createtable summary_test
  root@uno summary_test> setauths -u root -s PI,GEO,TIME
  root@uno summary_test> insert 3b503bd name last Doe
  root@uno summary_test> insert 3b503bd name first John
  root@uno summary_test> insert 3b503bd contact address "123 Park Ave, NY, NY" -l PI&GEO
  root@uno summary_test> insert 3b503bd date birth "1/11/1942" -l PI&TIME
  root@uno summary_test> insert 3b503bd date married "5/11/1962" -l PI&TIME
  root@uno summary_test> insert 3b503bd contact home_phone 1-123-456-7890 -l PI
  root@uno summary_test> insert d5d18dd contact address "50 Lake Shore Dr, Chicago, IL" -l PI&GEO
  root@uno summary_test> insert d5d18dd name first Jane
  root@uno summary_test> insert d5d18dd name last Doe
  root@uno summary_test> insert d5d18dd date birth 8/15/1969 -l PI&TIME
  root@uno summary_test> scan -s PI,GEO,TIME
  3b503bd contact:address [PI&GEO] 123 Park Ave, NY, NY
  3b503bd contact:home_phone [PI] 1-123-456-7890
  3b503bd date:birth [PI&TIME] 1/11/1942
  3b503bd date:married [PI&TIME] 5/11/1962
  3b503bd name:first [] John
  3b503bd name:last [] Doe
  d5d18dd contact:address [PI&GEO] 50 Lake Shore Dr, Chicago, IL
  d5d18dd date:birth [PI&TIME] 8/15/1969
  d5d18dd name:first [] Jane
  d5d18dd name:last [] Doe

After inserting the data, summaries are requested below. No summaries are returned.

  root@uno summary_test> summaries

The visibility summarizer is configured below and the table is flushed. Flushing the table creates a file, generating summary data in the process. The summary data returned counts how many times each column visibility occurred. The statistics with a c: prefix are visibilities. The others are generic statistics created by the CountingSummarizer that VisibilitySummarizer extends.

  root@uno summary_test> config -t summary_test -s table.summarizer.vis=org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer
  root@uno summary_test> summaries
  root@uno summary_test> flush -w
  2017-02-24 19:54:46,090 [shell.Shell] INFO : Flush of table summary_test completed.
  root@uno summary_test> summaries
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {}
  File Statistics : [total:1, missing:0, extra:0, large:0]
  Summary Statistics :
  c: = 4
  c:PI = 1
  c:PI&GEO = 2
  c:PI&TIME = 3
  emitted = 10
  seen = 10
  tooLong = 0
  tooMany = 0
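
The counts in this output can be reproduced by hand. The following is a minimal sketch in Python that models the counting only; it is not the VisibilitySummarizer code, and it treats seen as the number of entries and emitted as the number of counts produced, which is an assumption based on the output above.

```python
from collections import Counter

# Column visibilities of the ten entries inserted above, in scan order.
# An empty string stands for an entry with no visibility.
visibilities = ["PI&GEO", "PI", "PI&TIME", "PI&TIME", "", "",
                "PI&GEO", "PI&TIME", "", ""]

# Count each visibility under a "c:" prefix, as in the shell output.
counts = Counter("c:" + vis for vis in visibilities)

stats = dict(counts)
stats["seen"] = len(visibilities)        # entries examined
stats["emitted"] = sum(counts.values())  # counts produced from those entries

for name in sorted(stats):
    print(name, "=", stats[name])
```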

VisibilitySummarizer has an option maxCounters that determines the maximum number of column visibilities it will track. Below, this option is set and a compaction is forced to regenerate summary data. The new summary data only has three visibilities and the tooMany statistic is now 4. This is the number of visibility occurrences that were not counted, in this case the four entries with an empty visibility.

  root@uno summary_test> config -t summary_test -s table.summarizer.vis.opt.maxCounters=3
  root@uno summary_test> compact -w
  2017-02-24 19:54:46,267 [shell.Shell] INFO : Compacting table ...
  2017-02-24 19:54:47,127 [shell.Shell] INFO : Compaction of table summary_test completed for given range
  root@uno summary_test> summaries
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
  File Statistics : [total:1, missing:0, extra:0, large:0]
  Summary Statistics :
  c:PI = 1
  c:PI&GEO = 2
  c:PI&TIME = 3
  emitted = 10
  seen = 10
  tooLong = 0
  tooMany = 4
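
The effect of maxCounters can be modeled with a bounded counter. This is a sketch under an assumption, not the CountingSummarizer source: once max_counters distinct keys are being tracked, every occurrence of any further key only increments tooMany. The function name count_with_limit is hypothetical.

```python
def count_with_limit(keys, max_counters):
    """Count occurrences per key, tracking at most max_counters distinct keys.

    Returns (counters, too_many), where too_many is the number of
    occurrences skipped because the counter table was already full.
    """
    counters = {}
    too_many = 0
    for key in keys:
        if key in counters:
            counters[key] += 1
        elif len(counters) < max_counters:
            counters[key] = 1
        else:
            too_many += 1  # no room for a new key, so this occurrence is dropped
    return counters, too_many

# Column visibilities of the ten entries in the table, in sorted key order.
visibilities = ["PI&GEO", "PI", "PI&TIME", "PI&TIME", "", "",
                "PI&GEO", "PI&TIME", "", ""]

counters, too_many = count_with_limit(visibilities, max_counters=3)
print(counters, too_many)
```

In this model the result depends on the order in which keys are encountered: the three non-empty visibilities appear first in sorted order, so the empty visibility is the one left out and all four of its occurrences land in tooMany.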

Another summarizer is configured below that tracks the number of deletes. Also, a compaction strategy that uses this summary data is configured. The TooManyDeletesCompactionStrategy will force a compaction of the tablet when the ratio of deletes to non-deletes is over 25%. This threshold is configurable. Below, a delete is added and it is reflected in the statistics. In this case there is 1 delete and 10 non-deletes, a ratio of 10%, which is not enough to force a compaction of the tablet.

  root@uno summary_test> config -t summary_test -s table.summarizer.del=org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer
  root@uno summary_test> compact -w
  2017-02-24 19:54:47,282 [shell.Shell] INFO : Compacting table ...
  2017-02-24 19:54:49,236 [shell.Shell] INFO : Compaction of table summary_test completed for given range
  root@uno summary_test> config -t summary_test -s table.compaction.major.ratio=10
  root@uno summary_test> config -t summary_test -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.TooManyDeletesCompactionStrategy
  root@uno summary_test> deletemany -r d5d18dd -c date -f
  [DELETED] d5d18dd date:birth [PI&TIME]
  root@uno summary_test> flush -w
  2017-02-24 19:54:49,686 [shell.Shell] INFO : Flush of table summary_test completed.
  root@uno summary_test> summaries
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
  File Statistics : [total:2, missing:0, extra:0, large:0]
  Summary Statistics :
  c:PI = 1
  c:PI&GEO = 2
  c:PI&TIME = 4
  emitted = 11
  seen = 11
  tooLong = 0
  tooMany = 4
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
  File Statistics : [total:2, missing:0, extra:0, large:0]
  Summary Statistics :
  deletes = 1
  total = 11
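
The threshold check can be worked through with the deletes and total statistics. The sketch below is a simplified model in Python, not the TooManyDeletesCompactionStrategy code: it assumes non-deletes are total minus deletes and compares the ratio with the 25% default. The function name too_many_deletes and the handling of a table containing only deletes are assumptions.

```python
def too_many_deletes(deletes, total, threshold=0.25):
    """Return True when the ratio of deletes to non-deletes exceeds threshold."""
    non_deletes = total - deletes
    if non_deletes == 0:
        # Assumption: with nothing but deletes, any delete counts as too many.
        return deletes > 0
    return deletes / non_deletes > threshold

# Statistics from the output above: deletes = 1, total = 11.
print(too_many_deletes(1, 11))   # ratio is 1/10 = 0.10

# With 4 deletes among 14 entries the ratio becomes 4/10 = 0.40.
print(too_many_deletes(4, 14))
```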

Some more deletes are added and the table is flushed below. This results in 4 deletes and 10 non-deletes, a ratio of 40%, which triggers a full compaction. A full compaction of all files is the only time when delete markers are dropped. The compaction ratio was set to 10 above to show that the number of files did not trigger the compaction. After the compaction there are no deletes and 6 non-deletes.

  root@uno summary_test> deletemany -r d5d18dd -f
  [DELETED] d5d18dd contact:address [PI&GEO]
  [DELETED] d5d18dd name:first []
  [DELETED] d5d18dd name:last []
  root@uno summary_test> flush -w
  2017-02-24 19:54:52,800 [shell.Shell] INFO : Flush of table summary_test completed.
  root@uno summary_test> summaries
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3}
  File Statistics : [total:1, missing:0, extra:0, large:0]
  Summary Statistics :
  c:PI = 1
  c:PI&GEO = 1
  c:PI&TIME = 2
  emitted = 6
  seen = 6
  tooLong = 0
  tooMany = 2
  Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {}
  File Statistics : [total:1, missing:0, extra:0, large:0]
  Summary Statistics :
  deletes = 0
  total = 6
  root@uno summary_test>
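
The drop from 10 entries to 6 follows from what a full compaction removes. The sketch below is a simplified model in Python that ignores timestamps and visibilities: it assumes a delete marker suppresses the entry with the same row and column, and that both the entry and the marker disappear in a full compaction. The function name full_compaction is hypothetical.

```python
def full_compaction(entries, delete_markers):
    """Return the entries that survive a full compaction.

    entries: list of (row, column) keys.
    delete_markers: set of (row, column) keys that were deleted.
    Deleted entries are dropped, and the markers are not carried over.
    """
    return [key for key in entries if key not in delete_markers]

# The ten entries from the example table.
entries = [
    ("3b503bd", "contact:address"), ("3b503bd", "contact:home_phone"),
    ("3b503bd", "date:birth"), ("3b503bd", "date:married"),
    ("3b503bd", "name:first"), ("3b503bd", "name:last"),
    ("d5d18dd", "contact:address"), ("d5d18dd", "date:birth"),
    ("d5d18dd", "name:first"), ("d5d18dd", "name:last"),
]

# The four deletes issued by the two deletemany commands.
delete_markers = {
    ("d5d18dd", "date:birth"), ("d5d18dd", "contact:address"),
    ("d5d18dd", "name:first"), ("d5d18dd", "name:last"),
}

survivors = full_compaction(entries, delete_markers)
print(len(survivors))
```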