Elasticsearch - IDs are hard
Sometimes RTFM (read the f****** manual) is really the best solution, but when building quickly and being agile, there's not always time to read every page of the manual.
I learned recently that Elasticsearch (and, coincidentally, Amazon DynamoDB) enforces a limit on document IDs. I discovered this because of generated document IDs used to map DynamoDB documents to Elasticsearch documents. For Elasticsearch, the limit on a document ID is 512 bytes. If you are creating document IDs, make sure you account for this limit.
Specifically, the error encountered was `id is too long, must be no longer than 512 bytes but was: 513`. Taking a look at the Elasticsearch source code on GitHub, or more specifically the IndexRequest.java class, it is fairly clear how this error is generated. An index request validates the document being processed to ensure it conforms to Elasticsearch's internal constraints, and if it does not, it returns a descriptive error for the constraint that was violated.
Note that the limit is measured in bytes, not characters, so multi-byte UTF-8 characters count for more than one. Here are two examples of measuring a string's length in bytes, from languages I use frequently at work, Python and Java:
```python
def get_len_bytes(a_string):
    bytes_of_a_string = bytes(a_string, 'utf-8')
    return len(bytes_of_a_string)
```
```java
// requires: import java.nio.charset.StandardCharsets;
public int getLengthBytes(String aString) {
    // StandardCharsets.UTF_8 avoids the checked
    // UnsupportedEncodingException thrown by getBytes("UTF-8")
    byte[] utf8Bytes = aString.getBytes(StandardCharsets.UTF_8);
    return utf8Bytes.length;
}
```
To handle this, my team and I discussed ways to keep saving these documents even when a generated ID is too long. Some possible solutions we came up with were:
- Make a hash of the document ID (this mostly guarantees unique keys, and writes stay idempotent because the same source document always hashes to the same ID)
- Truncate the document ID (less desirable as it's possible to generate duplicate document IDs)
- Reject documents where the document ID is too long
In the end we decided on the third option. For our use case, anything longer than 512 bytes is uncommon, so we can take this naive approach and push off handling IDs that are too long to some point in the future.
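As a rough sketch of what that rejection looks like in Python, reusing the `get_len_bytes` helper from above; `generate_documents` and `index_document` are hypothetical placeholders for however your pipeline produces and indexes records:

```python
import logging

log = logging.getLogger(__name__)

MAX_ES_ID_BYTES = 512  # Elasticsearch's hard limit on document ID length

def should_index(doc_id):
    # Reject documents whose generated ID would exceed the limit,
    # rather than letting the index request fail downstream.
    return get_len_bytes(doc_id) <= MAX_ES_ID_BYTES

for doc_id, document in generate_documents():    # hypothetical source of (id, doc) pairs
    if should_index(doc_id):
        index_document(doc_id, document)          # placeholder for the real index call
    else:
        log.warning("dropping document, id is %d bytes", get_len_bytes(doc_id))
```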
As noted earlier, we were even finding size limitations with DynamoDB, AWS's managed NoSQL document store. DynamoDB limitations are laid out in Partition & Sort Key Limits.
The exception we encountered in this case was a generic exception thrown by DynamoDB, `ValidationException`. Looking at the exception more closely, the message was similar to `One or more parameter values were invalid: Aggregated size of all range keys has exceeded the size limit of 1024 bytes (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: SOME_AWS_REQUEST)`. Basically, this means that if you are writing a single record, its range key is too long and should be shortened, or the record thrown away.
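For the single-record case, a minimal guard before the write might look like this; it assumes a boto3 `Table` resource, reuses `get_len_bytes`, and the table and attribute names are made up:

```python
import boto3

MAX_RANGE_KEY_BYTES = 1024  # DynamoDB's limit on sort (range) key size

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name

def save_record(record):
    # Skip records whose sort key would trip the 1024-byte limit
    if get_len_bytes(record["range_key"]) > MAX_RANGE_KEY_BYTES:
        return False
    table.put_item(Item=record)
    return True
```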
However, you will most likely see this error when attempting a `batch_write_item` (Python) or `batchWriteItem` (Java). Here, the error means that, given the list of records, the total bytes of all range keys across all records is larger than 1024 bytes, so the request cannot be processed. Oh, and don't forget, a single `batch_write_item` request can only handle up to 25 items anyway, so if you expect similarly sized range keys, you have roughly 40 bytes per range key available.
I came up with a pretty terrible solution to handle both of these cases. First, I built a `chunk` function which takes a list of things and a `chunk_size`, then returns a list of lists where each nested list is at most `chunk_size` long. "What about the 'aggregated size of all range keys' error you talked about...?" For that case, each chunk from the previous function was handed to a `chunk_by_bytes` function. This function is given a list of items to chunk, a field name to chunk by, and a maximum size for the concatenation of all values of that field from the given list. It returns a list of one or more lists where the concatenation of the given field does not exceed the given size. This approach was good enough to resolve the errors we saw, except for the cases where a single record was too large. For those cases, the data is just dropped and logged so it can be reviewed later.
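Here is a rough sketch of how those two functions might look in Python; the record shape, the field name, and the `write_batch` call are assumptions for illustration, not the exact code we shipped:

```python
def chunk(items, chunk_size):
    # Split a list into sublists of at most chunk_size items
    # (e.g. 25, the batch_write_item request limit).
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def chunk_by_bytes(items, field_name, max_bytes):
    # Split a list so that the UTF-8 size of the given field, summed across
    # each sublist, stays at or under max_bytes (e.g. 1024 for range keys).
    # A single item already over the limit still lands in its own sublist;
    # those are the records we drop and log separately.
    chunks, current, current_bytes = [], [], 0
    for item in items:
        field_bytes = get_len_bytes(item[field_name])
        if current and current_bytes + field_bytes > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += field_bytes
    if current:
        chunks.append(current)
    return chunks

# Usage: batches of at most 25 items whose range keys total <= 1024 bytes
for batch in chunk(records, 25):
    for sub_batch in chunk_by_bytes(batch, "range_key", 1024):
        write_batch(sub_batch)  # placeholder for the real batch_write_item call
```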
So this just leads to another RTFM moment. Partition keys (hash keys) can be at most 2048 bytes, while sort keys (range keys) can be at most 1024 bytes. Don't forget that a DynamoDB item can only be as large as 400 kilobytes, which includes the UTF-8 length of your attribute names as well as their values. This is important to keep in mind, especially if you attempt to save entire Google Vision API results to a single DynamoDB record and swallow exceptions without reporting them, as a coworker of mine discovered recently while building a prototype.
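If you want an early warning before a write, even a crude estimate will catch the worst offenders. This is only an approximation of how DynamoDB counts item size (attribute names plus values), not its exact accounting:

```python
import json

MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's 400 KB item size limit

def approx_item_bytes(item):
    # Rough estimate: UTF-8 size of each attribute name plus a JSON-ish
    # serialization of its value. Close enough to flag items that will
    # never fit, not DynamoDB's exact formula.
    total = 0
    for name, value in item.items():
        total += get_len_bytes(name)
        total += get_len_bytes(json.dumps(value, default=str))
    return total

if approx_item_bytes(record) > MAX_ITEM_BYTES:
    print("record too large for DynamoDB: ~%d bytes" % approx_item_bytes(record))
```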
In conclusion
Every technology you touch imposes its own limits on your data. You must make sure your data conforms to those limits, or can be encoded in some way that fits within them. If you take the route of encoding data, make sure it is simple for your application to decode. The downside is that encoding reduces your ability to search or query the encoded data, and it makes records much harder to work with when a human needs to dig into them for debugging. Your use case and your need for debuggability will determine whether the data should be reformatted, rejected, or encoded, and which approach works best for you, your team, and your application. Personally, I would be as transparent as possible and only keep "good" data, in a format that is easy for both machines and humans to read, as well as to mock. Anything else will likely make debugging and observability difficult or impossible.