The ProblemYou’ve just hit 'Run Query' on a standard SQL statement, expecting a clean dataset from S3. Instead, Athena halts and throws a frustrating error:
HIVE_CURSOR_ERROR: Row is not a valid JSON object
This error is deceptive. You might open the file in VS Code, see perfectly highlighted syntax, and wonder what's wrong. The issue usually isn't the JSON syntax itself, but how the data is physically laid out on S3. Athena's default parsers are incredibly strict about file structure.
TL;DR: The Quick FixAthena doesn't read standard JSON files. It requires JSON Lines (ndjson). If your data is pretty-printed (indented over multiple lines) or wrapped in a JSON array (using []), the query will fail immediately.
- Flatten your objects: Every record must occupy exactly one line.- Strip the wrappers: Remove the opening
[and closing]brackets.- Clean the commas: Remove the commas that usually separate objects in a standard array.### Convert Files Instantly with jqIf you havejqinstalled, use this command to transform a pretty-printed file into the compact format Athena demands:
jq -c . input.json > output.jsonl
Why This HappensThe default Hive JSON SerDe (Serializer/Deserializer) is a streaming reader. It does not load your entire 500MB file into memory to parse it as one document. Instead, it scans the file line-by-line, treating every newline character as the start of a new, independent record.
Consider this common snippet:
{
"id": 101,
"status": "active"
}
Athena reads the first line: {. Since a lone opening brace isn't a complete JSON object, the parser panics and kills the query. The same thing happens if your file starts with a [ to denote an array.
Common Scenarios and Fixes### 1. Pretty-Printed JSONThis is the most frequent culprit. While indentation makes logs readable for humans, it breaks Athena. You must flatten these objects into single-line strings.
Incorrect:
{
"user_id": 5001,
"action": "click"
}
{
"user_id": 5002,
"action": "view"
}
Correct (JSON Lines):
{"user_id": 5001, "action": "click"}
{"user_id": 5002, "action": "view"}
2. The Single Malformed RecordIn a dataset with millions of rows, it only takes one missing quote or an unescaped special character to crash a query. I've seen 100GB queries fail at 99% progress because of one stray byte.
Pro Tip: Use a JSON Formatter & Validator to inspect suspicious rows. I keep ToolCraft open to quickly paste and validate problematic lines; it runs locally in your browser, so your sensitive log data never leaves your machine.
3. Graceful Failure with OpenX SerDeIf your data pipeline is messy and you'd rather ignore a few bad rows than fail the whole query, switch to the OpenX SerDe. It includes a specific flag to skip over errors.
Update your CREATE TABLE statement like this:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true'
)
Note that while the query will now finish, any malformed rows will appear as NULL in your results.
Python Script for Bulk FormattingIf you have thousands of files on S3 that need fixing, this Python script will convert standard JSON arrays into the line-delimited format Athena needs:
import json
def repair_athena_json(input_file, output_file):
with open(input_file, 'r') as f:
data = json.load(f)
with open(output_file, 'w') as f:
# Ensure we iterate through arrays or handle single objects
records = data if isinstance(data, list) else [data]
for entry in records:
f.write(json.dumps(entry) + '\n')
# Usage
repair_athena_json('logs_june.json', 'logs_fixed.jsonl')
Final Checklist- Run a Test Query: Use SELECT * FROM table LIMIT 10. If the first few rows load, your basic formatting is likely correct.- Audit for NULLs: If you used the ignore.malformed.json flag, run a count to see how much data you are losing: ```
SELECT count(*) FROM your_table WHERE primary_id IS NULL;
```- Check Encoding: Save your files in UTF-8 without a Byte Order Mark (BOM). Hidden characters at the start of a file can trigger the HIVE_CURSOR_ERROR immediately.## Prevention Strategies- Configure Your Loggers: If you use Python, switch to python-json-logger. For Node.js, use pino. Both support ndjson output by default.- Adopt Parquet: For production environments, JSON is expensive. Convert your data to Parquet using AWS Glue. Parquet is faster to query, cheaper to store, and avoids text-parsing errors entirely.

