Custom sex offenders data harvesting & standardization

Our client approached us with a request to collect data about sex offenders across the USA. The primary purpose of this information was to ensure that the individuals they assess have no history of sexual or violent offenses. It has proved useful in employment screening for creating safer workplaces, as well as in other areas.

At a glance.

Duration: since 2013 (ongoing)

Team: 2 developers, 1 project support specialist

Number of processed records: 700,000+

Number of loading modules: 57

Location: USA

The task at hand was to:

1. Collect the requested data from official state websites across the USA.

2. Transform this diverse data into a uniform format: apply a consistent letter case, remove special symbols, ensure valid encoding, and standardize names, addresses, phone numbers, emails, dates, and other fields.

3. Update the information as per the client's request (recheck the websites and refresh the offenders' database).

The key challenges of the sex offenders project

Diverse data formats.

For instance, one state lists an offender's name as "John A. Doe," while another might use "Doe, John A." Similarly, addresses could be a mixed bag—some in full text, others abbreviated.

Varied search options.

Some websites allow detailed queries by name, address, or offense type, while others have more limited search capabilities.

Website protection.

The websites we dealt with had different levels of security and protection mechanisms, which forced us to apply more sophisticated scraping techniques.

Dynamic nature of data.

Offender information can change or be updated at any moment, so collected data can become incomplete or outdated almost as soon as it is gathered.

Automating aggregation.

Doing this efficiently with free tools posed a significant challenge, as they often lacked the sophistication required for such a nuanced task.

Data validity checks.

We needed to ensure that the scraped data was valid, complete, and reliable.


What we did to collect and post-process sex offenders data

Automated data scraping engine

We built a robust scraping engine capable of automatically harvesting data from all the state websites regularly. It also tests the accuracy and completeness of the data to ensure efficient and cost-effective data mining. Typically, in data scraping, the risk of loading false or incomplete data is high because the entire dataset is only visible post-loading (which could take up to a week). We implemented a 'test load' feature. This allows us to verify in advance that all necessary fields are correctly filled and that the data is genuinely new and relevant before committing to a full load.
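
The sketch below illustrates the idea behind the test load, assuming a hypothetical OffenderRecord type and helper names that are not the actual engine API: scrape a small sample, confirm the required fields are filled and the data is newer than the last successful load, and only then start the full run.

```csharp
// A minimal sketch of a "test load" check. Types and names are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

public record OffenderRecord(string Name, string Address, string OffenseCode, DateTime LastUpdated);

public static class TestLoad
{
    // Returns true when the sampled records look complete and contain something
    // newer than the last successful load, i.e. a full load is worth running.
    public static bool SampleLooksValid(IEnumerable<OffenderRecord> sample, DateTime lastKnownUpdate)
    {
        var records = sample.ToList();
        if (records.Count == 0) return false;

        bool fieldsFilled = records.All(r =>
            !string.IsNullOrWhiteSpace(r.Name) &&
            !string.IsNullOrWhiteSpace(r.Address) &&
            !string.IsNullOrWhiteSpace(r.OffenseCode));

        bool hasNewData = records.Any(r => r.LastUpdated > lastKnownUpdate);

        return fieldsFilled && hasNewData;
    }
}
```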

Automated data loading

We have established a system for automated data loading using AWS cloud services, which reduces the time and resources required to turn raw data into usable information. All the data first lands in Amazon S3, and from there we move it as per the client's request.
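
As an illustration, pushing a scraped CSV file into S3 with the AWS SDK for .NET looks roughly like the sketch below; the bucket name and key layout are placeholders rather than the production configuration.

```csharp
// A minimal sketch of uploading a scraped CSV file to Amazon S3.
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class RawDataUploader
{
    public static async Task UploadCsvAsync(string localPath, string stateCode)
    {
        using var s3 = new AmazonS3Client(); // region and credentials come from the environment

        var request = new PutObjectRequest
        {
            BucketName = "offender-raw-data",                                   // placeholder bucket
            Key = $"raw/{stateCode}/{System.DateTime.UtcNow:yyyy-MM}/main.csv", // placeholder key layout
            FilePath = localPath,
            ContentType = "text/csv"
        };

        await s3.PutObjectAsync(request);
    }
}
```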

Data consolidation and standardization

Given the diverse formats across different state websites, our engine meticulously converts all data into a uniform format. This includes standardizing text, removing unnecessary symbols, and formatting addresses or names consistently.
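
A minimal sketch of this kind of normalization, assuming simplified rules: reorder names given as "Doe, John A.", strip stray symbols, and apply a consistent title case. The real pipeline handles many more cases (addresses, phones, dates), but the shape is the same.

```csharp
// Illustrative name normalization: "DOE, JOHN A." and "john a. doe" both
// become "John A. Doe". Real rules are more involved; this only shows the shape.
using System.Globalization;
using System.Text.RegularExpressions;

public static class Normalizer
{
    public static string NormalizeName(string raw)
    {
        // Drop special symbols, keeping letters, digits, spaces, and common name punctuation.
        var cleaned = Regex.Replace(raw.Trim(), @"[^\p{L}\p{N}\s,.\-']", "");

        // Reorder "Last, First Middle" into "First Middle Last".
        var parts = cleaned.Split(new[] { ',' }, 2);
        if (parts.Length == 2)
            cleaned = $"{parts[1].Trim()} {parts[0].Trim()}";

        // Apply a common letter case.
        return CultureInfo.InvariantCulture.TextInfo.ToTitleCase(cleaned.ToLowerInvariant());
    }
}
```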

Regular updates and maintenance

Our engine is scheduled to refresh the data monthly to keep it current and relevant. Additionally, we continually monitor the performance of our parsers. If a scraper breaks or malfunctions due to changes in a website’s layout or security features, our team promptly addresses and rectifies the issue for uninterrupted data flow.
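
As a sketch, assuming the loader is triggered as a Lambda function, the monthly refresh can be wired up with an EventBridge rule like the one below; the rule name, cron expression, and target ARN are placeholders.

```csharp
// A minimal sketch of scheduling a monthly refresh with AWS EventBridge.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.EventBridge;
using Amazon.EventBridge.Model;

public static class RefreshSchedule
{
    public static async Task CreateMonthlyRuleAsync()
    {
        using var events = new AmazonEventBridgeClient();

        // One rule that fires at 03:00 UTC on the 1st of every month.
        await events.PutRuleAsync(new PutRuleRequest
        {
            Name = "monthly-offender-refresh",
            ScheduleExpression = "cron(0 3 1 * ? *)",
            State = RuleState.ENABLED
        });

        // Point the rule at the loader Lambda (placeholder ARN).
        await events.PutTargetsAsync(new PutTargetsRequest
        {
            Rule = "monthly-offender-refresh",
            Targets = new List<Target>
            {
                new Target
                {
                    Id = "offender-loader",
                    Arn = "arn:aws:lambda:us-east-1:123456789012:function:offender-loader"
                }
            }
        });
    }
}
```

In practice, the target Lambda also needs a resource-based permission that allows events.amazonaws.com to invoke it.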

Technologies

Nannostomus
Serverless computing
Microservices architecture
C#
.NET
AWS EventBridge
AWS SQS
AWS S3
AWS EC2
AWS EFS
AWS ECR
AWS Lambda

Deliverables

The client regularly receives standardized, up-to-date data aggregated from more than 50 state websites. This comprehensive collection covers a vast array of sex offender information from across the United States.

Our scrapers pull publicly available data about sex offenders from state websites into CSV files. We break down raw data into separate folders:

  • Main (names, sex, ethnicity, addresses, phones, emails, work addresses, etc.; an illustrative row shape follows this list)
  • Crime (date of crime and conviction, arresting agency, offense description, sentence, and other details)
  • Mark (scars, marks, and tattoos)
  • Alias (additional names)
  • Vehicle (license plate number, state, year, make, model, color)
  • Photo (photo file names for each offender)
  • Files (photos of offenders)
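
For illustration only, a row of the Main CSV maps onto a simple record like the one below; the property names mirror the columns listed above, not necessarily the exact delivered schema.

```csharp
// An illustrative shape for one row of the "Main" CSV described above.
// Property names mirror the listed columns; the delivered schema may differ.
public record MainRow(
    string Name,
    string Sex,
    string Ethnicity,
    string Address,
    string Phone,
    string Email,
    string WorkAddress
);
```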

We structure and standardize the raw data. The processed data goes into two folders: Offender (data about the convicts) and Offense (data about the nature of the crime). Here is exactly what we do:

  • Break down data that was originally presented in a single line, separating names, addresses, and dates for better accessibility.
  • Validate data: replace incorrect information or remove missing values. For example, if an address is listed as UNKNOWN or a ZIP code as 00000, we fix it (see the sketch after this list).
  • Standardize offense descriptions.
  • Fix source bugs to ensure an uninterrupted data flow.
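
A minimal sketch of this validation pass, with illustrative field names: placeholder values such as UNKNOWN addresses or 00000 ZIP codes are cleared so they are not mistaken for real data.

```csharp
// Illustrative validation: clear placeholder values so they do not
// masquerade as real data. Field names are examples, not the real schema.
public class ProcessedOffender
{
    public string? Address { get; set; }
    public string? Zip { get; set; }
}

public static class Validator
{
    public static void CleanPlaceholders(ProcessedOffender record)
    {
        // Treat "UNKNOWN" as a missing address rather than keep a fake value.
        if (string.Equals(record.Address?.Trim(), "UNKNOWN", System.StringComparison.OrdinalIgnoreCase))
            record.Address = null;

        // 00000 is not a valid ZIP code, so drop it.
        if (record.Zip?.Trim() == "00000")
            record.Zip = null;
    }
}
```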