Custom sex offenders data harvesting & standardization

Our client approached us with a request to collect data about sex offenders across the USA. The primary purpose of this information was to ensure that the individuals they assess have no history of sexual or violent offenses. It has proved useful in employment screening for creating safer workplaces, as well as in other areas.

At a glance.

Duration: since 2013 (ongoing)

Team: 2 developers, 1 project support specialist

Number of processed records: 700,000+

Number of loading modules: 57

Location: USA

The task at hand was to:

1. Collect the requested data from official state websites across the USA.

2. Transform this diverse data into a uniform format: apply a consistent letter case, remove special symbols, ensure valid encoding, and standardize names, addresses, phone numbers, emails, dates, and other fields.

3. Update the information as per the client's request (recheck the websites and refresh the offenders' database).

The key challenges of the sex offenders project

Diverse data formats.

For instance, one state lists an offender's name as "John A. Doe," while another might use "Doe, John A." Similarly, addresses could be a mixed bag—some in full text, others abbreviated.

Varied search options.

Some websites allow detailed queries by name, address, or offense type, while others have more limited search capabilities.

Website protection.

The websites we dealt with had different levels of security and protection mechanisms, which forced us to apply more sophisticated scraping techniques.

Dynamic nature of data.

Offender information can change or be updated at any moment, so collected data can become incomplete or outdated almost as soon as it is gathered.

Automating aggregation.

Doing this efficiently with free tools posed a significant challenge, as they often lacked the sophistication required for such a nuanced task.

Data validity checks.

We needed to ensure that the scraped data was valid, complete, and reliable.


What we did to collect and post-process sex offenders data

Automated data scraping engine

We built a robust scraping engine capable of automatically harvesting data from all the state websites regularly. It also tests the accuracy and completeness of the data to ensure efficient and cost-effective data mining. Typically, in data scraping, the risk of loading false or incomplete data is high because the entire dataset is only visible post-loading (which could take up to a week). We implemented a 'test load' feature. This allows us to verify in advance that all necessary fields are correctly filled and that the data is genuinely new and relevant before committing to a full load.
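
The sketch below illustrates the idea behind the test load, assuming a hypothetical OffenderRecord type and helper names that are not the actual engine API: scrape a small sample, confirm the required fields are filled and the data is newer than the last successful load, and only then start the full run.

```csharp
// A minimal sketch of a "test load" check. Types and names are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

public record OffenderRecord(string Name, string Address, string OffenseCode, DateTime LastUpdated);

public static class TestLoad
{
    // Returns true when the sampled records look complete and contain something
    // newer than the last successful load, i.e. a full load is worth running.
    public static bool SampleLooksValid(IEnumerable<OffenderRecord> sample, DateTime lastKnownUpdate)
    {
        var records = sample.ToList();
        if (records.Count == 0) return false;

        bool fieldsFilled = records.All(r =>
            !string.IsNullOrWhiteSpace(r.Name) &&
            !string.IsNullOrWhiteSpace(r.Address) &&
            !string.IsNullOrWhiteSpace(r.OffenseCode));

        bool hasNewData = records.Any(r => r.LastUpdated > lastKnownUpdate);

        return fieldsFilled && hasNewData;
    }
}
```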

Automated data loading

We have established a system for automated data loading using AWS cloud services, which reduces the time and resources required to turn raw data into usable information. All the data first lands in Amazon S3, and from there we move it as per the client's request.
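
As an illustration, pushing a scraped CSV file into S3 with the AWS SDK for .NET looks roughly like the sketch below; the bucket name and key layout are placeholders rather than the production configuration.

```csharp
// A minimal sketch of uploading a scraped CSV file to Amazon S3.
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static class RawDataUploader
{
    public static async Task UploadCsvAsync(string localPath, string stateCode)
    {
        using var s3 = new AmazonS3Client(); // region and credentials come from the environment

        var request = new PutObjectRequest
        {
            BucketName = "offender-raw-data",                                   // placeholder bucket
            Key = $"raw/{stateCode}/{System.DateTime.UtcNow:yyyy-MM}/main.csv", // placeholder key layout
            FilePath = localPath,
            ContentType = "text/csv"
        };

        await s3.PutObjectAsync(request);
    }
}
```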

Data consolidation and standardization

Given the diverse formats across different state websites, our engine meticulously converts all data into a uniform format. This includes standardizing text, removing unnecessary symbols, and formatting addresses or names consistently.
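
A minimal sketch of this kind of normalization, assuming simplified rules: reorder names given as "Doe, John A.", strip stray symbols, and apply a consistent title case. The real pipeline handles many more cases (addresses, phones, dates), but the shape is the same.

```csharp
// Illustrative name normalization: "DOE, JOHN A." and "john a. doe" both
// become "John A. Doe". Real rules are more involved; this only shows the shape.
using System.Globalization;
using System.Text.RegularExpressions;

public static class Normalizer
{
    public static string NormalizeName(string raw)
    {
        // Drop special symbols, keeping letters, digits, spaces, and common name punctuation.
        var cleaned = Regex.Replace(raw.Trim(), @"[^\p{L}\p{N}\s,.\-']", "");

        // Reorder "Last, First Middle" into "First Middle Last".
        var parts = cleaned.Split(new[] { ',' }, 2);
        if (parts.Length == 2)
            cleaned = $"{parts[1].Trim()} {parts[0].Trim()}";

        // Apply a common letter case.
        return CultureInfo.InvariantCulture.TextInfo.ToTitleCase(cleaned.ToLowerInvariant());
    }
}
```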

Regular updates and maintenance

Our engine is scheduled to refresh the data monthly to keep it current and relevant. Additionally, we continually monitor the performance of our parsers. If a scraper breaks or malfunctions due to changes in a website’s layout or security features, our team promptly addresses and rectifies the issue for uninterrupted data flow.
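
As a sketch, assuming the loader is triggered as a Lambda function, the monthly refresh can be wired up with an EventBridge rule like the one below; the rule name, cron expression, and target ARN are placeholders.

```csharp
// A minimal sketch of scheduling a monthly refresh with AWS EventBridge.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.EventBridge;
using Amazon.EventBridge.Model;

public static class RefreshSchedule
{
    public static async Task CreateMonthlyRuleAsync()
    {
        using var events = new AmazonEventBridgeClient();

        // One rule that fires at 03:00 UTC on the 1st of every month.
        await events.PutRuleAsync(new PutRuleRequest
        {
            Name = "monthly-offender-refresh",
            ScheduleExpression = "cron(0 3 1 * ? *)",
            State = RuleState.ENABLED
        });

        // Point the rule at the loader Lambda (placeholder ARN).
        await events.PutTargetsAsync(new PutTargetsRequest
        {
            Rule = "monthly-offender-refresh",
            Targets = new List<Target>
            {
                new Target
                {
                    Id = "offender-loader",
                    Arn = "arn:aws:lambda:us-east-1:123456789012:function:offender-loader"
                }
            }
        });
    }
}
```

In practice, the target Lambda also needs a resource-based permission that allows events.amazonaws.com to invoke it.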

Technologies

Nannostomus
Serverless computing
Microservices architecture
C#
.NET
AWS EventBridge
AWS SQS
AWS S3
AWS EC2
AWS EFS
AWS ECR
AWS Lambda

Deliverables

The client regularly receives standardized, up-to-date data aggregated from more than 50 state websites. This comprehensive collection covers a vast array of sex offender information from across the United States.

Our scrapers pull publicly available data about sex offenders from state websites into CSV files. We break down raw data into separate folders:

  • Main (names, sex, ethnicity, addresses, phones, emails, work addresses, etc.; an illustrative row shape follows this list)
  • Crime (date of crime and conviction, arresting agency, offense description, sentence, and other details)
  • Mark (scars, marks, and tattoos)
  • Alias (additional names)
  • Vehicle (license plate number, state, year, make, model, color)
  • Photo (photo file names for each offender)
  • Files (photos of offenders)
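
For illustration only, a row of the Main CSV maps onto a simple record like the one below; the property names mirror the columns listed above, not necessarily the exact delivered schema.

```csharp
// An illustrative shape for one row of the "Main" CSV described above.
// Property names mirror the listed columns; the delivered schema may differ.
public record MainRow(
    string Name,
    string Sex,
    string Ethnicity,
    string Address,
    string Phone,
    string Email,
    string WorkAddress
);
```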

We structure and standardize the raw data. The processed data goes into two folders: Offender (data about the convicts) and Offense (data about the nature of the crime). Here is exactly what we do:

  • Break down data that was originally presented in a single line, separating names, addresses, and dates for better accessibility.
  • Validate data: replace incorrect information or remove missing values. For example, if an address is listed as UNKNOWN or a ZIP code as 00000, we fix it (see the sketch after this list).
  • Standardize offense descriptions.
  • Fix source bugs to ensure an uninterrupted data flow.
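
A minimal sketch of this validation pass, with illustrative field names: placeholder values such as UNKNOWN addresses or 00000 ZIP codes are cleared so they are not mistaken for real data.

```csharp
// Illustrative validation: clear placeholder values so they do not
// masquerade as real data. Field names are examples, not the real schema.
public class ProcessedOffender
{
    public string? Address { get; set; }
    public string? Zip { get; set; }
}

public static class Validator
{
    public static void CleanPlaceholders(ProcessedOffender record)
    {
        // Treat "UNKNOWN" as a missing address rather than keep a fake value.
        if (string.Equals(record.Address?.Trim(), "UNKNOWN", System.StringComparison.OrdinalIgnoreCase))
            record.Address = null;

        // 00000 is not a valid ZIP code, so drop it.
        if (record.Zip?.Trim() == "00000")
            record.Zip = null;
    }
}
```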