
7 posts tagged with "DevOps"

Development operations and CI/CD


Crawlab AI: Building Intelligent Web Scrapers with Large Language Models (LLM)

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford

Preface

When I first entered the workforce as a data analyst, I stumbled upon the ability of web crawlers to automatically extract webpage data, and I've been fascinated by this magical technology ever since. As I delved deeper into web scraping, I gradually came to understand its core techniques, including web parsing: analyzing a page's HTML structure to build data extraction rules based on XPath or CSS selectors. This process has long required manual work. While it is relatively simple for scraping engineers, it becomes very time-consuming at scale, and every change in a page's structure drives up crawler maintenance costs. This article introduces my LLM-based intelligent web scraping product, Crawlab AI. Although it is still in early development, it has already shown great potential and promises to make data acquisition easy for data practitioners.
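To make the manual part concrete, here is a minimal sketch of hand-written extraction rules using the parsel library; the HTML fragment and field names are made up for illustration.

```python
from parsel import Selector

# a made-up page fragment standing in for a real product page
html = """
<html><body>
  <h1 class="title">Example Product</h1>
  <span class="price">$9.99</span>
</body></html>
"""

sel = Selector(text=html)

# hand-written extraction rules: one CSS selector and one XPath expression
title = sel.css("h1.title::text").get()
price = sel.xpath("//span[@class='price']/text()").get()

print(title, price)  # Example Product $9.99
```

Every new site, and every structural change on an existing one, means revisiting rules like these, which is exactly the maintenance burden an intelligent scraper aims to remove.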

As the founder of the web scraping management platform Crawlab, I've always been passionate about making data acquisition simple and easy. Through constant communication with data practitioners, I realized the massive demand for intelligent scrapers (or universal scrapers): extracting target data from any website without manually writing parsing rules. Of course, I'm not the only one researching this problem. In January 2020, Qingnan released GeneralNewsExtractor, a universal article parsing library based on punctuation density that can implement a universal news crawler in four lines of code. In July 2020, Cui Qingcai released GerapyAutoExtractor, which extracts list-page data using SVM algorithms. In April 2023, I built Webspot, which automatically extracts list pages using high-dimensional vector clustering. The main problem with these open-source tools is that their recognition accuracy still falls short of manually written crawler rules.
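To give a sense of what "four lines of code" means in practice, here is a sketch using GeneralNewsExtractor; the call shown is based on my recollection of the gne package's API and the URL is a placeholder, so verify against the project's documentation.

```python
import requests
from gne import GeneralNewsExtractor

# fetch an article page (placeholder URL) and let GNE extract its fields automatically
html = requests.get("https://example.com/some-news-article").text
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)  # dict with title, author, publish_time, content, ...
```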

Additionally, commercial scraping products such as Diffbot and Octoparse have implemented some universal data extraction functionality through proprietary machine learning algorithms. Unfortunately, they are relatively expensive: Diffbot's cheapest plan, for example, costs $299 per month.

Exploring Crawlab: Your New Enterprise Web Scraping Management Choice

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In the modern data-driven era, acquiring and managing online information has become crucial. To provide powerful support for enterprises and developers, Crawlab has emerged as an out-of-the-box, enterprise-level web scraping management platform. Whatever your team size, Crawlab can provide a professional and efficient web scraping management solution.

Core Features

Crawlab's core features include distributed system management, spider task management and scheduling, file editing, message notifications, dependency management, Git integration, and performance monitoring, among others. Its distributed node management allows spider programs to run efficiently across multiple servers. No more worrying about manual uploading, monitoring, and deployment hassles - Crawlab automates all of this, ensuring you can easily schedule spider tasks and view spider program running status and task logs in real-time.
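As an illustration of what "no manual uploading and deployment" feels like from the spider's side, here is a rough sketch of a spider that reports results back to Crawlab via the Python SDK; I'm assuming the crawlab-sdk package's save_item helper here, so treat the exact import and call as indicative rather than authoritative.

```python
import requests
from parsel import Selector
from crawlab import save_item  # from the crawlab-sdk package (assumed API)

# a tiny spider: when scheduled by Crawlab, results saved with save_item
# show up in the task's data tab on the platform
resp = requests.get("https://quotes.toscrape.com")
sel = Selector(text=resp.text)
for quote in sel.css("div.quote"):
    save_item({
        "text": quote.css("span.text::text").get(),
        "author": quote.css("small.author::text").get(),
    })
```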

Spider List

Key Highlights

Practical Data Analysis: Open Source Automated Data Exploration Tool Rath

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Exploratory Data Analysis (EDA) is a task that data analysts and data scientists frequently face when working with datasets. Python tools such as Pandas and Seaborn make univariate, bivariate, and multivariate analysis straightforward, but using them for data exploration not only has a certain technical barrier, it also requires manually writing scripts for data manipulation and analysis. This article introduces Rath, a very cool open-source automated data exploration tool that can complete EDA automatically and become the Autopilot or Copilot of the data analysis world.
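For contrast with the automated approach, here is the kind of script you would write by hand with Pandas and Seaborn; the dataset path and column name are placeholders.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")        # placeholder dataset

# univariate analysis: summary statistics and a distribution plot
print(df.describe())
sns.histplot(df["price"])                   # assumes a numeric column named "price"
plt.show()

# bivariate / multivariate analysis: correlations and pairwise relationships
print(df.corr(numeric_only=True))
sns.pairplot(df)
plt.show()
```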

Rath

Installing Rath

Since Rath is still in rapid iteration and its documentation isn't very complete, the fastest way to experience it is through the demo website provided on the official site.

However, if you know some frontend technology, you can still install it locally, though the steps are slightly more cumbersome.

Before starting, ensure you have Node.js 16 and Yarn installed.
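With those prerequisites in place, the local setup follows the usual Yarn-based frontend workflow; the commands below are my best guess rather than official steps, so check the repository's README if they differ.

```bash
# prerequisites: Node.js 16 and Yarn
git clone https://github.com/Kanaries/Rath.git
cd Rath
yarn install    # install workspace dependencies
yarn start      # or whichever start script the repo's package.json defines
```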

Practical Data Analysis: Building a Self-Service Data Analytics Platform with Open Source Superset

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Data Analytics and Business Intelligence are important building blocks in many enterprises' digital strategies. In a previous article, "A Brief Talk on Data: What Should You Master in the Data Field?" (《浅谈数据：数据领域需要掌握些什么？》), we introduced the indispensable parts of the data field, namely the software tools and services that support its architecture and processes. The open-source data analytics platform Apache Superset introduced in this article provides exactly this kind of service. This article will briefly cover how to install, deploy, and use Superset.

Superset Official Site

Superset Introduction

Superset is an open-source self-service data analytics platform incubated by the Apache Foundation. It can be seen as an open-source alternative to Power BI or Tableau, although Superset's interactive interface is web-only. The system is built on Python Flask and integrates with mainstream relational databases such as MySQL, Postgres, and SQL Server, as well as modern databases such as ElasticSearch, ClickHouse, and Snowflake. Its visual analysis frontend is very similar to Power BI and Tableau and is relatively simple to operate. So if you need to build an enterprise-grade data analytics platform like Power BI or Tableau without spending money, Superset is an excellent choice.
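As a taste of how lightweight the setup can be, here is a typical pip-based installation; the exact commands vary somewhat between Superset versions, so treat this as a sketch and follow the official docs for your release.

```bash
# install Superset and initialize its metadata database
pip install apache-superset
export FLASK_APP=superset
superset db upgrade

# create an admin account and load default roles/permissions
superset fab create-admin
superset init

# start the development web server on port 8088
superset run -p 8088 --with-threads --reload --debugger
```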

Superset Dashboard

CI/CD in Action: Manage auto builds of large open-source projects with GitHub Actions?

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In the previous article, CI/CD in Action: How to use Microsoft's GitHub Actions in a right way?, we introduced how to use GitHub Actions workflows with a practical Python project. However, that example was quite simple and not comprehensive enough for large projects.

This article introduces practical CI/CD applications of GitHub Actions in my open-source project Crawlab. For those who are not familiar with Crawlab, you can refer to the official site or documentation. In short, Crawlab is a web crawler management platform for efficient data collection.

Overall CI/CD Architecture

The new version of Crawlab, v0.6, splits general functionality into separate modules, so the whole project consists of a few interdependent sub-projects. For example, the main project crawlab depends on the front-end project crawlab-ui and the back-end project crawlab-core. The benefits are higher decoupling and better maintainability.
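One common way to wire up builds across dependent repositories is GitHub's repository_dispatch event; the sketch below is illustrative, with made-up event and image names, and is not necessarily how Crawlab's actual workflows are configured.

```yaml
# in the main repo: rebuild when a sub-project (e.g. crawlab-ui) reports an update
name: build-on-subproject-update
on:
  repository_dispatch:
    types: [subproject-updated]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker image
        run: docker build -t crawlab:latest .
```

The sub-project's workflow can fire that event by calling the REST endpoint `POST /repos/{owner}/{repo}/dispatches` with a personal access token and a body like `{"event_type": "subproject-updated"}`.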

Below is the diagram of the overall CI/CD architecture.

CI/CD in Action: How to use Microsoft's GitHub Actions in a right way?

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

GitHub Actions is the official CI/CD workflow service provided by GitHub. It aims to make it easy for open-source project contributors to manage operations and maintenance, and to enable open-source communities to embrace cloud-native DevOps. GitHub Actions is integrated into most of my open-source projects, including Crawlab and ArtiPub. As a contributor, I find GitHub Actions not only easy to use but also free (which matters most). I therefore hope this article helps open-source contributors who are unfamiliar with GitHub Actions get real ideas on how to utilize it and make an impact.

Starting from documentation

For those who are not familiar with GitHub Actions, I strongly recommend reading the official documentation first, where you can find the introduction video, a quick start, examples, core concepts, how it works, and more. If you read through the docs, you can easily apply your own CI/CD experience to GitHub DevOps. References for all the code in this article can be found in the official documentation.

GitHub Actions Docs

Ideas

Let's first figure out what we want to implement: using GitHub Actions to run a web crawler that fetches the daily ranking from GitHub Trending.
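A minimal workflow for that idea might look like the sketch below; the script name and schedule are placeholders, not the exact configuration used in the article.

```yaml
name: github-trending-crawler
on:
  schedule:
    - cron: "0 0 * * *"       # run once a day (UTC)
  workflow_dispatch: {}        # allow manual runs from the Actions tab
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python trending_crawler.py   # placeholder crawler script
```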

Talking Testing: the love and hate of Unit Tests

· 4 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"No code is the best way to write secure and reliable applications."--Kelsey Hightower

Many developers have probably heard of Unit Tests, and some have even written them and are familiar with them. However, in a volatile and fast-changing environment, unit tests seem to be in an awkward position: developers know they are useful, but treat them with neglect. "The schedule is tight. What time do we have for unit tests?" Does that sound familiar?

What is Unit Test?

A Unit Test is a small piece of code written by developers to validate whether their own functional code runs as expected. If the test fails, it means the functional code has a problem.
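A minimal example in Python: the functional code and the test that guards it sit side by side, and a test runner such as pytest reports whether the expectation holds.

```python
# functional code under test
def add(a: int, b: int) -> int:
    return a + b

# unit test: validates that the functional code behaves as expected;
# run with `pytest` -- a failing assertion means the functional code is problematic
def test_add():
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
```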

This self-testing approach may look self-deceiving, like taking an exam with the official answers in hand. In testing terminology, this is called a White Box Test. Its counterpart is the Black Box Test, which validates behavior by other means. Unit Tests are White Box Tests, while higher-level testing methods such as Integration Tests, End to End Tests, and UI Tests are all Black Box Tests. Unit Tests only test the code itself.

Testing Pyramid

What are the benefits of unit testing?

Unit testing is a very useful tool in Agile Development. Some agile frameworks, such as eXtreme Programming (XP), require that every feature be covered by unit test cases. My previous article Talking Agile: Are you sure your team is practicing Agile properly mentioned the importance of unit tests.