Blog | Marvin Zhang

On Generative AI Technology: Retrieval-Augmented Generation (RAG)

October 1, 2023 · 4 min read

Software Engineer & Open Source Enthusiast

Introduction

Generative AI applications are now springing up like mushrooms after rain, in overwhelming numbers. Large Language Models (LLMs) became exceptionally popular with the release of ChatGPT and are a typical example of generative AI. However, LLMs have flaws. One significant problem is hallucination: when asked about something unfamiliar, an LLM may fabricate an answer that sounds professional but has no factual basis. To address this problem, many AI-based knowledge Q&A systems adopt Retrieval-Augmented Generation (RAG), which lets an LLM ground its answers in retrieved facts and reduces hallucinations. This article briefly introduces how RAG works in knowledge Q&A systems.

LLMs

To understand RAG, we first need a brief look at LLMs. Trained with a very large number of parameters, LLMs can already complete many impressive NLP tasks, such as Q&A, writing, translation, and code understanding. However, since an LLM's "memory" is frozen at the moment pre-training ended, there will inevitably be knowledge it lacks and questions it cannot answer. For example, at the time of writing, ChatGPT developed by OpenAI cannot answer questions about events after September 2021. Additionally, because of hallucinations, LLMs can appear very imaginative while lacking a factual basis. We can therefore compare an LLM to a knowledgeable, versatile sage with amnesia: able to do many things, but with memories that stop at a certain point in time and no ability to form new ones.

To help this sage achieve high scores in modern exams, what should we do? The answer is RAG.
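To make the idea concrete, here is a minimal sketch of the retrieve-then-generate flow, not a definitive implementation. It assumes a tiny in-memory knowledge base (the DOCUMENTS list is hypothetical) and swaps in a simple bag-of-words cosine similarity for retrieval; real systems typically use embedding models and a vector store instead. The final call to an LLM is left out, and only the augmented prompt is built.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice this would be your own documents.
DOCUMENTS = [
    "The Acme cafeteria opens at 8 am on weekdays.",
    "Acme employees receive 15 days of paid leave per year.",
    "The Acme office Wi-Fi password is rotated every month.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Retrieval step: rank documents by similarity to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Augmentation step: put the retrieved facts in front of the question."""
    context = "\n".join(retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "How many days of paid leave do employees receive?"
prompt = build_prompt(question, DOCUMENTS)
print(prompt)  # this prompt would then be sent to the LLM (generation step)
```

In the sage analogy, the retrieved context plays the role of an open textbook handed to the sage right before the exam.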

Practical Data Science: How to Easily Rank in Kaggle Beginner NLP Competition Using sklearn

June 3, 2023 · 6 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

Kaggle is an online community and data science competition platform for data scientists, machine learning engineers, and data analysts, featuring many prize competitions and datasets. The Kaggle community is well known in the data science field, and many major internet companies publish competitions there with prizes ranging from tens of thousands to millions of dollars. This article describes my recent participation in a Kaggle beginner NLP competition, which offers no cash prize but is a good way to learn NLP-related machine learning.

Kaggle Competition

Competition Overview

This data science competition asks participants to determine whether a given tweet is about a real disaster. The image below shows a tweet containing the keyword "ABLAZE," indicating that the tweet is about a house catching fire.

Disaster Tweet
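As a rough illustration of this kind of binary text classification with sklearn, here is a minimal sketch. The pipeline used in the full post is not shown in this excerpt, so the choice of TF-IDF features with logistic regression is an assumption, and the handful of training tweets below are made-up examples rather than competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy examples (not competition data): 1 = real disaster, 0 = not.
train_texts = [
    "Wildfire spreading fast, residents told to evacuate now",
    "Earthquake damages several buildings downtown",
    "Flood waters rising, highway closed by police",
    "My new mixtape is fire, the crowd was ablaze",
    "This exam was a total disaster lol",
    "Traffic was a nightmare on the way to brunch",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Assumed baseline: TF-IDF bag-of-words features fed into logistic regression.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

new_tweets = [
    "House ablaze on Main Street, firefighters on scene",
    "That concert set the stage ablaze",
]
predictions = model.predict(new_tweets)
print(list(zip(new_tweets, predictions)))
```

With so little training data the predictions are not meaningful; the point is only the shape of the workflow: vectorize the text, fit a classifier, predict a 0/1 label per tweet.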

Practical Data Analysis: Open Source Automated Data Exploration Tool Rath

May 21, 2023 · 5 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

Exploratory Data Analysis (EDA) is a task that data analysts and data scientists frequently perform when facing a new dataset. Python tools like Pandas and Seaborn make univariate, bivariate, and multivariate analysis straightforward, but using them for data exploration has a technical barrier and requires manually writing scripts for data operations and analysis. This article introduces Rath, a very cool open source tool for automated data exploration, which can complete EDA automatically and act as the Autopilot or Copilot of the data analysis world.
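For a sense of what such a hand-written exploration script looks like, here is a minimal sketch using only the Python standard library rather than Pandas or Seaborn. The dataset and column names are hypothetical. It computes a univariate summary for one column and a bivariate Pearson correlation between two columns.

```python
import math
import statistics

# Hypothetical toy dataset; the column names are made up for illustration.
rows = [
    {"height_cm": 160, "weight_kg": 55},
    {"height_cm": 170, "weight_kg": 65},
    {"height_cm": 180, "weight_kg": 75},
    {"height_cm": 175, "weight_kg": 72},
]


def univariate_summary(values):
    """Univariate analysis: basic descriptive statistics for one column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Bivariate analysis: Pearson correlation between two columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


heights = [row["height_cm"] for row in rows]
weights = [row["weight_kg"] for row in rows]

summary = univariate_summary(heights)
correlation = pearson(heights, weights)
print(summary)
print(round(correlation, 3))
```

Every new column or column pair needs another call like these, which is the manual effort an automated tool aims to remove.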

Rath

Installing Rath

Since Rath is still in rapid iteration and its documentation isn't very complete, the fastest way to experience it is through the demo website provided on the official site.

However, if you know some frontend technology, you can still install it locally, though the steps are slightly more cumbersome.

Before starting, ensure you have Node.js 16 and Yarn installed.

Web Crawler in Action: How to use Webspot to implement automatic recognition and data extraction of list web pages

April 8, 2023 · 7 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

Extracting data from list pages is one of the most common web data extraction tasks. For engineers who write web crawlers, generating extraction rules efficiently matters; otherwise, much of their time is spent writing CSS selectors and XPath rules. To address this, this article walks through an example of using the open source tool Webspot to automatically recognize and extract data from list pages.

Webspot

Webspot is an open source project aimed at automating web page data extraction. Currently, it supports recognizing list pages and pagination and extracting crawling rules for them. In addition, it provides a web UI where users can view the recognition results, and an API through which developers can obtain them.

Installing Webspot is quite easy; refer to the official documentation for the installation tutorial with Docker and Docker Compose. Execute the commands below to install and start Webspot.

# clone git repo
git clone https://github.com/crawlab-team/webspot

# enter the cloned repo so docker-compose can find its compose file
cd webspot

# start docker containers
docker-compose up -d

Talking Algorithm: Exploration of Intelligent Web Crawlers

March 25, 2023 · 8 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford

We live in the era of artificial intelligence. Between ChatGPT and the many intelligent applications that have followed, people are seeing a sci-fi world arrive that was almost unimaginable a few years ago. In the field of web crawlers, however, artificial intelligence does not seem to play much of a role yet. Crawlers, as an "ancient" technology, have given rise to many industries over the past 20 years, such as search engines, news aggregation, and data analysis, but we have not yet seen an obvious technological breakthrough: crawler engineers still rely mainly on techniques such as XPath and reverse engineering to obtain web data automatically. With the development of artificial intelligence and machine learning, though, crawler technology could in theory achieve "self-driving". This article introduces the current status and possible future direction of the so-called intelligent crawler (intelligent, automated data extraction technology) from multiple perspectives.

Current Web Crawling Technology

A web crawler is an automated program used to obtain data from the Internet or other computer networks. It typically visits websites automatically and collects, parses, and stores the information it finds there. This information can be structured or unstructured data.

Crawler technology in the traditional sense mainly includes the following modules or systems:

Network request: initiate an HTTP request to a website or web page to obtain data such as HTML;
Web page parsing: parse the HTML into a structured tree, and obtain target data through XPath or CSS selectors;
Data storage: store the parsed structured data, for example in a database or a file;
URL management: manage the list of URLs to be crawled and the list of URLs already crawled, including resolving and requesting URLs for pagination or list pages.

web crawling system
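The four modules above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not a production crawler: FAKE_SITE is a hypothetical in-memory website standing in for real HTTP responses, and Python's built-in html.parser replaces the XPath or CSS selector step so that the example has no third-party dependencies.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website" standing in for real HTTP responses.
FAKE_SITE = {
    "/list?page=1": (
        '<ul><li class="item">Item A</li><li class="item">Item B</li></ul>'
        '<a class="next" href="/list?page=2">Next</a>'
    ),
    "/list?page=2": '<ul><li class="item">Item C</li></ul>',
}


def fetch(url):
    """Network request: a real crawler would issue an HTTP GET here."""
    return FAKE_SITE.get(url, "")


class ListPageParser(HTMLParser):
    """Web page parsing: collect list item texts and next-page links."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.links = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and attrs.get("class") == "next" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def crawl(start_url):
    queue = deque([start_url])  # URL management: URLs still to be crawled
    seen = {start_url}          # URL management: URLs already scheduled
    storage = []                # Data storage: a list instead of a database
    while queue:
        url = queue.popleft()
        parser = ListPageParser()
        parser.feed(fetch(url))
        storage.extend({"url": url, "title": title} for title in parser.items)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return storage


records = crawl("/list?page=1")
print(records)
```

Note how the extraction rules (the li.item and a.next checks) are hard-coded by hand for this one page layout, which is exactly the manual work that intelligent crawlers aim to automate.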

On Theory: Why Graph Theory is Essential Knowledge for All Industries Today

January 18, 2023 · 5 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

"Entities should not be multiplied without necessity" -- Ockham's Razor Principle

Graph theory is a foundational branch of mathematics that has been severely underestimated by the public. It does not study images, pictures, or charts; it is an abstract yet simple mathematical theory. A graph in graph theory is an abstract concept, very similar to a relationship network, consisting of nodes (or vertices) and the relationships, or edges, between them. The concepts of graph theory are very simple: graphs, nodes, and edges. This article briefly introduces the basic concepts of graph theory and its applications in the real world. (Note! This is not a scientific paper, so there won't be boring mathematical formulas. Please enjoy reading.)

graph

Graph Theory Overview

In graph theory, there are three important concepts:

Node: Can be understood as an entity, such as Zhang San, Li Si, Wang Wu in a relationship network;
Edge: Can be understood as relationships between entities, for example, Zhang San and Li Si are husband and wife, Wang Wu is their son;
Graph: Can be understood as the collection of all nodes and edges, such as the happy family composed of Zhang San, Li Si, and Wang Wu.

From these three basic concepts, we can infer relationships between nodes. For example, Li Si's older brother Li Yi would be Wang Wu's uncle, and Wang Wu would be his nephew.
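The family example can be written down directly as a small graph. The sketch below is one possible representation, assuming edges are stored as (source, relation, target) triples; the relation names such as parent_of are made up for illustration. The uncle relationship is never stored and is inferred by following two edges.

```python
# Edges as (source, relation, target) triples; relation names are illustrative.
edges = [
    ("Zhang San", "spouse_of", "Li Si"),
    ("Zhang San", "parent_of", "Wang Wu"),
    ("Li Si", "parent_of", "Wang Wu"),
    ("Li Yi", "older_brother_of", "Li Si"),
]

# Nodes are simply every entity that appears at either end of an edge.
nodes = sorted({name for source, _, target in edges for name in (source, target)})


def parents_of(person):
    """Follow parent_of edges backwards to find a person's parents."""
    return [s for s, relation, t in edges if relation == "parent_of" and t == person]


def uncles_of(person):
    """Infer uncles: anyone who is an older brother of one of the parents."""
    parents = parents_of(person)
    return sorted(
        s for s, relation, t in edges
        if relation == "older_brother_of" and t in parents
    )


print(nodes)
print(uncles_of("Wang Wu"))
```

The graph here is nothing more than the nodes list plus the edges list, and new facts are derived purely by traversing edges.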

Practical Data Analysis: Building a Self-Service Data Analytics Platform with Open Source Superset

January 11, 2023 · 7 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

Data Analytics and Business Intelligence are important business modules for many enterprises implementing digital strategies. In an earlier article, "Talking Data: What Do You Need to Master in the Data Field?", we introduced the indispensable parts of the data field, among them the software tools and supporting services in architecture and processes. Apache Superset, the open source data analysis platform introduced in this article, can provide such services. This article briefly covers how to install, deploy, and use Superset.

Superset Official Site

Superset Introduction

Superset is an open source self-service data analytics platform incubated by the Apache Software Foundation. It can be seen as an open source counterpart to Power BI or Tableau, though Superset's interface is web-only. The system is built on Python Flask and integrates with mainstream relational databases such as MySQL, PostgreSQL, and SQL Server, as well as modern data stores such as Elasticsearch, ClickHouse, and Snowflake. The frontend visualization and analysis interface is very similar to Power BI and Tableau and is relatively simple to operate. Therefore, if you need to build an enterprise-level data analytics platform like Power BI or Tableau without paying license fees, Superset is an excellent choice.

Superset Dashboard

Brief Discussion on Architecture: Why Do We Need to Consider Complexity in Software Projects?

December 9, 2022 · 7 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

Complexity is an eternal challenge in software engineering. As project scale grows, complexity increases exponentially, and if left uncontrolled, it can ultimately lead to project failure.

In the world of software development, complexity is everywhere. From simple "Hello World" programs to large-scale distributed systems, complexity always accompanies our development process. As software architects and technical leaders, understanding the nature of complexity, its sources, and how to manage it is a core skill we must master.

Talking Architecture: What skills should architects have apart from drawing architecture diagrams?

November 14, 2022 · 5 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

"Architecture is about the important stuff... whatever it is." -- Ralph Johnson

Architect is a position that carries authority and respect. When you hear that someone is an architect at a company, do you feel a sense of awe? Architects are usually associated with system design, technical strength, leadership, and influence. It is precisely for this reason that many architect positions in enterprises are held by experienced and skilled senior software engineers. However, the definition of an architect in the software industry is not very clear: cloud service providers such as Amazon and Alibaba Cloud have their own teams of architects, but most of them provide after-sales service to customers under the title "architect"; other architects simply use their rich experience and strong skills to solve technical problems, which is what a senior software engineer can do. These roles are very different from the omniscient architect who designs the architecture diagram.

The concepts related to architects in this article mainly come from Fundamentals of Software Architecture (by Mark Richards and Neal Ford), a book I have been reading recently. This article briefly introduces what a pragmatic architect should do and the skills required.

Fundamentals of Software Architecture

Chief Engineer

First of all, an architect should be the chief engineer of the entire software project, responsible for its overall design, implementation, and quality. Software architects therefore need excellent programming skills, a good understanding of the software development process, and a certain breadth and depth across technical areas. Beyond that, because the architect is the chief person in charge on the technology side, he or she usually needs to think through, from the perspective of the system as a whole, how the various modules interact, whether the division of functional services is reasonable, where the bottleneck of the whole system will be, and so on. These are all technical concerns.

Go Project Source Code Analysis: Schedule Job Library "cron"

November 11, 2022 · 5 min read

Marvin Zhang

Software Engineer & Open Source Enthusiast

Introduction

There are many excellent open-source projects on GitHub, where the code is transparent and available to everyone. As software developers, we can learn a lot from them, including software engineering, unit testing, and coding style conventions. We can even find issues by reading their code and submit pull requests to contribute to the community. Today we are going to dig into the source code of robfig/cron, a popular Golang open-source project on GitHub. It is small and clearly commented, which makes it well suited for new developers learning how to read and analyze source code.

Environment Preparation

First, we are going to fork the project into our personal account. We can do so by clicking the Fork button and entering the project name. After it's forked, we can either clone it to our local machine, or click Create codespace on master on the GitHub repo's home page to create a Codespace. Codespaces is an Azure-based GitHub service that allows developers to develop remotely, and it is now available to individuals. Let's give it a try!

Create Codespace

After clicking, a new page will be opened in the browser, where there is a web interface of VS Code, displaying the directory, code and terminal of the project. See the figure below.
