18 posts tagged with "AI"

Artificial Intelligence and machine learning

Exploring Crawlab: Your New Enterprise Web Scraping Management Choice

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

In today's data-driven era, acquiring and managing online information is crucial. Crawlab is an enterprise-level web scraping management platform that works out of the box, built to support both enterprises and individual developers. Whatever your team size, Crawlab provides a professional, efficient way to manage web scraping.

Core Features

Crawlab's core features include distributed system management, spider task management and scheduling, file editing, message notifications, dependency management, Git integration, and performance monitoring. Its distributed node management lets spider programs run efficiently across multiple servers. There is no need to upload, monitor, and deploy spiders by hand: Crawlab automates all of this, so you can schedule spider tasks and view their running status and task logs in real time.

Spider List

Key Highlights

Unleash Your Reading Potential: Embark on a New Intelligent Reading Experience with SRead

· 3 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

"Reading is an adventure of the mind, and knowledge is the fuel for the soul."

In today's information explosion, reading has become an indispensable part of our lives. However, traditional reading methods often drown us in a sea of information, making it hard to tell which knowledge is genuinely useful. SRead was created to address this problem.

What is SRead?

SRead is an AI-based reading assistant specially designed to enhance your reading experience. It is not just an e-book reader but also your personal reading advisor and assistant.

SRead

Intelligent Q&A: Answers On-Demand

You no longer need to search online or consult other materials when a question interrupts your reading; SRead's Intelligent Q&A feature can instantly answer questions related to the content or topic at hand.

On Generative AI Technology: Retrieval-Augmented Generation (RAG)

· 4 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Generative AI applications are now springing up like mushrooms after rain. Large Language Models (LLMs), which became exceptionally popular with the release of ChatGPT, are a typical example. However, LLMs have flaws. One significant problem is hallucination: when asked about something unfamiliar, an LLM may fabricate an answer that sounds professional but has no factual basis. To address this problem, many AI-based knowledge Q&A systems adopt Retrieval-Augmented Generation (RAG), which grounds the LLM's answers in retrieved facts and reduces hallucinations. This article briefly introduces how RAG works in knowledge Q&A systems.

LLMs

To understand RAG, we first need a brief look at LLMs. Trained with a very large number of parameters, LLMs can already complete many impressive NLP tasks, such as Q&A, writing, translation, and code understanding. However, since an LLM's "memory" is frozen at the moment of pre-training, there will inevitably be knowledge it lacks and questions it cannot answer. For example, ChatGPT, developed by OpenAI, was trained on data up to September 2021 and could not answer questions about later events. Additionally, because of hallucinations, LLMs can appear very imaginative while lacking a factual basis. We can therefore compare an LLM to a knowledgeable and versatile sage with amnesia: capable of many things, but with memories that stop at a certain point in time and no ability to form new ones.

To help this sage achieve high scores in modern exams, what should we do? The answer is RAG.
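As a rough illustration of the idea, the sketch below retrieves the most relevant document for a question and places it in the prompt that would be sent to an LLM. It is a toy under stated assumptions: the `DOCUMENTS` list, the function names, and the prompt wording are all hypothetical, and simple word-overlap cosine similarity stands in for the embedding-based vector search that real RAG systems typically use. No LLM is actually called.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; a real system would index many documents.
DOCUMENTS = [
    "Crawlab is a web scraping management platform.",
    "Apache Superset is an open source data analytics platform.",
    "GitHub Actions is a CI/CD workflow service provided by GitHub.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Retrieval step: return the top_k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Augmentation step: put the retrieved text into the prompt as context."""
    context = "\n".join(retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What is Superset?", DOCUMENTS))
```

The generation step would pass the resulting prompt to an LLM, so the model answers from the retrieved context rather than from its frozen memory alone.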

Practical Data Science: How to Easily Rank in Kaggle Beginner NLP Competition Using sklearn

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Kaggle is an online community and data science competition platform for data scientists, machine learning engineers, and data analysts, featuring many prize competitions and datasets. The Kaggle community is well known in the data science field, and many major internet companies publish competitions there with prizes ranging from tens of thousands to millions of dollars. This article walks through my recent participation in a Kaggle beginner NLP competition, which offers no cash prize but is a good way to learn NLP-related machine learning.

Kaggle Competition

Competition Overview

This data science competition asks participants to determine whether a given tweet is about a real disaster. The image below shows a tweet containing the keyword "ABLAZE," indicating that the tweet is about a house catching fire.

Disaster Tweet

On Theory: Why Graph Theory is Essential Knowledge for All Industries Today

· 5 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

"Entities should not be multiplied without necessity" -- Ockham's Razor Principle

Graph theory is a foundational branch of mathematics that the public has severely underestimated. It does not study images, pictures, or charts; the graph in graph theory is an abstract concept, very similar to a relationship network, made up of nodes (or vertices) and the edges, or relationships, that connect them. The core concepts are very simple: graphs, nodes, and edges. This article briefly introduces the basic concepts of graph theory and its applications in the real world. (Note: this is not a scientific paper, so there won't be boring mathematical formulas. Please enjoy reading.)

graph

Graph Theory Overview

In graph theory, there are three important concepts:

  1. Node: Can be understood as an entity, such as Zhang San, Li Si, Wang Wu in a relationship network;
  2. Edge: Can be understood as relationships between entities, for example, Zhang San and Li Si are husband and wife, Wang Wu is their son;
  3. Graph: Can be understood as the collection of all nodes and edges, such as the happy family composed of Zhang San, Li Si, and Wang Wu.

From these three basic concepts, we can infer relationships between nodes. For example, Li Si's older brother Li Yi would be Wang Wu's uncle, and Wang Wu would be his nephew.
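The family example can be sketched in a few lines of Python. This is a minimal illustration under assumptions rather than a standard representation: the relation labels (`spouse`, `parent`, `sibling`) and the helper names are hypothetical choices made for this sketch. Nodes are people, edges are labeled relationships, and the uncle relationship is inferred by following a parent edge and then a sibling edge.

```python
# Edges as (source, relation, target) triples; the labels are illustrative.
EDGES = [
    ("Zhang San", "spouse", "Li Si"),
    ("Li Si", "spouse", "Zhang San"),
    ("Zhang San", "parent", "Wang Wu"),
    ("Li Si", "parent", "Wang Wu"),
    ("Li Yi", "sibling", "Li Si"),
    ("Li Si", "sibling", "Li Yi"),
]


def build_graph(edges):
    """Build a nested mapping: node -> relation -> set of neighbor nodes."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, {}).setdefault(relation, set()).add(target)
    return graph


def parents_of(graph, person):
    """Return every node that has a 'parent' edge pointing at person."""
    return {
        node
        for node, relations in graph.items()
        if person in relations.get("parent", set())
    }


def parents_siblings(graph, person):
    """Infer uncles and aunts: the siblings of the person's parents."""
    result = set()
    for parent in parents_of(graph, person):
        result |= graph.get(parent, {}).get("sibling", set())
    return result


if __name__ == "__main__":
    family = build_graph(EDGES)
    print(parents_siblings(family, "Wang Wu"))  # {'Li Yi'}
```

The relationship between Li Yi and Wang Wu is never stored directly; it falls out of traversing two edges, which is the kind of inference the paragraph above describes.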

Practical Data Analysis: Building a Self-Service Data Analytics Platform with Open Source Superset

· 7 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

Data Analytics and Business Intelligence are important business functions for many enterprises implementing digital strategies. In an earlier article, "On Data: What Do You Need to Master in the Data Field?", we introduced the supporting software tools and services that are an indispensable part of data architecture and processes. Apache Superset, the open source data analysis platform covered in this article, provides such services. This article briefly introduces how to install, deploy, and use Superset.

Superset Official Site

Superset Introduction

Superset is an open source self-service data analytics platform incubated by the Apache Software Foundation. It can be seen as an open source alternative to Power BI or Tableau, though Superset's interactive interface is limited to the web. The system is built on Python Flask and integrates with mainstream relational databases such as MySQL, Postgres, and SQL Server, as well as modern data stores such as Elasticsearch, ClickHouse, and Snowflake. The frontend visualization interface is very similar to Power BI and Tableau and is relatively simple to operate. Therefore, if you need to build an enterprise-level data analytics platform like Power BI or Tableau without paying license fees, Superset is an excellent choice.

Superset Dashboard

CI/CD in Action: How to Use Microsoft's GitHub Actions the Right Way

· 6 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

GitHub Actions is the official CI/CD workflow service provided by GitHub. It aims to make operations and maintenance easy for open-source project contributors and to help open-source communities embrace cloud-native DevOps. GitHub Actions is integrated into most of my open-source projects, including Crawlab and ArtiPub. As a contributor, I find GitHub Actions not only easy to use but also free, which matters most. I hope this article gives open-source contributors who are not yet familiar with GitHub Actions concrete ideas on how to use it and make an impact.

Starting from documentation

If you are not familiar with GitHub Actions, I strongly recommend reading the official documentation first, where you can find an introduction video, a quick start guide, examples, core concepts, and an explanation of how it works. After reading through the docs, you can readily apply your existing CI/CD experience to DevOps on GitHub. All the code referenced in this article can be found in the official documentation.

GitHub Actions Docs

Ideas

Let's first decide what we want to implement: using GitHub Actions to run a web crawler that fetches the daily rankings from GitHub Trending.

Beyond Gantt Charts: What Software Project Management Knowledge You Should Know

· 17 min read
Marvin Zhang
Software Engineer & Open Source Enthusiast

Introduction

A bad plan is better than no plan. -- Peter Thiel, Zero to One

In software engineering, there are rarely lone wolf programmers. Modern software projects are usually very complex and require substantial people, resources, and time. Having a single developer complete a large software project alone would be like "an old man moving mountains." Software development is therefore inseparable from team collaboration and project management.

Project management, simply put, is a methodology for organizing, planning, executing, and completing the tasks in a project in an orderly way. Its actual scope goes far beyond this and usually includes resource allocation, priority setting, and progress tracking. It is a product of the Industrial Revolution and a branch of modern management science that can significantly improve the efficiency and success rate of engineering work.

This article mainly discusses software project management, which differs greatly from traditional project management in fields such as construction and mechanical engineering. Early IT project management borrowed its methodologies from those fields; they played an important role in the early information age and significantly improved software development and collaboration efficiency. However, as the IT industry developed rapidly, consumer demands began to change quickly and market conditions became increasingly volatile, and traditional software project management models could no longer meet the needs of software development. Modern models such as Agile Development emerged in response and became the preferred choice for many internet companies.

What are the drawbacks of traditional project management models (such as waterfall)? What improvements do modern project management models (such as agile) offer? Should we completely abandon waterfall models and fully embrace agile development? As a programmer, should you master some project management knowledge and related tools? As a team leader, how should you establish project management processes to ensure development efficiency and quality? If readers have similar questions, this article will provide detailed analysis and answers.