Case Study: Resource-Aware Batch Web Crawler
A smart, safe, self-regulating crawler that adapts its crawl depth to the available system resources.
This case study shows how Ascoos OS can be used to build a real, scalable web crawler that does not overwhelm the server, even when it runs on shared hosting or a VPS with limited resources. The script monitors CPU and RAM in real time, switches to "light mode" when needed, and saves reports under quota control, all with just 3 kernel classes.
Goals
- Smart management of system resources during crawling
- Automatic switching between full and light mode
- Safe file writing with quota (100 MB default)
- Generation of JSON reports for Web5 indexing / AI training
- Zero memory leaks (via `Free()` calls)
Main Ascoos OS classes used
| Class | Role |
|------------------------------|-----------------------------------------------------------------------|
| TCoreSystemHandler | Monitors CPU load & memory usage in real time |
| TWebsiteHandler | Availability check, load-time analysis, HTML retrieval, keyword extraction |
| TFilesHandler | Folder creation, quota check, safe writing of JSON reports |
| $utf8 (global helper) | Safe UTF-8 substring for excerpts |
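The "safe UTF-8 substring" role in the table can be illustrated with plain PHP. This is a minimal sketch, not the `$utf8` helper itself: the function name `utf8Excerpt()` is hypothetical, and it relies only on PCRE's `u` modifier, which makes the pattern operate on whole code points instead of bytes.

```php
<?php
/**
 * Hypothetical helper (not part of Ascoos OS): return at most $maxChars
 * Unicode code points from the start of $text, never cutting a multi-byte
 * character in half.
 */
function utf8Excerpt(string $text, int $maxChars): string
{
    if ($maxChars <= 0) {
        return '';
    }
    // PCRE limits a counted repetition to 65535, so clamp the length.
    $maxChars = min($maxChars, 65535);

    // "u" makes "." match whole code points; "s" lets it match newlines too.
    if (preg_match('/^.{0,' . $maxChars . '}/us', $text, $match) === 1) {
        return $match[0];
    }
    // preg_match() returns false on invalid UTF-8; give back an empty excerpt.
    return '';
}

echo utf8Excerpt('Καλημέρα κόσμε', 8), PHP_EOL; // prints the first 8 characters: Καλημέρα
```

A byte-based `substr()` with the same length would return only the first four Greek letters here, and with an odd length it would split a character, which is the failure a UTF-8-aware helper avoids.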
File structure

    examples/
    └── case-studies/
        └── websites/
            └── resource_aware_batch_crawler/
                └── resource_aware_batch_crawler.php   ← the file for this case study

Direct link:
https://github.com/ascoos/os/blob/main/examples/case-studies/websites/resource_aware_batch_crawler/resource_aware_batch_crawler.php
Requirements
- PHP ≥ 8.2
- Ascoos OS 26 or AWES 26 installed (https://awes.ascoos.com)
How it works (step by step)
- Thresholds are set: CPU 70 % and memory 80 %.
- For each URL in the list:
  - The current CPU + RAM load is measured
  - If either threshold is exceeded → light mode is activated
- Always executed:
  - `checkAvailability()`
  - `analyzeLoadTime()`
- Executed only in normal mode:
  - `getHTMLContent()`
  - `extractKeywords()` (for sentiment/analysis)
- The report is saved as JSON with a timestamp
- Cleanup with `Free()` for zero memory leaks
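The control flow in the steps above can be sketched without any Ascoos OS classes. Everything named here (`shouldUseLightMode()`, `crawlBatch()`, and the three callables) is hypothetical and only stands in for the real measurements and handlers; the 70 % and 80 % thresholds are the ones listed in the steps.

```php
<?php
// Thresholds from the steps above (percent).
const CPU_THRESHOLD = 70.0;
const MEM_THRESHOLD = 80.0;

/** True when either load figure is above its threshold. */
function shouldUseLightMode(float $cpuLoad, float $memLoad): bool
{
    return $cpuLoad > CPU_THRESHOLD || $memLoad > MEM_THRESHOLD;
}

/**
 * Hypothetical driver: $probe returns [cpu %, mem %]; $basic and $full are
 * stand-ins for the availability/load-time checks and the HTML/keyword steps.
 */
function crawlBatch(array $urls, callable $probe, callable $basic, callable $full): array
{
    $reports = [];
    foreach ($urls as $url) {
        [$cpu, $mem] = $probe();          // re-measured for every URL
        $light = shouldUseLightMode($cpu, $mem);
        $entry = ['url' => $url, 'cpu_load' => $cpu, 'mem_load' => $mem, 'light_mode' => $light];
        $entry += $basic($url);           // always executed
        if (!$light) {
            $entry += $full($url);        // only in normal mode
        }
        $reports[] = $entry;
    }
    return $reports;
}

// Usage with canned values instead of real measurements or network calls.
$loads = [[34.5, 62.1], [88.2, 91.4]];
$i = 0;
$reports = crawlBatch(
    ['https://ascoos.com', 'https://example.com'],
    function () use (&$loads, &$i) { return $loads[$i++]; },
    fn (string $url) => ['availability' => true],
    fn (string $url) => ['keywords' => []]
);
```

Passing the measurement and the handlers in as callables keeps the decision logic testable without touching the network or the host's load figures.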
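Code example (excerpts)

    // Sample the current load; both values are compared against percentage thresholds.
    $cpuLoad = $system->get_cpu_load(0);
    $memLoad = $system->get_memory_stats()['percent'];

    // Light mode is used as soon as either threshold is exceeded.
    $lightMode = $cpuLoad > 70 || $memLoad > 80;

    if (!$lightMode) {
        // Normal mode: full HTML retrieval and keyword extraction.
        $content  = $website->getHTMLContent($url);
        $keywords = $website->extractKeywords($url);
    } else {
        // Light mode: keep only the basic load-time measurement.
        $content = ['light_mode' => true, 'basic' => $loadTime];
    }

    // Write the collected reports as JSON through the quota-checked writer.
    $files->writeToFileWithCheck(
        json_encode($reports, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE),
        $reportFile
    );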
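To make the idea of a quota-checked write concrete, here is a sketch in plain PHP. It is an assumption about what such a check could look like, not the implementation of `writeToFileWithCheck()`: the names `directorySize()` and `writeWithQuota()` are hypothetical, and only the 100 MB default comes from this document.

```php
<?php
/** Sum of the sizes (in bytes) of the regular files directly inside $dir. */
function directorySize(string $dir): int
{
    clearstatcache();                     // avoid stale cached file sizes
    $total = 0;
    foreach (scandir($dir) ?: [] as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (is_file($path)) {
            $total += filesize($path);
        }
    }
    return $total;
}

/**
 * Hypothetical stand-in for a quota-checked write: refuses to write when the
 * folder would grow past $quotaBytes. Returns true on success.
 */
function writeWithQuota(
    string $dir,
    string $fileName,
    string $payload,
    int $quotaBytes = 100 * 1024 * 1024   // 100 MB default, as in the goals list
): bool {
    if (!is_dir($dir) && !mkdir($dir, 0775, true)) {
        return false;                     // folder could not be created
    }
    if (directorySize($dir) + strlen($payload) > $quotaBytes) {
        return false;                     // writing would exceed the quota
    }
    $path = $dir . DIRECTORY_SEPARATOR . $fileName;
    // LOCK_EX avoids interleaved writes if two crawls finish at the same time.
    return file_put_contents($path, $payload, LOCK_EX) === strlen($payload);
}
```

The check is deliberately conservative: overwriting an existing file still counts both the old and the new size against the quota, and subfolders are not included in the total.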
Expected result (example output)

    [
      {
        "url": "https://ascoos.com",
        "cpu_load": 34.5,
        "mem_load": 62.1,
        "light_mode": false,
        "availability": true,
        "load_time": 0.842,
        "content_excerpt": "<!DOCTYPE html><html lang=\"el\">..."
      },
      {
        "url": "https://example.com",
        "cpu_load": 88.2,
        "mem_load": 91.4,
        "light_mode": true,
        "availability": true,
        "load_time": 0.317,
        "content_excerpt": { "light_mode": true, "basic": 0.317 }
      }
    ]

→ Saved as batch_crawl_20251127_142310.json in /tmp/crawl_reports/
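The file name in the example suggests a `YYYYMMDD_HHMMSS` timestamp. The sketch below shows one way to build such a path; `reportPath()` is a hypothetical helper, and the date format is inferred from the example file name rather than taken from the script.

```php
<?php
/**
 * Hypothetical helper: build the report path from a timestamp. The
 * "batch_crawl_" prefix and the folder come from the example above; the
 * date format is inferred from the example file name.
 */
function reportPath(DateTimeInterface $now, string $dir = '/tmp/crawl_reports'): string
{
    // "Ymd_His" renders as YYYYMMDD_HHMMSS, e.g. 20251127_142310.
    return rtrim($dir, '/') . '/batch_crawl_' . $now->format('Ymd_His') . '.json';
}

$path = reportPath(new DateTimeImmutable('2025-11-27 14:23:10'));
echo $path, PHP_EOL; // /tmp/crawl_reports/batch_crawl_20251127_142310.json
```

Because the timestamp has one-second resolution, two batches finishing within the same second would map to the same file name.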
Resources
- Ascoos OS documentation → https://docs.ascoos.com/os (under construction)
- Official website → https://os.ascoos.com (under construction)
- AWES Studio (online IDE) → https://awes.ascoos.com
- GitHub Repository → https://github.com/ascoos/os
Contributing
You can:
- Add more checks (e.g. disk I/O)
- Add multithreaded execution with TThreadHandler
- Integrate TNeuralNetworkHandler for semantic scoring of URLs
Guidelines → CONTRIBUTING-GR.md
License
Ascoos General License (AGL), see LICENSE-GR.md