Complete Tutorial: Scraping Image Captions from Tensor.Art

The goal of this tutorial is to automatically grab all the captions from your image dataset on Tensor.Art and save them into individual .txt files for each image, ready to be used for LoRA training.

This process is divided into two main parts:
Part 1: Extracting all unique captions from the web page into a single text file using JavaScript.
Part 2: Splitting that single text file into many separate .txt files using PowerShell.

Part 1: Extracting All Captions from the Website

In this section, we will copy all unique captions from the web page to your clipboard.

Step 1: Prepare the Web Page

Open your Chrome browser and navigate to your Tensor.Art dataset page containing the images.
CRUCIAL STEP: Slowly scroll down the page until ALL of the images in your dataset (e.g., all 63 images) have appeared and loaded on the screen. If you don't do this, the script will only capture captions from the visible images.

Step 2: Open the Developer Tools Console

Once all images are loaded, press the F12 key on your keyboard to open the Developer Tools.
In the Developer Tools window that appears, click on the "Console" tab.

Step 3: Run the JavaScript Script

Copy the entire code block below:

// 1. Grab ALL <p> elements inside the caption divs.
const allCaptionPTags = document.querySelectorAll('.train-model-assets-image-tags p');
// 2. Create an empty array to hold the texts.
let duplicatedCaptionsList = [];
// 3. Loop through each element, CLEAN the text, then add it to the list.
allCaptionPTags.forEach(pTag => {
  // Get the raw text
  const rawText = pTag.innerText;
  // CLEAN THE TEXT: Replace all sequences of whitespace with a single space,
  // and then remove leading/trailing spaces.
  const cleanedText = rawText.replace(/\s+/g, ' ').trim();
  // Push the cleaned text into the list.
  duplicatedCaptionsList.push(cleanedText);
});
// 4. Create a 'Set' from the list of cleaned text to automatically remove duplicates.
const uniqueCaptions = [...new Set(duplicatedCaptionsList)];
// 5. Join the unique captions into one large text block, separated by new lines.
const finalText = uniqueCaptions.join('\n');
// 6. Copy the result directly to the clipboard.
copy(finalText);
// 7. Display a confirmation message with the correct count.
console.log(`Total cleanup successful! Exactly ${uniqueCaptions.length} unique captions have been copied to your clipboard.`);

Return to the Console window in your browser, then paste the code. Recent versions of Chrome may first show a warning asking you to type "allow pasting" before the paste is accepted.
Press Enter.
You will see a confirmation message in the console stating the number of unique captions that were successfully copied, for example: Total cleanup successful! Exactly 63 unique captions have been copied to your clipboard.
If the number is lower than the number of images in your dataset, either some images had not finished loading or two images share an identical caption, because the script removes duplicates.

Step 4: Save the Results to a Text File

Create a new folder on your computer to store your dataset. For example: D:\LoraTraining.
Open the Notepad application.
Press Ctrl + V to paste all the copied captions.
Click File > Save As....
Navigate to the folder you just created (e.g., D:\LoraTraining).
Save the file with a name such as caption.txt.

You now have a single file containing all unique captions, each on a new line.

Part 2: Splitting the caption.txt File into Individual Files

In this section, we will use PowerShell (a built-in tool in Windows) to automatically create one .txt file for each line of text in caption.txt.

Step 1: Open PowerShell in the Working Folder

Open the folder where you saved caption.txt (e.g., D:\LoraTraining).
Inside the folder (not on a file), hold down the Shift key on your keyboard and right-click on an empty space.
Select the "Open PowerShell window here" or "Open in Terminal" option from the context menu.

Step 2: Run the PowerShell Script

A blue (PowerShell) or black (Terminal) window will appear. Copy the entire code block below:

# 1. Define the input file name and the output file format
$inputFile = "caption.txt" # Customize with your file name.
$outputPrefix = "image" # The result will be image_1.txt, image_2.txt, etc.
# 2. Read all lines from the caption.txt file
$captions = Get-Content $inputFile
# 3. Create a counter
$i = 1
# 4. Loop through each caption line
foreach ($line in $captions) {
    # Make sure the line is not empty
    if ($line.Trim() -ne "") {
        # Create the new file name, e.g., image_1.txt
        $outputFile = "${outputPrefix}_${i}.txt"
        # Write the line's content to the new file
        Set-Content -Path $outputFile -Value $line
        # Increment the counter
        $i++
    }
}
# 5. Display a completion message
Write-Host "Done! Successfully created $($i-1) .txt files."

Paste the code into the PowerShell window.
Press Enter.

Step 3: Verify the Result

Your D:\LoraTraining folder will now contain the new files: image_1.txt, image_2.txt, image_3.txt, and so on, up to image_63.txt for a 63-image dataset. Each of these files contains its corresponding single-line caption.
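
To see what the two scripts do to the text without touching the website, the clean, de-duplicate, and name steps can be sketched as plain JavaScript functions that run in Node.js or any console. This is only an illustration under stated assumptions: the function names (cleanCaption, uniqueCaptions, planCaptionFiles) and the sample captions are made up for this example, and the sketch drops empty captions in the first stage, whereas the tutorial skips them in the PowerShell step.

```javascript
// Illustrative sketch only: these helper names are hypothetical, not part of Tensor.Art.

// Collapse every run of whitespace into one space and trim the ends (same regex as Part 1).
function cleanCaption(rawText) {
  return rawText.replace(/\s+/g, ' ').trim();
}

// Clean each caption, drop empty ones, and remove exact duplicates.
// A Set keeps insertion order, so the first occurrence of each caption wins.
function uniqueCaptions(rawTexts) {
  const cleaned = rawTexts.map(cleanCaption).filter(text => text !== '');
  return [...new Set(cleaned)];
}

// Pair each caption with the file name the Part 2 loop would give it.
// `prefix` plays the role of $outputPrefix in the PowerShell script.
function planCaptionFiles(captions, prefix = 'image') {
  return captions.map((caption, index) => ({
    fileName: `${prefix}_${index + 1}.txt`,
    content: caption,
  }));
}

// Made-up sample data: the first two entries differ only in whitespace.
const sample = [
  '  1girl,   red hair\n',
  '1girl, red hair',
  'landscape,\tmountains',
];

const plan = planCaptionFiles(uniqueCaptions(sample));
console.log(plan);
// [
//   { fileName: 'image_1.txt', content: '1girl, red hair' },
//   { fileName: 'image_2.txt', content: 'landscape, mountains' }
// ]
```

The sample shows two things worth checking in your own results. Captions that differ only in whitespace are merged into one, so the final count can be lower than the number of images. The numbering also follows the order of the captions on the page, not the image file names, so if your trainer pairs captions with images by file name you may need to rename the .txt files to match.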